Spark – Can we read multiple text files into a single RDD?


In this post, we are going to discuss reading single and multiple text and CSV files into a Spark RDD.

Spark Core provides the textFile() and wholeTextFiles() methods in the SparkContext class, which are used to read single or multiple text or CSV files into a single Spark RDD. Using these methods we can also read all files from a directory and files matching a specific pattern.

textFile() – Reads single or multiple text or CSV files and returns a single Spark RDD[String].

wholeTextFiles() – Reads single or multiple files and returns a single RDD[Tuple2[String, String]], where the first value (_1) in each tuple is the file name and the second value (_2) is the content of the file.

In this article, let’s see some examples of both of these methods; the sketches below use Scala.

  • Read all text files from a directory into a single RDD
  • Read multiple text files into a single RDD
  • Read all text files matching a pattern into a single RDD
  • Read files from multiple directories into a single RDD
  • Reading text files from nested directories into a single RDD
  • Reading all text files separately and using union to create a single RDD
  • Reading multiple CSV files into an RDD

Before we start, let’s assume we have the following file names and file contents in the folder “c:/tmp/files”; these files are used to demonstrate the examples.

File Name      File Contents
text01.txt     One,1
text02.txt     Two,2
text03.txt     Three,3
text04.txt     Four,4
invalid.txt    Invalid,I

Spark Read all text files from a directory into a single RDD

In Spark, passing the path of a directory to the textFile() method reads all text files in it and creates a single RDD. Make sure the directory does not contain nested directories; if Spark finds one, the process fails with an error.

This example reads all files from a directory, creates a single RDD and prints the contents of the RDD.
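A minimal Scala sketch, assuming a local SparkSession (the builder settings below are placeholders; adjust them for your environment):

import org.apache.spark.sql.SparkSession

// Assumes a local run; adjust master and appName as needed.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("ReadTextFiles")
  .getOrCreate()

// Read every text file under c:/tmp/files into a single RDD[String].
val rdd = spark.sparkContext.textFile("c:/tmp/files")
rdd.foreach(line => println(line))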

If you are running on a cluster, you should first collect the data to the driver in order to print it on the console, as shown below.
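A sketch, reusing the rdd from the previous example:

// collect() brings all partitions back to the driver, then prints them locally.
rdd.collect().foreach(println)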

Let’s see a similar example with the wholeTextFiles() method. Note that it returns an RDD[Tuple2], where the first value (_1) in each tuple is the file name and the second value (_2) is the content of the file.
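A sketch under the same assumptions:

val rddWhole = spark.sparkContext.wholeTextFiles("c:/tmp/files")
// Each record is a (file path, file contents) pair.
rddWhole.collect().foreach(record => println(record._1 + " => " + record._2))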

Each record prints as the file path followed by that file’s contents (for example, text01.txt’s record contains One,1).

Spark Read multiple text files into a single RDD

When you know the names of the files you would like to read, just pass all the file names separated by commas to create a single RDD.
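A sketch with two of the sample files:

val rdd2 = spark.sparkContext.textFile("c:/tmp/files/text01.txt,c:/tmp/files/text02.txt")
rdd2.collect().foreach(println)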

This reads the text01.txt and text02.txt files, so the resulting RDD contains the lines One,1 and Two,2.

Read all text files matching a pattern into a single RDD

The textFile() method also accepts pattern matching and wildcard characters. For example, the snippet below reads all files whose names start with text and end with the extension “.txt”, and creates a single RDD.
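A sketch using a glob pattern:

val rdd3 = spark.sparkContext.textFile("c:/tmp/files/text*.txt")
rdd3.collect().foreach(println)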

With the sample files, the pattern matches text01.txt through text04.txt but skips invalid.txt, so the RDD contains the lines One,1 through Four,4.

Read files from multiple directories into a single RDD

textFile() also supports reading a combination of individual files and multiple directories.
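A sketch; c:/tmp/dir1 and c:/tmp/dir2 are hypothetical extra directories used only for illustration:

// Mix two directories and one specific file in a single comma-separated path.
val rdd4 = spark.sparkContext.textFile("c:/tmp/dir1,c:/tmp/dir2,c:/tmp/files/text01.txt")
rdd4.collect().foreach(println)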

The resulting RDD contains the lines from every file under the listed paths.

Reading text files from nested directories into a single RDD

textFile() and wholeTextFiles() return an error when they find a nested folder. Hence, first build a list of file paths by traversing all nested folders (using Scala, Java, Python, or another language) and pass all the file names, separated by commas, to create a single RDD, as sketched below.
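A Scala sketch; listFiles is a hypothetical helper, not part of the Spark API:

import java.io.File

// Recursively collect every regular file under dir.
def listFiles(dir: File): Seq[File] = {
  val entries = Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
  val (dirs, files) = entries.partition(_.isDirectory)
  files ++ dirs.flatMap(listFiles)
}

// Join all discovered paths with commas and read them as one RDD.
val allPaths = listFiles(new File("c:/tmp/files")).map(_.getPath).mkString(",")
val rddNested = spark.sparkContext.textFile(allPaths)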

Reading all text files separately and using union to create a single RDD

You can also read all text files into separate RDDs and union all of them to create a single RDD, as sketched below.
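A sketch with two of the sample files:

val rddA = spark.sparkContext.textFile("c:/tmp/files/text01.txt")
val rddB = spark.sparkContext.textFile("c:/tmp/files/text02.txt")
// union() concatenates the RDDs without removing duplicates.
val rddUnion = rddA.union(rddB)
rddUnion.collect().foreach(println)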

Reading multiple CSV files into an RDD

Spark RDDs don’t have a method to read CSV files directly, hence we use the textFile() method to read a CSV file like any other text file and split each record on a comma, pipe, or any other delimiter.

Here, we read all CSV files in a directory into an RDD and apply a map transformation to split each record on the comma delimiter; map returns another RDD, rdd6, after the transformation. Finally, we iterate over rdd6 and read the columns by index.
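A sketch under the same assumptions (rdd5 and rdd6 are illustrative names):

val rdd5 = spark.sparkContext.textFile("c:/tmp/files/*")
// Split each record on the comma delimiter; rdd6 is an RDD[Array[String]].
val rdd6 = rdd5.map(line => line.split(","))
// Read the columns by index; with the sample files one record prints One / 1.
rdd6.collect().foreach(cols => println(cols(0) + " / " + cols(1)))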

Note: You can’t update an RDD since RDDs are immutable. With the sample files, this example prints each record’s column values, such as One and 1.

Complete code
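A consolidated sketch of the examples above, under the same assumptions (local run, sample files in c:/tmp/files):

import org.apache.spark.sql.SparkSession

object ReadMultipleFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ReadMultipleFiles")
      .getOrCreate()
    val sc = spark.sparkContext

    // All files in a directory
    val rdd = sc.textFile("c:/tmp/files")
    rdd.collect().foreach(println)

    // Specific files, comma separated
    val rdd2 = sc.textFile("c:/tmp/files/text01.txt,c:/tmp/files/text02.txt")
    rdd2.collect().foreach(println)

    // Files matching a wildcard pattern
    val rdd3 = sc.textFile("c:/tmp/files/text*.txt")
    rdd3.collect().foreach(println)

    // (file name, file contents) pairs
    val rddWhole = sc.wholeTextFiles("c:/tmp/files")
    rddWhole.collect().foreach(t => println(t._1 + " => " + t._2))

    // CSV parsing: split each record on the comma delimiter
    val rdd6 = rdd.map(_.split(","))
    rdd6.collect().foreach(cols => println(cols(0) + " / " + cols(1)))

    spark.stop()
  }
}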
