How to create an empty RDD?


When working with a data streaming application, we sometimes find that the incoming data is empty. To handle such situations, we can create an empty RDD, which can then be saved to another data source.

We often need to create an empty RDD in Spark. An empty RDD can be created in several ways: without partitions, with partitions, or as a pair RDD.

In this article, we will see these with Scala, Java and PySpark examples.

  • Spark sc.emptyRDD – Creates empty RDD with no partition
  • Create an Empty RDD with Partition
  • Creating an Empty pair RDD
  • Java – creating an empty RDD
  • PySpark – creating an empty RDD

Spark sc.emptyRDD – Creates empty RDD with no partition

In Spark, calling the emptyRDD() function on the SparkContext object creates an empty RDD with no partitions or elements.
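A minimal Scala sketch of this call follows. The object name, the localSession() helper, the app name, and the local[3] master are illustrative assumptions, not part of the original article.

```scala
import org.apache.spark.sql.SparkSession

object CreateEmptyRDD {
  // Hypothetical helper: builds (or reuses) a local session for this sketch.
  def localSession(): SparkSession =
    SparkSession.builder()
      .master("local[3]")          // assumption: local mode with 3 threads
      .appName("CreateEmptyRDD")   // assumption: illustrative app name
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = localSession()

    // Without a type parameter the element type is inferred as Nothing.
    val rdd = spark.sparkContext.emptyRDD
    // With a type parameter we get an empty RDD[String].
    val rddString = spark.sparkContext.emptyRDD[String]

    println(rdd)
    println(rddString)
    println("Num of Partitions: " + rdd.getNumPartitions)

    spark.stop()
  }
}
```

Printing an RDD shows its class name, its id in square brackets, and the call site where it was created.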

spark.sparkContext.emptyRDD creates an RDD that prints as EmptyRDD[0], and spark.sparkContext.emptyRDD[String] creates one of String type that prints as EmptyRDD[1]. The bracketed number is the RDD's id within the SparkContext, not a partition count. Both of these empty RDDs are created with 0 partitions, which rdd.getNumPartitions confirms.

Note that writing an empty RDD creates a folder containing a ._SUCCESS.crc file and a zero-size _SUCCESS file.
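The note above can be checked with a sketch along these lines. It assumes a local filesystem and writes under a fresh temporary directory. The object name and the saveAndList() helper are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object SaveEmptyRDD {
  // Hypothetical helper: writes an empty RDD[String] and returns the names
  // of the files found in the output folder.
  def saveAndList(spark: SparkSession): Seq[String] = {
    // saveAsTextFile requires that the target folder does not exist yet,
    // so we point at a new sub-folder of a fresh temp directory.
    val target = Files.createTempDirectory("empty-rdd-").resolve("out")
    spark.sparkContext.emptyRDD[String].saveAsTextFile(target.toString)
    // list() returns null if the folder is missing, hence the Option wrapper.
    Option(target.toFile.list()).map(_.toSeq.sorted).getOrElse(Seq.empty)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")        // assumption: local mode
      .appName("SaveEmptyRDD")   // assumption: illustrative app name
      .getOrCreate()

    saveAndList(spark).foreach(println)

    spark.stop()
  }
}
```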

Once we have an empty RDD, we can easily create an empty DataFrame from it.
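One way to do this, sketched below, is to pass an empty RDD[Row] together with a schema to createDataFrame. The two-column schema, the object name, and the emptyDataFrame() helper are assumptions made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object EmptyDataFrameFromRDD {
  // Hypothetical schema, chosen only for illustration.
  val schema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("age", IntegerType, nullable = true)
  ))

  // Builds a DataFrame with the schema above and no rows.
  def emptyDataFrame(spark: SparkSession): DataFrame =
    spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")                 // assumption: local mode
      .appName("EmptyDataFrameFromRDD")   // assumption: illustrative app name
      .getOrCreate()

    val df = emptyDataFrame(spark)
    df.printSchema()
    println("Row count: " + df.count())

    spark.stop()
  }
}
```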

Create an Empty RDD with Partition

Using Spark's sc.parallelize(), we can create an empty RDD that has partitions. Writing a partitioned RDD to a file results in the creation of multiple part files.
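A minimal sketch of this approach, assuming a session started with a local[3] master so that the default parallelism is 3. The object and helper names are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object EmptyRDDWithPartitions {
  // Hypothetical helper: an empty RDD[String] that still has partitions.
  // With no explicit count, parallelize uses the context's default parallelism.
  def emptyWithDefaultPartitions(spark: SparkSession): RDD[String] =
    spark.sparkContext.parallelize(Seq.empty[String])

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[3]")                  // assumption: 3 local threads
      .appName("EmptyRDDWithPartitions")   // assumption: illustrative app name
      .getOrCreate()

    val rdd2 = emptyWithDefaultPartitions(spark)
    println(rdd2)
    println("Num of Partitions: " + rdd2.getNumPartitions)
    println("Is empty: " + rdd2.isEmpty())

    spark.stop()
  }
}
```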

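spark.sparkContext.parallelize(Seq.empty[String]) creates a ParallelCollectionRDD (printed as ParallelCollectionRDD[2]) with 3 partitions. When no partition count is passed, parallelize() uses the context's default parallelism, which was 3 in this run (for example, with a local[3] master).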

sc.parallelize() also accepts an explicit number of partitions as its second argument.
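As a sketch of the explicit form, the example below passes numSlices to parallelize(). The value 5, the object name, and the helper name are arbitrary choices for illustration, and this may not match the original article's second example.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object EmptyRDDExplicitPartitions {
  // Hypothetical helper: an empty RDD[String] with a caller-chosen partition count.
  def emptyWithPartitions(spark: SparkSession, numPartitions: Int): RDD[String] =
    spark.sparkContext.parallelize(Seq.empty[String], numSlices = numPartitions)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[3]")                      // assumption: local mode
      .appName("EmptyRDDExplicitPartitions")   // assumption: illustrative app name
      .getOrCreate()

    val rdd3 = emptyWithPartitions(spark, 5)
    println(rdd3)
    println("Num of Partitions: " + rdd3.getNumPartitions)

    spark.stop()
  }
}
```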

Creating an Empty pair RDD

We mostly work with pair RDDs, so it is useful to know how to create an empty one. For example, spark.sparkContext.emptyRDD[(String, Int)] creates an empty RDD of String and Int pairs.
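A short sketch of an empty pair RDD, assuming a local session. The object and helper names are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object EmptyPairRDD {
  // Hypothetical helper: an empty RDD whose elements are (String, Int) pairs.
  def emptyPairs(spark: SparkSession): RDD[(String, Int)] =
    spark.sparkContext.emptyRDD[(String, Int)]

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[3]")        // assumption: local mode
      .appName("EmptyPairRDD")   // assumption: illustrative app name
      .getOrCreate()

    val resultRDD = emptyPairs(spark)
    println(resultRDD)

    // Because the element type is a pair, key/value operations such as
    // keys are available on this RDD.
    println("Number of keys: " + resultRDD.keys.count())

    spark.stop()
  }
}
```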

Printing this RDD yields a line of the form EmptyRDD[id] at emptyRDD at <call site>.

Java – creating an empty RDD

As in Scala, in Java we can create an empty RDD by calling the emptyRDD() function on a JavaSparkContext object, which returns a JavaRDD.

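PySpark – creating an empty RDD

In PySpark, an empty RDD is created the same way, by calling emptyRDD() on the SparkContext, for example spark.sparkContext.emptyRDD().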

Complete example in Scala
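The following is a consolidated sketch that puts the approaches from this article into one program. It is not the original listing. The object name, the partitionCounts() helper, the local[3] master, and the explicit partition count of 5 are assumptions for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object CreateEmptyRDDComplete {
  // Hypothetical helper: creates each kind of empty RDD and returns the
  // partition counts in order (no partitions, default, explicit, pair).
  def partitionCounts(spark: SparkSession): Seq[Int] = {
    val sc = spark.sparkContext

    // 1. Empty RDD with no partitions
    val rddString = sc.emptyRDD[String]

    // 2. Empty RDDs with partitions (default and explicit counts)
    val rdd2 = sc.parallelize(Seq.empty[String])
    val rdd3 = sc.parallelize(Seq.empty[String], numSlices = 5)

    // 3. Empty pair RDD
    val pairRDD: RDD[(String, Int)] = sc.emptyRDD[(String, Int)]

    Seq(rddString, rdd2, rdd3, pairRDD).map(_.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[3]")                  // assumption: 3 local threads
      .appName("CreateEmptyRDDComplete")   // assumption: illustrative app name
      .getOrCreate()

    // Untyped variant: the element type is inferred as Nothing.
    val rdd = spark.sparkContext.emptyRDD
    println(rdd)

    partitionCounts(spark).foreach(n => println("Num of Partitions: " + n))

    spark.stop()
  }
}
```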

