Persistence is the continued or prolonged existence of something; in Spark, it means keeping an RDD or DataFrame available in memory and/or on disk so it can be reused across actions. All the storage levels that Spark's persist() method supports are defined in the org.apache.spark.storage.StorageLevel class. The storage level specifies how and where to persist or cache Read more…
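As a minimal sketch (runnable in spark-shell, where sc is the predefined SparkContext), persisting an RDD with an explicit storage level looks like this:

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000)

// Keep the data deserialized in memory, spilling to disk if it doesn't fit
val persisted = rdd.persist(StorageLevel.MEMORY_AND_DISK)

persisted.count() // the first action materializes the persisted data
persisted.sum()   // later actions reuse it instead of recomputing the lineage

persisted.unpersist() // release the storage when no longer needed
```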


The Spark shuffle is a mechanism for redistributing or re-partitioning data so that it is grouped differently across partitions. A shuffle is a very expensive operation, as it moves data between executors on the same worker node or even between Read more…
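For illustration, reduceByKey is one operation that forces a shuffle, since all values for a given key must end up in the same partition (a spark-shell sketch):

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("c", 1)))

// reduceByKey introduces a shuffle boundary: records with matching keys
// are moved across executors so each key lands on exactly one partition
val counts = pairs.reduceByKey(_ + _)

counts.collect().foreach(println) // (a,2), (b,1), (c,1)
```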


In Spark or PySpark, repartition is used to increase or decrease the number of RDD, DataFrame, or Dataset partitions, whereas coalesce is used only to decrease the number of partitions, in a more efficient way. In this post, we will learn what Read more…
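A short spark-shell sketch contrasting the two (the partition counts here are arbitrary):

```scala
val rdd = sc.parallelize(1 to 100, 6)
rdd.getNumPartitions // 6

// repartition can increase or decrease partitions, but always shuffles
val more = rdd.repartition(10)
more.getNumPartitions // 10

// coalesce only decreases partitions and avoids a full shuffle
// by merging existing partitions
val fewer = rdd.coalesce(2)
fewer.getNumPartitions // 2
```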


Spark's PairRDDFunctions class provides several functions for working with pair RDDs (RDDs of key-value pairs). In this post, we will learn those functions with Scala examples. All these functions are grouped into transformations and actions, similar to regular RDDs. Spark Pair Read more…
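A few of those functions in a minimal spark-shell sketch (the sample data is made up for illustration):

```scala
val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("a", 3)))

// Transformations from PairRDDFunctions
val doubled = pairs.mapValues(_ * 2)   // (b,4), (a,2), (a,6)
val summed  = pairs.reduceByKey(_ + _) // (a,4), (b,2)
val sorted  = pairs.sortByKey()        // (a,1), (a,3), (b,2)

// An action from PairRDDFunctions
val counts = pairs.countByKey()        // Map(a -> 2, b -> 1)
```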


RDD actions are operations that return raw values; in other words, any RDD function that returns something other than RDD[T] is considered an action in Spark programming. In this tutorial, we will learn RDD actions with Scala examples. As Read more…
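For example, each call below returns a plain value rather than an RDD, so each is an action (a spark-shell sketch):

```scala
val rdd = sc.parallelize(Seq(5, 1, 4, 2, 3))

// Each action triggers execution of the RDD lineage
rdd.count()       // 5
rdd.first()       // 5
rdd.take(3)       // Array(5, 1, 4)
rdd.reduce(_ + _) // 15
rdd.collect()     // Array(5, 1, 4, 2, 3)
```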


RDD transformations are Spark operations that, when executed on an RDD, result in one or more new RDDs. Since RDDs are immutable in nature, transformations always create a new RDD without updating an existing one; hence, this builds up an RDD lineage. RDD Read more…
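A minimal sketch of chained transformations; note that transformations are lazy, and only the final action triggers execution of the whole lineage:

```scala
val lines = sc.parallelize(Seq("spark rdd", "rdd transformations"))

// Each transformation returns a new RDD; nothing executes yet
val words    = lines.flatMap(_.split(" ")) // RDD[String]
val filtered = words.filter(_ != "rdd")    // new RDD, lineage grows
val lengths  = filtered.map(_.length)      // another new RDD

// Only the action evaluates the lineage
lengths.collect() // Array(5, 15)
```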


When working with a data streaming application, we may sometimes find that the incoming data is empty. To handle such situations, we can create an empty RDD, which can then be saved in another Read more…
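As a sketch, an empty RDD can be created in either of these ways in spark-shell:

```scala
// A typed empty RDD with no partitions
val empty = sc.emptyRDD[String]
empty.isEmpty()        // true
empty.getNumPartitions // 0

// Alternatively, an empty RDD with the default number of partitions
val empty2 = sc.parallelize(Seq.empty[String])
```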


A Spark RDD can be created in several ways using the Scala and PySpark languages. For example, it can be created by using sparkContext.parallelize(), from a text file, from another RDD, or from a DataFrame or Dataset. Resilient Distributed Datasets (RDD) is the fundamental data structure of Read more…
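A quick sketch of those creation paths (the file path is hypothetical; spark is the SparkSession predefined in spark-shell):

```scala
// 1. From a local collection
val fromSeq = sc.parallelize(Seq(1, 2, 3))

// 2. From a text file (hypothetical path)
val fromFile = sc.textFile("/tmp/data.txt")

// 3. From another RDD: every transformation yields a new RDD
val derived = fromSeq.map(_ * 2)

// 4. From a DataFrame or Dataset via .rdd
val fromDF = spark.range(5).rdd
```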


In this post, we are going to discuss reading text and CSV files. Spark Core provides the textFile() and wholeTextFiles() methods in the SparkContext class, which are used to read single or multiple text or CSV files into a single Spark RDD. Using these methods we can also read Read more…
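A minimal sketch of both methods (all file paths here are hypothetical):

```scala
// textFile: each line of the file becomes one element of the RDD;
// a comma-separated list of paths reads multiple files at once
val lines = sc.textFile("/tmp/file1.csv,/tmp/file2.csv")

// wholeTextFiles: one record per file, as (filePath, fileContent) pairs
val files = sc.wholeTextFiles("/tmp/csvdir")

// Split the CSV lines into columns
val columns = lines.map(_.split(","))
```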


In this post we will learn about sparkContext.parallelize. Let's see how to create a Spark RDD using sparkContext.parallelize. Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark: an immutable, distributed collection of objects. Each dataset in an RDD is divided into logical Read more…
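For example, in spark-shell:

```scala
// Distribute a local collection across the cluster;
// the second argument sets the number of partitions explicitly
val rdd = sc.parallelize(1 to 10, 4)

rdd.getNumPartitions // 4
rdd.glom().collect() // inspect the elements grouped per partition
```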