Category: Spark


Prerequisite Step 1: Download Spark from the following URL: https://spark.apache.org/downloads.html. Here we need to select the Spark version (the version on which we are going to work) and the package type (a version compatible with Hadoop) Read more…


While working in Apache Spark with Scala, we often need to convert an RDD to a DataFrame or Dataset, as these provide more advantages over RDD. For instance, a DataFrame is a distributed collection of data organised into named columns, similar to Database Read more…
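As a quick illustration, here is a minimal sketch of both conversions, assuming a local SparkSession; the Person case class, object name, and sample data are illustrative, not from the linked article:

```scala
import org.apache.spark.sql.SparkSession

// Case class defined at the top level so Spark can derive an Encoder for it
case class Person(name: String, age: Int)

object RddToDataFrame extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("RddToDataFrame")
    .getOrCreate()
  import spark.implicits._ // brings toDF()/toDS() into scope

  val rdd = spark.sparkContext.parallelize(Seq(("James", 30), ("Anna", 25)))

  // RDD -> DataFrame with named columns
  val df = rdd.toDF("name", "age")
  df.printSchema()

  // RDD -> Dataset (typed) via the case class
  val ds = rdd.map { case (n, a) => Person(n, a) }.toDS()
  ds.show()
}
```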


Spark Accumulators are shared variables which are only “added” to through an associative and commutative operation and are used to implement counters (similar to MapReduce counters) or sum operations. Spark by default supports creating accumulators of any numeric type Read more…
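A minimal sketch using the built-in long accumulator, assuming a local SparkSession; the accumulator name and sample data are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object AccumulatorExample extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("AccumulatorExample")
    .getOrCreate()

  // Created on the driver; tasks can only add to it
  val negativeCount = spark.sparkContext.longAccumulator("negativeCount")

  val nums = spark.sparkContext.parallelize(Seq(1, -2, 3, -4, 5))
  nums.foreach(n => if (n < 0) negativeCount.add(1))

  // The merged value is only reliable on the driver, after an action has run
  println(s"negative values: ${negativeCount.value}") // 2
}
```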


In Spark RDD and DataFrame, Broadcast variables are read-only shared variables that are cached and available on all nodes in a cluster so that tasks can access or use them. Instead of sending this data along with every task, Spark Read more…
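A minimal sketch of a broadcast lookup table, assuming a local SparkSession; the map contents and names are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object BroadcastExample extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("BroadcastExample")
    .getOrCreate()

  // Shipped to each executor once and cached, instead of with every task
  val stateNames = Map("NY" -> "New York", "CA" -> "California")
  val bcStates = spark.sparkContext.broadcast(stateNames)

  val people = spark.sparkContext.parallelize(Seq(("James", "NY"), ("Anna", "CA")))
  val resolved = people.map { case (name, code) =>
    (name, bcStates.value.getOrElse(code, "Unknown"))
  }
  resolved.collect().foreach(println) // (James,New York), (Anna,California)
}
```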


Spark Cache and Persist are optimization techniques that improve the performance of RDD jobs that are iterative and interactive. In this article, you will learn what Cache and Persist are, how to use them on an RDD, the difference between Caching and Persistence, and how to use Read more…
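A minimal sketch of both calls, assuming a local SparkSession; the RDD and object names are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachePersistExample extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("CachePersistExample")
    .getOrCreate()

  val rdd = spark.sparkContext.parallelize(1 to 1000000).map(_ * 2)

  // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
  rdd.cache()
  println(rdd.count()) // first action materialises and caches the RDD
  println(rdd.count()) // served from the cache

  // persist() lets you choose a storage level explicitly
  val persisted = rdd.map(_ + 1).persist(StorageLevel.MEMORY_AND_DISK)
  println(persisted.sum())
  persisted.unpersist() // release the storage when done
}
```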


Persistence is the continued or prolonged existence of something; here we are considering the DB as the persistence location. All the different persistence storage levels Spark supports (used with the persist() method) are available in the org.apache.spark.storage.StorageLevel class. The storage level specifies how and where to persist or cache Read more…
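A brief sketch of selecting a storage level, assuming a local SparkSession; the levels named in the comments are real constants on StorageLevel, while the RDD itself is illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object StorageLevelExample extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("StorageLevelExample")
    .getOrCreate()

  val rdd = spark.sparkContext.parallelize(1 to 100)

  // A few of the levels defined on org.apache.spark.storage.StorageLevel:
  //   MEMORY_ONLY         - deserialized objects in memory (default for cache())
  //   MEMORY_ONLY_SER     - serialized in memory; smaller, but more CPU to read
  //   MEMORY_AND_DISK     - spill partitions to disk when memory is full
  //   DISK_ONLY           - store partitions only on disk
  //   MEMORY_ONLY_2       - like MEMORY_ONLY, replicated on two nodes
  val persisted = rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
  println(persisted.count())
}
```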


The Spark shuffle is a mechanism for redistributing or re-partitioning data so that the data is grouped differently across partitions. Spark shuffle is a very expensive operation, as it moves data between executors on the same worker node or even between Read more…
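A minimal sketch of an operation that forces a shuffle, assuming a local SparkSession; the word data is illustrative:

```scala
import org.apache.spark.sql.SparkSession

object ShuffleExample extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("ShuffleExample")
    .getOrCreate()

  val words = spark.sparkContext
    .parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 4)
    .map(w => (w, 1))

  // reduceByKey regroups records by key, which forces a shuffle;
  // the second argument sets the number of post-shuffle partitions
  val counts = words.reduceByKey(_ + _, numPartitions = 2)
  println(counts.toDebugString) // the lineage shows a ShuffledRDD stage
  counts.collect().foreach(println)
}
```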


In Spark or PySpark, repartition is used to increase or decrease the number of RDD, DataFrame, or Dataset partitions, whereas Spark coalesce is used only to decrease the number of partitions, in an efficient way. In this post, we will learn what Read more…
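A minimal sketch contrasting the two calls, assuming a local SparkSession; the partition counts are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object RepartitionCoalesce extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("RepartitionCoalesce")
    .getOrCreate()

  val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 6)
  println(rdd.getNumPartitions) // 6

  // repartition can grow or shrink; it always performs a full shuffle
  val grown = rdd.repartition(12)
  println(grown.getNumPartitions) // 12

  // coalesce only shrinks; it merges existing partitions without a full shuffle
  val shrunk = rdd.coalesce(2)
  println(shrunk.getNumPartitions) // 2
}
```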


Spark has a PairRDDFunctions class with several functions to work with Pair RDDs, i.e., RDDs of key-value pairs. In this post, we will learn those functions with Scala examples. All these functions are grouped into Transformations and Actions, similar to regular RDDs. Spark Pair Read more…
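A small sampler of those functions, assuming a local SparkSession; the fruit data is illustrative and only a handful of the class's methods are shown:

```scala
import org.apache.spark.sql.SparkSession

object PairRddFunctions extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("PairRddFunctions")
    .getOrCreate()

  // An RDD[(K, V)] gains PairRDDFunctions through an implicit conversion
  val pairs = spark.sparkContext.parallelize(
    Seq(("apple", 2), ("banana", 1), ("apple", 3)))

  // Transformations
  pairs.reduceByKey(_ + _).collect()             // Array((apple,5), (banana,1))
  pairs.groupByKey().mapValues(_.sum).collect()  // same result, shuffles all values
  pairs.sortByKey().collect()

  // Actions
  println(pairs.countByKey()) // Map(apple -> 2, banana -> 1)
  println(pairs.lookup("apple")) // Seq(2, 3)
}
```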


RDD actions are operations that return raw values. In other words, any RDD function that returns something other than RDD[T] is considered an action in Spark programming. In this tutorial, we will learn RDD actions with Scala examples. As Read more…
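A minimal sketch of a few common actions, assuming a local SparkSession; the numbers are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object RddActions extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("RddActions")
    .getOrCreate()

  val rdd = spark.sparkContext.parallelize(Seq(5, 1, 4, 2, 3))

  // Actions return plain values to the driver and trigger execution
  println(rdd.count())       // 5
  println(rdd.first())       // 5
  println(rdd.reduce(_ + _)) // 15
  println(rdd.max())         // 5
  rdd.take(3).foreach(println)
  rdd.collect().foreach(println) // brings ALL data to the driver; use with care
}
```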