Tag: DataFrame


While working in Apache Spark with Scala, we often need to convert an RDD to a DataFrame or Dataset, as these provide more advantages over RDD. For instance, a DataFrame is a distributed collection of data organised into named columns, similar to database tables Read more…
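As a quick illustration of the conversion this article covers, here is a minimal Scala sketch using toDF(); the sample data and column names are illustrative assumptions, not taken from the article:

```scala
import org.apache.spark.sql.SparkSession

object RddToDataFrame extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("RddToDataFrame")
    .getOrCreate()
  import spark.implicits._ // brings toDF() into scope

  // Illustrative RDD of (name, age) pairs
  val rdd = spark.sparkContext.parallelize(Seq(("James", 30), ("Anna", 25)))

  // Convert the RDD to a DataFrame with named columns
  val df = rdd.toDF("name", "age")
  df.printSchema()
  df.show()
}
```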


In Spark RDD and DataFrame, broadcast variables are read-only shared variables that are cached and available on all nodes in a cluster so that tasks can access or use them. Instead of sending this data along with every task, Spark Read more…
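As a hedged sketch of the idea, the snippet below broadcasts a small lookup map once per executor instead of shipping it with every task; the map contents and names are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession

object BroadcastExample extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("BroadcastExample")
    .getOrCreate()

  // Small read-only lookup table, cached on each node via broadcast
  val countryCodes = Map("USA" -> "United States", "IN" -> "India")
  val broadcastCodes = spark.sparkContext.broadcast(countryCodes)

  val people = spark.sparkContext.parallelize(Seq(("James", "USA"), ("Anna", "IN")))

  // Tasks read broadcastCodes.value rather than a per-task copy of the map
  val resolved = people.map { case (name, code) =>
    (name, broadcastCodes.value.getOrElse(code, "Unknown"))
  }
  resolved.collect().foreach(println)
}
```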


The Spark shuffle is a mechanism for redistributing or re-partitioning data so that the data is grouped differently across partitions. Shuffling is a very expensive operation as it moves data between executors on the same worker node, or even between Read more…
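To make the cost concrete, here is a minimal sketch (sample data assumed) where reduceByKey forces a shuffle; the stage boundary shows up as a ShuffledRDD in the lineage printed by toDebugString:

```scala
import org.apache.spark.sql.SparkSession

object ShuffleExample extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("ShuffleExample")
    .getOrCreate()
  val sc = spark.sparkContext

  val words = sc.parallelize(Seq("spark", "scala", "spark", "rdd", "scala"), numSlices = 4)

  // reduceByKey regroups records by key, moving data across partitions -- a shuffle
  val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

  // The lineage shows a ShuffledRDD where the stage boundary sits
  println(counts.toDebugString)
  counts.collect().foreach(println)
}
```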


In Spark or PySpark, repartition is used to increase or decrease the number of partitions of an RDD, DataFrame, or Dataset, whereas coalesce is used only to decrease the number of partitions in an efficient way. In this post, we will learn what Read more…
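Ahead of the full post, this minimal sketch contrasts the two calls; spark.range is used only to get a small Dataset to demonstrate with:

```scala
import org.apache.spark.sql.SparkSession

object RepartitionVsCoalesce extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("RepartitionVsCoalesce")
    .getOrCreate()

  val ds = spark.range(0, 100) // small Dataset for demonstration

  // repartition can grow or shrink the partition count; it triggers a full shuffle
  val repartitioned = ds.repartition(10)
  println(repartitioned.rdd.getNumPartitions) // 10

  // coalesce only shrinks the count, merging partitions and avoiding a full shuffle
  val coalesced = repartitioned.coalesce(2)
  println(coalesced.rdd.getNumPartitions) // 2
}
```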


Spark RDD can be created in several ways using the Scala and PySpark languages. For example, it can be created by using sparkContext.parallelize(), from a text file, from another RDD, or from a DataFrame or Dataset. Resilient Distributed Datasets (RDD) is the fundamental data structure of Read more…
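The sketch below walks through those creation paths in Scala; the file path is a placeholder and the data is illustrative:

```scala
import org.apache.spark.sql.SparkSession

object RddCreation extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("RddCreation")
    .getOrCreate()
  val sc = spark.sparkContext

  // 1. From a local collection via parallelize()
  val fromSeq = sc.parallelize(Seq(1, 2, 3, 4, 5))

  // 2. From a text file (placeholder path -- point it at real data)
  // val fromFile = sc.textFile("/tmp/input.txt")

  // 3. From another RDD via a transformation
  val fromRdd = fromSeq.map(_ * 2)

  // 4. From a DataFrame/Dataset by dropping down to its underlying RDD
  val fromDs = spark.range(0, 5).rdd

  println(fromRdd.collect().mkString(", "))
}
```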