Tag: RDD


While working in Apache Spark with Scala, we often need to convert an RDD to a DataFrame or Dataset, as these offer more advantages than RDDs. For instance, a DataFrame is a distributed collection of data organised into named columns, similar to database Read more…
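
A minimal sketch of the conversion, as you might run it in spark-shell (the `Person` case class here is a hypothetical schema for illustration):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical schema used only for this example
case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").appName("RddToDf").getOrCreate()
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))

val df = rdd.toDF()   // RDD -> DataFrame with columns name, age
val ds = rdd.toDS()   // RDD -> Dataset[Person], keeping the compile-time type
df.printSchema()
```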


Spark accumulators are shared variables that are only “added to” through an associative and commutative operation, and are used to implement counters (similar to MapReduce counters) or sums. By default, Spark supports creating accumulators of any numeric type Read more…
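
A minimal sketch using a named long accumulator as a counter (the accumulator name and sample data are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("AccumExample").getOrCreate()
val sc = spark.sparkContext

// Named long accumulator used as a simple counter
val negCount = sc.longAccumulator("negativeValues")

sc.parallelize(Seq(1, -2, 3, -4, 5)).foreach { n =>
  if (n < 0) negCount.add(1)  // tasks may only add; they cannot read the value
}
println(s"Negative values: ${negCount.value}")  // the value is read on the driver
```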


In Spark RDD and DataFrame, broadcast variables are read-only shared variables that are cached and available on all nodes in a cluster so that tasks can access and use them. Instead of sending this data along with every task, Spark Read more…
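
A minimal sketch of broadcasting a small lookup map so it is shipped once per executor rather than once per task (the map contents are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("BroadcastExample").getOrCreate()
val sc = spark.sparkContext

// Small lookup table, broadcast once and cached on every node
val states = Map("NY" -> "New York", "CA" -> "California")
val bStates = sc.broadcast(states)

val names = sc.parallelize(Seq("NY", "CA", "NY"))
  .map(code => bStates.value.getOrElse(code, "Unknown"))  // read-only access inside tasks

names.collect().foreach(println)
```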


Spark Cache and Persist are optimization techniques that improve the performance of RDD jobs that are iterative and interactive. In this article, you will learn what Cache and Persist are, how to use them on an RDD, the difference between caching and persistence, and how to use Read more…
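
A minimal sketch of both, assuming a small in-memory dataset for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().master("local[*]").appName("CacheExample").getOrCreate()
val sc = spark.sparkContext

val words = sc.parallelize(Seq("a b c", "a b", "c")).flatMap(_.split(" "))

words.cache()                                    // shorthand for persist(StorageLevel.MEMORY_ONLY) on RDDs
// words.persist(StorageLevel.MEMORY_AND_DISK)  // or pick a storage level explicitly

println(words.count())  // first action computes and caches the partitions
println(words.count())  // later actions reuse the cached data
```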


The Spark shuffle is a mechanism for redistributing or re-partitioning data so that it is grouped differently across partitions. The shuffle is a very expensive operation, as it moves data between executors on the same worker node or even between Read more…
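
A minimal sketch of an operation that triggers a shuffle (the pair data is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("ShuffleExample").getOrCreate()
val sc = spark.sparkContext

val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// reduceByKey forces a shuffle: all records with the same key
// must end up in the same partition before they can be combined
val counts = pairs.reduceByKey(_ + _)

println(counts.toDebugString)  // the ShuffledRDD marks the stage boundary
counts.collect().foreach(println)
```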


In Spark or PySpark, repartition is used to increase or decrease the number of RDD, DataFrame, or Dataset partitions, whereas coalesce is used only to decrease the number of partitions, in a more efficient way. In this post, we will learn what Read more…
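
A minimal sketch contrasting the two on an RDD (partition counts are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("RepartitionExample").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(1 to 100, 10)
println(rdd.getNumPartitions)              // 10

val more  = rdd.repartition(20)            // full shuffle; can increase or decrease
val fewer = rdd.coalesce(2)                // merges partitions without a full shuffle; decrease only

println(more.getNumPartitions)             // 20
println(fewer.getNumPartitions)            // 2
```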


When working with a data streaming application, we may sometimes find that the incoming data is empty. To handle such situations, we can create an empty RDD, which can be saved in another Read more…
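
A minimal sketch of the two common ways to create an empty RDD:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("EmptyRddExample").getOrCreate()
val sc = spark.sparkContext

val empty1 = sc.emptyRDD[String]                // typed empty RDD with no partitions
val empty2 = sc.parallelize(Seq.empty[String])  // empty RDD with the default partitions

println(empty1.isEmpty())                       // true
println(empty2.getNumPartitions)
```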


A Spark RDD can be created in several ways using the Scala & PySpark languages. For example, it can be created by using sparkContext.parallelize(), from a text file, or from another RDD, DataFrame, or Dataset. A Resilient Distributed Dataset (RDD) is the fundamental data structure of Read more…
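
A minimal sketch of those creation paths in Scala (the file path and DataFrame reference are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("CreateRddExample").getOrCreate()
val sc = spark.sparkContext

val fromSeq = sc.parallelize(Seq(1, 2, 3))    // from a local collection
val fromRdd = fromSeq.map(_ * 2)              // from another RDD (transformation)
// val fromFile = sc.textFile("data.txt")     // from a text file (hypothetical path)
// val fromDf   = someDataFrame.rdd           // from an existing DataFrame/Dataset

fromRdd.collect().foreach(println)
```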


In this post, we are going to discuss reading text and CSV files. Spark Core provides the textFile() & wholeTextFiles() methods in the SparkContext class, which are used to read single or multiple text or CSV files into a single Spark RDD. Using these methods we can also read Read more…
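
A minimal sketch of both methods (the paths and glob pattern are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("ReadTextExample").getOrCreate()
val sc = spark.sparkContext

// textFile: one RDD element per line, across all matched files
val lines = sc.textFile("data/*.csv")

// wholeTextFiles: one (filePath, fileContent) pair per file
val files = sc.wholeTextFiles("data")

println(lines.count())
files.keys.collect().foreach(println)
```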