Author: theprogrammersbook


Prerequisite Step 1: Download Spark from the following URL: https://spark.apache.org/downloads.html Here we need to select the Spark version (the version on which we are going to work) and the package type (a version compatible with Hadoop) Read more…


nagaraju@nagaraju:~$ cut -d: -f1 /etc/passwd
root
daemon
bin
sys
sync
games
man
nagaraju@nagaraju:~$ cut -d: -f1 /etc/group
root
daemon
bin
sys
adm
tty
disk
lp
mail
news
uucp
man
pro


What: Least Common Multiple (LCM). How: Example: the LCM of 15 and 20. The prime factors of 15 are 3 * 5; the prime factors of 20 are 2 * 2 * 5. Taking the union of these factors (each prime as many times as it appears in either number) gives 2, 2, 5, 3, so the LCM is 2 * 2 * 5 * 3 = 60. Real Time Example: L.C.M. Read more…
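
As a quick companion to the walkthrough above, here is a small Scala sketch that computes the LCM. Rather than taking the union of prime factors, it uses Euclid's algorithm and the identity lcm(a, b) * gcd(a, b) = a * b; the object and method names are just for this example.

object LcmExample {
  // Greatest common divisor via Euclid's algorithm
  def gcd(a: Long, b: Long): Long = if (b == 0) a else gcd(b, a % b)

  // lcm(a, b) * gcd(a, b) == a * b; divide before multiplying to limit overflow
  def lcm(a: Long, b: Long): Long = a / gcd(a, b) * b

  def main(args: Array[String]): Unit = {
    println(lcm(15, 20)) // 60, matching the prime-factor example above
  }
}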


Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. The Hadoop framework application works in an environment that provides distributed storage and computation across clusters Read more…
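
To make the "distributed storage" part concrete, here is a minimal Scala sketch that lists a directory through Hadoop's FileSystem API. It assumes a reachable HDFS configured via the default fs.defaultFS, and the /user/data path is hypothetical.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsListing {
  def main(args: Array[String]): Unit = {
    // Picks up fs.defaultFS from the Hadoop configuration on the classpath
    val conf = new Configuration()
    val fs   = FileSystem.get(conf)

    // List the files under a (hypothetical) HDFS directory
    fs.listStatus(new Path("/user/data"))
      .foreach(status => println(s"${status.getPath} (${status.getLen} bytes)"))
  }
}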


While working in Apache Spark with Scala, we often need to convert an RDD to a DataFrame or Dataset, as these provide more advantages over RDDs. For instance, a DataFrame is a distributed collection of data organised into named columns, similar to database Read more…
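
A minimal sketch of both conversions, assuming a local SparkSession; the Person case class and the sample rows are made up for this example.

import org.apache.spark.sql.SparkSession

object RddToDataFrame {
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RddToDataFrame").master("local[*]").getOrCreate()
    import spark.implicits._

    val rdd = spark.sparkContext.parallelize(Seq(Person("Alice", 30), Person("Bob", 25)))

    // RDD -> DataFrame, with column names inferred from the case class fields
    val df = rdd.toDF()
    df.printSchema()

    // RDD -> Dataset, keeping the typed Person API
    val ds = rdd.toDS()
    ds.show()

    spark.stop()
  }
}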


Spark Accumulators are shared variables which are only “added” to through an associative and commutative operation, and are used to implement counters (similar to MapReduce counters) or sum operations. By default, Spark supports creating accumulators of any numeric type Read more…
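
For illustration, a built-in long accumulator used as a counter, assuming a local SparkSession; the accumulator name and the even-number condition are made up for this example.

import org.apache.spark.sql.SparkSession

object AccumulatorExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("AccumulatorExample").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Built-in long accumulator, used here as a simple counter
    val evenCount = sc.longAccumulator("evenCount")

    sc.parallelize(1 to 100).foreach { n =>
      if (n % 2 == 0) evenCount.add(1) // tasks only "add"; the driver reads the result
    }

    println(s"Even numbers seen: ${evenCount.value}") // 50
    spark.stop()
  }
}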


In Spark RDD and DataFrame, broadcast variables are read-only shared variables that are cached and available on all nodes in a cluster so that tasks can access and use them. Instead of sending this data along with every task, Spark Read more…
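
A minimal sketch, assuming a local SparkSession; the country-code lookup table is made up for this example.

import org.apache.spark.sql.SparkSession

object BroadcastExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("BroadcastExample").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Small lookup table shipped once per executor instead of with every task
    val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

    val codes    = sc.parallelize(Seq("IN", "US", "IN"))
    val resolved = codes.map(code => countryNames.value.getOrElse(code, "Unknown"))

    resolved.collect().foreach(println)
    spark.stop()
  }
}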


Spark Cache and Persist are optimization techniques that improve the performance of RDD jobs that are iterative and interactive. In this article, you will learn what Cache and Persist are, how to use them on an RDD, the difference between Caching and Persistence, and how to use Read more…
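
A minimal sketch of both calls, assuming a local SparkSession; the RDDs are made up for this example. Note that cache() and persist() only mark an RDD for storage; its partitions are actually materialised by the first action.

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachePersistExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CachePersistExample").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val squares = sc.parallelize(1L to 1000000L).map(n => n * n)

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
    squares.cache()
    println(squares.count()) // first action computes and stores the partitions
    println(squares.sum())   // subsequent actions reuse the cached partitions

    // persist() lets you choose a storage level; MEMORY_AND_DISK spills to disk under memory pressure
    val doubled = squares.map(_ * 2).persist(StorageLevel.MEMORY_AND_DISK)
    println(doubled.count())

    doubled.unpersist()
    spark.stop()
  }
}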