When working with a data streaming application, we may sometimes find that the incoming data is empty. To handle such situations, we can create an empty RDD and save it in another Read more…
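As a minimal sketch of the idea, the snippet below creates an empty RDD two common ways, via sparkContext.emptyRDD and via parallelize with an empty collection; the app name and master are illustrative assumptions, not values from the post.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object EmptyRDDExample {
  def main(args: Array[String]): Unit = {
    // Local session just for illustration; the app name is an arbitrary choice.
    val spark = SparkSession.builder()
      .appName("EmptyRDDExample")
      .master("local[*]")
      .getOrCreate()

    // emptyRDD creates an RDD with no partitions and no elements.
    val emptyRdd: RDD[String] = spark.sparkContext.emptyRDD[String]

    // parallelize over an empty Seq also yields an empty RDD (with default partitions).
    val emptyRdd2: RDD[String] = spark.sparkContext.parallelize(Seq.empty[String])

    println(s"isEmpty = ${emptyRdd.isEmpty()}") // true

    spark.stop()
  }
}
```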


A Spark RDD can be created in several ways using the Scala and PySpark languages. For example, it can be created using sparkContext.parallelize(), from a text file, from another RDD, from a DataFrame, or from a Dataset. Resilient Distributed Datasets (RDD) is the fundamental data structure of Read more…
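The sketch below illustrates those creation paths in Scala; the file path and sample data are placeholder assumptions.

```scala
import org.apache.spark.sql.SparkSession

object CreateRDDExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CreateRDDExample")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // 1. From a local collection
    val rddFromSeq = sc.parallelize(Seq(1, 2, 3, 4, 5))

    // 2. From a text file (placeholder path)
    val rddFromFile = sc.textFile("/tmp/input.txt")

    // 3. From another RDD, via a transformation
    val rddFromRdd = rddFromSeq.map(_ * 2)

    // 4. From a DataFrame/Dataset, via .rdd
    import spark.implicits._
    val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")
    val rddFromDf = df.rdd

    println(rddFromRdd.collect().mkString(", "))
    spark.stop()
  }
}
```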


In this post, we are going to discuss reading text and CSV files. Spark core provides the textFile() & wholeTextFiles() methods in the SparkContext class, which are used to read single or multiple text or CSV files into a single Spark RDD. Using this method we can also read Read more…
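A short sketch of both methods follows; the input paths are assumptions for illustration only.

```scala
import org.apache.spark.sql.SparkSession

object ReadTextFilesExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ReadTextFilesExample")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // textFile returns an RDD[String] with one element per line; the path may be
    // a single file, a directory, or a comma-separated list of files (placeholders here).
    val lines = sc.textFile("/tmp/data/file1.csv,/tmp/data/file2.csv")

    // wholeTextFiles returns an RDD[(String, String)] of (filePath, fileContent) pairs.
    val files = sc.wholeTextFiles("/tmp/data/")

    println(s"line count = ${lines.count()}")
    files.collect().foreach { case (path, content) =>
      println(s"$path has ${content.length} characters")
    }

    spark.stop()
  }
}
```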


In this post we will learn about sparkContext.parallelize. Let’s see how to create a Spark RDD using sparkContext.parallelize. Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark; it is an immutable distributed collection of objects. Each dataset in an RDD is divided into logical Read more…
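A minimal sketch of parallelize, assuming a local master and a small numeric range as the sample data:

```scala
import org.apache.spark.sql.SparkSession

object ParallelizeExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ParallelizeExample")
      .master("local[*]")
      .getOrCreate()

    // parallelize distributes a local collection across the given number of partitions (here 4).
    val rdd = spark.sparkContext.parallelize(1 to 10, numSlices = 4)

    println(s"partitions = ${rdd.getNumPartitions}") // 4
    println(s"sum        = ${rdd.reduce(_ + _)}")    // 55

    spark.stop()
  }
}
```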


This post explains how to set up and run Spark jobs on a Hadoop Yarn cluster, and walks through a Spark example on Yarn. Prerequisites: If you don’t have Hadoop & Yarn installed, please install and set up a Hadoop cluster and set up Yarn on Read more…
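As a rough sketch of such a job, the application below counts words and is meant to be packaged into a jar and submitted to Yarn (for example with spark-submit --master yarn); the HDFS input path is a placeholder assumption.

```scala
import org.apache.spark.sql.SparkSession

object YarnWordCount {
  def main(args: Array[String]): Unit = {
    // The master is normally supplied at submit time (e.g. --master yarn),
    // so it is not hard-coded here.
    val spark = SparkSession.builder()
      .appName("YarnWordCount")
      .getOrCreate()

    val counts = spark.sparkContext
      .textFile("hdfs:///tmp/input.txt") // placeholder HDFS path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```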


This post explains how to set up the Yarn master on a Hadoop 3.1 cluster and run a MapReduce program. Before you proceed with this document, please make sure you have a Hadoop 3.1 cluster up and running. If you do not have a setup, Read more…


This document explains, step by step, how to install Apache Hadoop (version 3.1.1) as a cluster with one master node (namenode) and 3 worker nodes (datanodes) on Ubuntu. Below are the 4 nodes and their IP addresses I will be referring to here: 192.168.1.100    Read more…


In this post, we are going to explain SparkSession. Since Spark 2.0, SparkSession has become the entry point to Spark programming with RDD, DataFrame, and Dataset. Prior to 2.0, SparkContext used to be the entry point. Here, I will Read more…
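A brief sketch of creating a SparkSession and reaching the RDD, DataFrame, and Dataset APIs from it; the app name, master, and sample data are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

object SparkSessionExample {
  def main(args: Array[String]): Unit = {
    // Since Spark 2.0, SparkSession is the single entry point; it wraps SparkContext.
    val spark = SparkSession.builder()
      .appName("SparkSessionExample")
      .master("local[*]")
      .getOrCreate()

    // RDD, Dataset, and DataFrame APIs are all reachable from the same session.
    val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3))
    import spark.implicits._
    val ds = Seq("a", "b", "c").toDS()
    val df = ds.toDF("letter")

    println(s"rdd count = ${rdd.count()}, df count = ${df.count()}")
    spark.stop()
  }
}
```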


Spark’s default language is Scala. SparkContext (JavaSparkContext for Java) is the entry point to Spark and PySpark for programming with RDDs and for connecting to a Spark cluster. In this article, you will learn how to create it using examples. What Read more…
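A minimal Scala sketch of creating a SparkContext directly via SparkConf; the app name and local master are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextExample {
  def main(args: Array[String]): Unit = {
    // SparkConf holds the application name and the master URL to connect to.
    val conf = new SparkConf()
      .setAppName("SparkContextExample")
      .setMaster("local[*]")

    // SparkContext is the classic entry point for the RDD API.
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(Seq(1, 2, 3, 4))
    println(s"sum = ${rdd.reduce(_ + _)}")

    sc.stop()
  }
}
```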