Spark

Apache Spark is a fast, general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, and R, plus an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.
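
To make this concrete, here is a minimal, self-contained sketch (assuming Spark 3.x with the spark-sql module on the classpath; the app name and sample rows are made up for illustration). It starts a local session and runs a Spark SQL query over a small in-memory DataFrame:

    import org.apache.spark.sql.SparkSession

    object SparkQuickstart {
      def main(args: Array[String]): Unit = {
        // Build a local SparkSession; "local[*]" uses every core on this machine.
        val spark = SparkSession.builder()
          .appName("SparkQuickstart")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // A tiny in-memory DataFrame (made-up rows), queried through Spark SQL.
        val df = Seq(("alpha", 1), ("beta", 2), ("gamma", 3)).toDF("name", "value")
        df.createOrReplaceTempView("items")
        spark.sql("SELECT name FROM items WHERE value > 1").show()

        spark.stop()
      }
    }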

Spark Setup

  1. spark-download-and-test/
  2. Spark-step-by-step-setup-on-hadoop-yarn-cluster/

Spark Initial Understanding

  1. What-is-sparkcontext-explained-with-code
  2. Sparksession-explained-with-examples
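
In short: SparkContext was Spark's original entry point, and since Spark 2.0 SparkSession wraps it as the unified one. A small sketch of how the two relate (the app name is illustrative; in spark-shell a session already exists and getOrCreate() simply returns it):

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SparkSession

    // SparkSession is the unified entry point for DataFrames, Datasets, and SQL.
    val spark: SparkSession = SparkSession.builder()
      .appName("EntryPoints")   // illustrative name
      .master("local[*]")       // run locally on all cores
      .getOrCreate()

    // The underlying SparkContext is still reachable for RDD-level work.
    val sc: SparkContext = spark.sparkContext
    println(s"app=${sc.appName}, spark=${spark.version}")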

Spark RDD Understanding

  1. Creating-a-spark-rdd-using-parallelize-method
  2. Spark-can-we-read-multiple-text-files-into-single-rdd
  3. Different-ways-to-create-spark-rdd
  4. How-to-create-an-empty-rdd
  5. Spark-rdd-transformations-with-examples
  6. Spark-rdd-actions-with-examples
  7. Spark-pair-rdd-functions-explanation
  8. Repartition-vs-coalesce-explanation
  9. Spark-shuffle-partitions
  10. Spark-persistence-storage-levels
  11. Spark-rdd-cache-and-persist-with-example
  12. Spark-broadcast-variables
  13. Spark-accumulators
  14. Convert-spark-rdd-to-dataframe-dataset
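
As a compact tour of the RDD topics above, the following spark-shell style sketch assumes the spark and sc values the shell predefines; the word data is made up:

    import org.apache.spark.storage.StorageLevel
    import spark.implicits._  // spark: the SparkSession provided by spark-shell

    // 1. Create RDDs: parallelize a local collection, plus an empty RDD.
    val words = sc.parallelize(Seq("spark", "rdd", "spark", "action"))
    val empty = sc.emptyRDD[String]

    // 2. Transformations are lazy; nothing runs until an action is called.
    val pairs  = words.map(w => (w, 1))   // a pair RDD of (word, 1)
    val counts = pairs.reduceByKey(_ + _) // a pair-RDD function

    // 3. Persist before reusing an RDD across several actions.
    counts.persist(StorageLevel.MEMORY_ONLY)
    println(counts.collect().mkString(", ")) // action: runs the job
    println(counts.count())                  // action: reuses the cached data

    // 4. coalesce() merges partitions without a shuffle; repartition() shuffles.
    val merged = counts.coalesce(1)

    // 5. Shared variables: a read-only broadcast and a write-only accumulator.
    val stop = sc.broadcast(Set("rdd"))
    val kept = sc.longAccumulator("kept")
    words.filter(w => !stop.value.contains(w)).foreach(_ => kept.add(1))
    println(kept.value)

    // 6. Convert an RDD to a DataFrame (needs spark.implicits._).
    counts.toDF("word", "count").show()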

Spark Data Source API
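
The Data Source API reads and writes external data through the DataFrameReader and DataFrameWriter interfaces (spark.read / df.write). A minimal sketch, again assuming a spark-shell session; the file paths are placeholders:

    // Read a CSV file into a DataFrame; both options are optional tuning knobs.
    val people = spark.read
      .option("header", "true")      // first line holds the column names
      .option("inferSchema", "true") // sample the file to infer column types
      .csv("/tmp/people.csv")        // placeholder path

    people.printSchema()

    // Write the same data back out as Parquet, overwriting any previous run.
    people.write.mode("overwrite").parquet("/tmp/people.parquet")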

References

https://spark.apache.org/docs/latest/index.html