Spark Persistence Storage Levels


Persistence is the continued or prolonged existence of something. In Spark, persisting means keeping a computed RDD, DataFrame, or Dataset in memory and/or on disk so it can be reused across actions instead of being recomputed.

All the persistence storage levels Spark supports (via the persist() method) are available in the org.apache.spark.storage.StorageLevel class. A storage level specifies how and where to persist or cache a Spark RDD, DataFrame, or Dataset.

All these storage levels are passed as an argument to the persist() method of a Spark/PySpark RDD, DataFrame, or Dataset.

For example, here is a minimal Scala sketch (the session setup, data, and names are illustrative, not part of any particular application):
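```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Illustrative session; any existing SparkSession works the same way
val spark = SparkSession.builder()
  .appName("StorageLevelExample")
  .master("local[*]")
  .getOrCreate()

val df = spark.range(1, 1000000).toDF("id")

// Pass the desired storage level as an argument to persist()
df.persist(StorageLevel.MEMORY_ONLY_SER)

df.count()  // the first action materializes the persisted data
```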

Here, I will describe all storage levels available in Spark.

Memory only Storage level

StorageLevel.MEMORY_ONLY is the default behavior of the RDD cache() method and stores the RDD or DataFrame as deserialized objects in JVM memory. When there is not enough memory available, some partitions are not saved, and they are re-computed as and when required.

This level takes more memory. For a DataFrame, unlike an RDD, it can also be slower than the MEMORY_AND_DISK level, since it recomputes the unsaved partitions, and recomputing the in-memory columnar representation of the underlying table is expensive.
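As a quick sketch (reusing the SparkSession from the example above), cache() on an RDD is shorthand for persist(StorageLevel.MEMORY_ONLY):

```scala
val rdd = spark.sparkContext.parallelize(1 to 1000)

// cache() on an RDD is shorthand for persist(StorageLevel.MEMORY_ONLY)
rdd.cache()

// getStorageLevel reports the level in effect
println(rdd.getStorageLevel)  // StorageLevel(memory, deserialized, 1 replicas)
```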

Serialized in Memory

StorageLevel.MEMORY_ONLY_SER is the same as MEMORY_ONLY, the difference being that it stores the RDD as serialized objects in JVM memory. It takes less memory (it is more space-efficient) than MEMORY_ONLY, since it saves objects in serialized form, but costs a few additional CPU cycles to deserialize them.

Memory only and Replicate

StorageLevel.MEMORY_ONLY_2 is the same as the MEMORY_ONLY storage level but replicates each partition to two cluster nodes.

Serialized in Memory and Replicate

StorageLevel.MEMORY_ONLY_SER_2 is the same as the MEMORY_ONLY_SER storage level but replicates each partition to two cluster nodes.

Memory and Disk Storage level

StorageLevel.MEMORY_AND_DISK is the default behavior of DataFrame and Dataset caching. At this storage level, the DataFrame is stored in JVM memory as deserialized objects. When the required storage is greater than the available memory, the excess partitions are stored on disk and read back from disk when needed. This is slower because disk I/O is involved.
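A short sketch (reusing the spark session from the first example; the DataFrame name and contents are illustrative) showing that a no-argument persist() on a DataFrame defaults to this level:

```scala
val salesDf = spark.range(1, 1000).toDF("amount")

// persist() with no argument defaults to MEMORY_AND_DISK for DataFrames/Datasets
salesDf.persist()

// storageLevel reports the level currently in effect
println(salesDf.storageLevel)  // StorageLevel(disk, memory, deserialized, 1 replicas)
```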

Serialized in Memory and Disk

StorageLevel.MEMORY_AND_DISK_SER is the same as the MEMORY_AND_DISK storage level, the difference being that it serializes the DataFrame objects in memory, and on disk when space is not available.

Memory, Disk and Replicate

StorageLevel.MEMORY_AND_DISK_2 is the same as the MEMORY_AND_DISK storage level but replicates each partition to two cluster nodes.

Serialized in Memory, Disk and Replicate

StorageLevel.MEMORY_AND_DISK_SER_2 is the same as the MEMORY_AND_DISK_SER storage level but replicates each partition to two cluster nodes.
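The _2 levels are simply predefined StorageLevel instances with a replication factor of 2. As a sketch, an equivalent level can also be built with the StorageLevel factory (someDf stands in for any DataFrame):

```scala
import org.apache.spark.storage.StorageLevel

// Equivalent to MEMORY_AND_DISK_SER_2: serialized, memory + disk, two replicas
val twoReplicas = StorageLevel(
  useDisk = true,
  useMemory = true,
  useOffHeap = false,
  deserialized = false,
  replication = 2)

someDf.persist(twoReplicas)
```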

Disk only storage level

At the StorageLevel.DISK_ONLY storage level, the DataFrame is stored only on disk, and the computation time is high because disk I/O is involved.
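A short sketch (the path and names are illustrative) of persisting to disk only and releasing the storage with unpersist() once it is no longer needed:

```scala
import org.apache.spark.storage.StorageLevel

val logsDf = spark.read.parquet("/tmp/logs")  // illustrative input

// DISK_ONLY trades computation time for minimal memory pressure
logsDf.persist(StorageLevel.DISK_ONLY)
logsDf.count()  // the first action materializes the persisted data on disk

// Release the storage once the cached data is no longer needed
logsDf.unpersist()
```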

Disk only and Replicate

StorageLevel.DISK_ONLY_2 is the same as the DISK_ONLY storage level but replicates each partition to two cluster nodes.

When to use what?

Below is a table representation of the storage levels. Go through the impact on space, CPU, and performance, and choose the one that best fits your workload.
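This summary follows the standard characterization from the Spark documentation:

Storage Level        | Space used | CPU time | In memory | On disk | Serialized
---------------------|------------|----------|-----------|---------|-----------
MEMORY_ONLY          | High       | Low      | Yes       | No      | No
MEMORY_ONLY_SER      | Low        | High     | Yes       | No      | Yes
MEMORY_AND_DISK      | High       | Medium   | Some      | Some    | Some
MEMORY_AND_DISK_SER  | Low        | High     | Some      | Some    | Yes
DISK_ONLY            | Low        | High     | No        | Yes     | Yes

The _2 variants behave like their base levels but keep a copy of each partition on two cluster nodes.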
