Caching

Although the contents of RDDs are transient by default, Spark provides a mechanism for persisting the data in an RDD. The first time an action requires computing such an RDD’s contents, they are stored in memory or on disk across the cluster. The next time an action depends on the RDD, it need not be recomputed from its dependencies; its data is returned directly from the cached partitions.
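The behavior described above can be sketched in a few lines of Scala. This is a minimal illustration rather than a recommended setup: the local master, the application name, and the RDD called squares are assumptions made for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheSketch {
  def main(args: Array[String]): Unit = {
    // Local master and app name are illustrative assumptions.
    val conf = new SparkConf().setAppName("cache-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // A hypothetical RDD built from a transformation worth not repeating.
    val squares = sc.parallelize(1 to 1000000).map(x => x.toLong * x)

    // Mark the RDD for caching. This is lazy: nothing is computed yet.
    squares.cache()

    // The first action computes the partitions and stores them.
    val total = squares.reduce(_ + _)

    // A later action can read the stored partitions instead of
    // re-running the map, provided they are still held in the cache.
    val count = squares.count()

    println(s"sum=$total count=$count")
    sc.stop()
  }
}
```

Note that cache() only marks the RDD; the data is materialized by the first action that touches it.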

Spark defines a few different mechanisms, or StorageLevel values, for persisting RDDs.

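StorageLevel | Description | Advantage | Drawback
MEMORY_ONLY | Partitions kept in memory as deserialized objects | Fastest access; no deserialization cost | Largest memory footprint; partitions that do not fit are recomputed when needed
MEMORY_ONLY_SER | Partitions kept in memory as serialized byte buffers | More compact; less garbage-collection pressure | Extra CPU to deserialize on each read; partitions that do not fit are recomputed when needed
MEMORY_AND_DISK | Deserialized objects in memory; partitions that do not fit are spilled to disk | Avoids recomputing partitions that do not fit in memory | Reading spilled partitions from disk is slower than reading from memory
MEMORY_AND_DISK_SER | Serialized byte buffers in memory; partitions that do not fit are spilled to disk | Compact in memory and avoids recomputation | Pays both the deserialization cost and, for spilled partitions, disk I/O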
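
A storage level is chosen by passing a StorageLevel value to persist. The following is a minimal sketch, assuming a local master and a small hypothetical pair RDD named words; the choice of MEMORY_AND_DISK_SER here is purely illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  def main(args: Array[String]): Unit = {
    // Local master and app name are illustrative assumptions.
    val conf = new SparkConf().setAppName("persist-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // A hypothetical pair RDD used only for this example.
    val words = sc.parallelize(Seq("a", "b", "a", "c")).map(w => (w, 1))

    // Request serialized in-memory storage that spills to disk.
    words.persist(StorageLevel.MEMORY_AND_DISK_SER)

    // The first action that depends on words materializes and stores it.
    val counts = words.reduceByKey(_ + _).collect()
    counts.foreach(println)

    // Inspect the level that was assigned to the RDD.
    println(words.getStorageLevel.description)

    // Release the stored partitions once they are no longer needed.
    words.unpersist()
    sc.stop()
  }
}
```

For RDDs, cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), so an explicit persist call is only needed when a different level is wanted.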