Press "Enter" to skip to content

Basic Spark Terminology

Denny Lee and Jules Damji explain some of the key termsĀ and concepts around Apache Spark:

At the core of Apache Spark is the notion of data abstraction as distributed collection of objects. This data abstraction, called Resilient Distributed Dataset (RDD), allows you to write programs that transform these distributed datasets.

RDDs are immutable distributed collection of elements of your data that can be stored in memory or disk across a cluster of machines. The data is partitioned across machines in your cluster that can be operated in parallel with a low-level API that offers transformations and actions. RDDs are fault tolerant as they track data lineage information to rebuild lost data automatically on failure.

Some of these concepts are new to Spark 2.0, but all are worth learning.