Resilient Distributed Datasets

Kevin Feasel



Spark is built around the concept of Resilient Distributed Datasets.  If you have not read Matei Zaharia, et al’s paper on the topic, I highly recommend it:

Spark exposes RDDs through a language-integrated API similar to DryadLINQ [31] and FlumeJava [8], where each dataset is represented as an object and transformations are invoked using methods on these objects.

Programmers start by defining one or more RDDs through transformations on data in stable storage (e.g., map and filter). They can then use these RDDs in actions, which are operations that return a value to the application or export data to a storage system. Examples of actions include count (which returns the number of elements in the dataset), collect (which returns the elements themselves), and save (which outputs the dataset to a storage system). Like DryadLINQ, Spark computes RDDs lazily the first time they are used in an action, so that it can pipeline transformations.

In addition, programmers can call a persist method to indicate which RDDs they want to reuse in future operations. Spark keeps persistent RDDs in memory by default, but it can spill them to disk if there is not enough RAM. Users can also request other persistence strategies, such as storing the RDD only on disk or replicating it across machines, through flags to persist. Finally, users can set a persistence priority on each RDD to specify which in-memory data should spill to disk first.

The link also has a video of their initial presentation at NSDI.  Check it out.

Related Posts

Data Wrangling At Scale

John Mount has a short article showing off the cdata package: Suppose we needed to un-pivot this data into a row oriented representation. Often big data transform steps can achieve a much higher degree of parallelization with “tall data”. With the cdata package this transform is easy and performant, as we show below. Read the whole thing.

Read More

Connect(); Announcements, Including Azure Databricks

James Serra has a wrapup of Microsoft Connect(); announcements around the data platform space: Microsoft Connect(); is a developer event from Nov 15-17, where plenty of announcements are made.  Here is a summary of the data platform related announcements: Azure Databricks: In preview, this is a fast, easy, and collaborative Apache Spark based analytics platform optimized for Azure. […]

Read More


May 2016
« Apr Jun »