Spark Data Frame Checkpoints

Kevin Feasel



Jean Georges Perrin introduces checkpoints on Spark data frames:

Basically, I use a checkpoint if I want to freeze the content of my data frame before I do something else. It can be in the scenario of iterative algorithms (as mentioned in the Javadoc) but also in recursive algorithms or simply branching out a data frame to run different kinds of analytics on both.

Spark has been offering checkpoints on streaming since earlier versions (at least v1.2.0), but checkpoints on data frames are a different beast.

This could also be very useful for a quality control flow:  perform operation A, and if it doesn’t generate good enough results, roll back and try operation B.

Related Posts

Sharing R Notebooks

Hanyu Cui and Hossein Falaki show how to share a notebook using RMarkdown: RMarkdown is the dynamic document format RStudio uses. It is normal Markdown plus embedded R (or any other language) code that can be executed to produce outputs, including tables and charts, within the document. Hence, after changing your R code, you can just rerun all […]

Read More

How Qubole Optimizes Apache Spark Clusters

Mikhail Stolpner gives us some tips on how to optimize Apache Spark clusters: There are four major resources: memory, compute (CPU), disk, and network. Memory and compute are by far the most expensive. Understanding how much compute and memory your application requires is crucial for optimization. You can configure how much memory and how many […]

Read More


February 2017
« Jan Mar »