Data Frame Partial Caching

Kevin Feasel



Arijit Tarafdar shows how to capture partitions of a data frame in Spark, either horizontally or vertically:

In many Spark applications, performance benefit is obtained from caching the data if reused several times in the applications instead of reading them each time from persistent storage. However, there can be situations when the entire data cannot be cached in the cluster due to resource constraint in the cluster and/or the driver. In this blog we describe two schemes that can be used to partially cache the data by vertical and/or horizontal partitioning of the Distributed Data Frame (DDF) representing the data. Note that these schemes are application specific and are beneficial only if the cached part of the data is used multiple times in consecutive transformations or actions.

In the notebook we declare a Student case class with name, subject, major, school and year as members. The application is required to find out the number of students by name, subject, major, school and year.

Partitioning is an interesting idea for trying to speed up Spark performance by keeping everything in memory even when your entire data set is a bit too large.

Related Posts

Leveraging Hive In Pyspark

Fisseha Berhane shows how to use Spark to connect Python to Hive: If we are using earlier Spark versions, we have to use HiveContext which is variant of Spark SQL that integrates with data stored in Hive. Even when we do not have an existing Hive deployment, we can still enable Hive support. In this […]

Read More

Unit Testing Spark Streaming DStreams

Anuj Saxena shows how to create unit tests for DStreams in Spark Streaming: The method ‘ testOperation ‘ takes the output of the operation performed on the ‘inputPair’ and check whether it is equal to the ‘outputPair’ and just like this, we can test our business logic. This short snippet lets you test your business logic without […]

Read More


July 2016
« Jun Aug »