In many Spark applications, a performance benefit is obtained by caching data that is reused several times, instead of reading it from persistent storage on each use. However, there are situations when the entire data set cannot be cached in the cluster because of resource constraints in the cluster and/or the driver. In this blog we describe two schemes that can be used to partially cache the data by vertical and/or horizontal partitioning of the Distributed Data Frame (DDF) representing the data. Note that these schemes are application specific and are beneficial only if the cached part of the data is used multiple times in consecutive transformations or actions.
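The two partitioning schemes can be sketched as follows; this is a minimal illustration, and the source path and column names are hypothetical placeholders for your own data:

```scala
import org.apache.spark.sql.SparkSession

object PartialCacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partial-cache").getOrCreate()

    // Hypothetical source; substitute your own data set.
    val students = spark.read.parquet("/data/students")

    // Vertical partitioning: cache only the columns that later
    // transformations actually need, dropping the rest.
    val narrow = students.select("name", "subject", "year").cache()

    // Horizontal partitioning: cache only the subset of rows that is
    // accessed repeatedly; the remaining rows stay in persistent storage.
    val recent = students.filter(students("year") >= 2015).cache()

    // Reusing the cached DDFs in multiple actions now hits memory
    // instead of re-reading from storage.
    narrow.groupBy("subject").count().show()
    recent.groupBy("name").count().show()

    spark.stop()
  }
}
```

Either scheme (or both together) shrinks the cached footprint so that the frequently reused slice of the data fits in the available memory.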
In the notebook we declare a `Student` case class with `name`, `subject`, `major`, `school`, and `year` as members. The application is required to find out the number of students by `name`, `subject`, `major`, `school`, and `year`.
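A minimal sketch of the case class and the per-member counts follows; the sample rows and object names are illustrative assumptions, not the notebook's actual data:

```scala
import org.apache.spark.sql.SparkSession

case class Student(name: String, subject: String, major: String,
                   school: String, year: Int)

object StudentCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("student-counts").getOrCreate()
    import spark.implicits._

    // Tiny in-memory sample; the real notebook would load a full data set.
    val students = Seq(
      Student("alice", "math", "science", "north", 2015),
      Student("bob",   "math", "arts",    "south", 2016)
    ).toDS()

    // Count students grouped by each member in turn.
    for (member <- Seq("name", "subject", "major", "school", "year"))
      students.groupBy(member).count().show()

    spark.stop()
  }
}
```

Because each `groupBy` re-scans the same data set, this access pattern is exactly the kind that benefits from caching the reused columns or rows.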
Partitioning is an interesting idea for speeding up Spark by keeping the frequently used part of the data in memory even when the entire data set is too large to cache.