Aggregating Clickstream Data

Kevin Feasel



Ofer Habushi solves a clickstream aggregation problem using Spark:

At this point, an interesting question came up for us: How can we keep the data partitioned and sorted? 

That’s a challenge. When we sort the entire data set, we shuffle in order to get sorted RDDs and create new partitions, which are different than the partitions we got from Step 1. And what if we do the opposite?

Sort first by creation time and then partition the data? We’ll encounter the same problem. The re-partitioning will cause a shuffle and we’ll lose the sort. How can we avoid that?

Partition→sort = losing the original partitioning

Sort→partition = losing the original sort

There’s a solution for that in Spark. In order to partition and sort in Spark, you can use repartitionAndSortWithinPartitions. 

This is an interesting solution to an ever-more-common problem.
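To make the semantics concrete, here is a minimal single-machine sketch in plain Python of what a partition-then-sort-within-partitions step does. It is a toy model, not Spark itself and not Ofer's code: the helper name, the `by_user` partition function, and the sample click events are all hypothetical. The idea is to key each record by (user id, creation time), choose the partition from the user id alone, and sort by the full key inside each partition, so every user's clicks land together and in time order.

```python
from collections import defaultdict


def repartition_and_sort_within_partitions(records, num_partitions, partition_func):
    """Toy model: bucket (key, value) pairs by partition_func(key),
    then sort each bucket by key. Spark does the sort as part of the
    shuffle; here the two steps are simply done one after the other."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[partition_func(key) % num_partitions].append((key, value))
    # Return one sorted list per partition index, including empty partitions.
    return [sorted(buckets[i], key=lambda kv: kv[0]) for i in range(num_partitions)]


# Hypothetical click events keyed by (user_id, creation_time).
clicks = [
    (("u2", 30), "checkout"),
    (("u1", 20), "search"),
    (("u2", 10), "home"),
    (("u1", 5), "home"),
    (("u3", 7), "home"),
]


def by_user(key):
    """Partition on the user id only, ignoring creation time.
    Uses the numeric suffix as a deterministic stand-in for a hash."""
    user_id, _creation_time = key
    return int(user_id[1:])


partitions = repartition_and_sort_within_partitions(clicks, 2, by_user)
for index, partition in enumerate(partitions):
    print(index, partition)
```

In PySpark, the corresponding call on a pair RDD is `rdd.repartitionAndSortWithinPartitions(numPartitions, partitionFunc)`. The point of using it instead of a separate repartition followed by a sort is that the sorting happens as part of the single shuffle, so there is no second shuffle to undo the partitioning.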



June 2016