Aggregating Clickstream Data

Kevin Feasel

2016-06-10

Spark

Ofer Habushi solves a clickstream aggregation problem using Spark:

At this point, an interesting question came up for us: How can we keep the data partitioned and sorted? 

That’s a challenge. When we sort the entire data set, we shuffle in order to get sorted RDDs and create new partitions, which are different from the partitions we got from Step 1.

And what if we do the opposite and sort first by creation time, then partition the data? We’ll encounter the same problem: the repartitioning will cause a shuffle and we’ll lose the sort. How can we avoid that?

Partition→sort = losing the original partitioning

Sort→partition = losing the original sort

There’s a solution for that in Spark. In order to partition and sort in Spark, you can use repartitionAndSortWithinPartitions. 
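To make the idea concrete, here is a minimal plain-Python sketch of what partition-and-sort-within-partitions means. It is a model of the semantics only, not Spark code and not Ofer Habushi's implementation: the function name, the sample click records, and the `by_user` partitioner are all hypothetical. In PySpark the corresponding RDD method is `repartitionAndSortWithinPartitions(numPartitions, partitionFunc, ascending, keyfunc)`, and Spark performs the sort as part of the shuffle rather than as a separate pass, which this sketch does not attempt to reproduce.

```python
from collections import defaultdict


def repartition_and_sort_within_partitions(records, num_partitions, partition_func):
    """Plain-Python model (not Spark): route each (key, value) pair to a
    partition chosen from its key, then sort every partition by key."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[partition_func(key) % num_partitions].append((key, value))
    # Each partition is sorted independently; there is no global ordering.
    return [sorted(partitions[i], key=lambda kv: kv[0]) for i in range(num_partitions)]


# Hypothetical clickstream events keyed by (user_id, creation_time).
clicks = [
    (("u2", 30), "checkout"),
    (("u1", 20), "search"),
    (("u2", 10), "home"),
    (("u1", 5), "home"),
    (("u3", 7), "home"),
]


def by_user(key):
    """Assumed partitioner: use only the user id, ignoring creation time,
    so all of a user's clicks land in the same partition."""
    user_id, _creation_time = key
    return int(user_id[1:])


result = repartition_and_sort_within_partitions(clicks, 2, by_user)
for index, partition in enumerate(result):
    print(index, partition)
```

The design point is that the partitioner looks at only part of the key (the user) while the sort uses the whole key (user, then creation time), so each partition ends up holding complete, time-ordered click sequences per user.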

This is an interesting solution to an ever-more-common problem.

