Analyzing Clickstream Data With Spark

Tony Cruz and Denny Lee analyze advertising data in Spark and predict click counts given certain input features:

Let’s look at a concrete example with the Click-Through Rate Prediction dataset of ad impressions and clicks from the data science website Kaggle.  The goal of this workflow is to create a machine learning model that, given a new ad impression, predicts whether or not there will be a click.

To build our advanced analytics workflow, let’s focus on the three main steps:

  • ETL

  • Data Exploration, for example, using SQL

  • Advanced Analytics / Machine Learning

The Databricks blog has a couple other examples, but this was the most interesting one for me.

Related Posts

Overriding Spark Dependencies

Landon Robinson shows how to override a Spark dependency located on the classpath: This doesn’t draw the line exactly where the method changed from private to public, but generally speaking:– gson-2.2.4.jar: the method is private, and therefore too old for use here– gson-2.6.1: the method is public, and works fine.– Somewhere between the two, the […]

Read More

Kafka and MirrorMaker

Renu Tewari describes what MirrorMaker does for Kafka today and what is coming with version 2: Apache Kafka has become an essential component of enterprise data pipelines and is used for tracking clickstream event data, collecting logs, gathering metrics, and being the enterprise data bus in a microservices based architectures. Kafka is essentially a highly […]

Read More

Categories

July 2018
MTWTFSS
« Jun Aug »
 1
2345678
9101112131415
16171819202122
23242526272829
3031