Analyzing Clickstream Data With Spark

Tony Cruz and Denny Lee analyze advertising data in Spark and predict click counts given certain input features:

Let’s look at a concrete example with the Click-Through Rate Prediction dataset of ad impressions and clicks from the data science website Kaggle.  The goal of this workflow is to create a machine learning model that, given a new ad impression, predicts whether or not there will be a click.

To build our advanced analytics workflow, let’s focus on the three main steps:

  • ETL

  • Data Exploration, for example, using SQL

  • Advanced Analytics / Machine Learning

The Databricks blog has a couple other examples, but this was the most interesting one for me.

Related Posts

Dealing With Multicollinearity With R

Chaitanya Sagar explains the concept of multicollinearity in linear regressions and how we can mitigate this issue in R: Perfect multicollinearity occurs when one independent variable is an exact linear combination of other variables. For example, you already have X and Y as independent variables and you add another variable, Z = a*X + b*Y, […]

Read More

When Cassandra Makes Sense

Anmol Sarna explains the pros and cons of using Apache Cassandra: But as we know nothing is perfect. So is the Cassandra Database. What I mean by this is that you cannot have a perfect package. If you wish for one brilliant feature then you might have to compromise on the other features. In today’s […]

Read More

Categories

July 2018
MTWTFSS
« Jun Aug »
 1
2345678
9101112131415
16171819202122
23242526272829
3031