Exploring Taxi Data

Kevin Feasel


Hadoop, R

David Smith ties together two of my favorite technologies in R and Hadoop to analyze New York City taxi data:

Debraj GuhaThakurta, Senior Data Scientist, and Shauheen Zahirazami, Senior Machine Learning Engineer at Microsoft, demonstrate some of these capabilities in their analysis of 170M taxi trips in New York City in 2013 (about 40 Gb). Their goal was to show the use of Microsoft R Server on an HDInsight Hadoop cluster, and to that end, they created machine learning models using distributed R functions to predict (1) whether a tip was given for a taxi ride (binary classification problem), and (2) the amount of tip given (regression problem). The analyses involved building and testing different kinds of predictive models. Debraj and Shauheen uploaded the NYC Taxi data to HDFS on Azure blob storage, provisioned an HDInsight Hadoop Cluster with 2 head nodes (D12), 4 worker nodes (D12), and 1 R-server node (D4), and installed R Studio Server on the HDInsight cluster to conveniently communicate with the cluster and drive the computations from R.

To predict the tip amount, Debraj and Shauheen used linear regression on the training set (75% of the full dataset, about 127M rows). Boosted Decision Trees were used to predict whether or not a tip was paid. On the held-out test data, both models did fairly well. The linear regression model was able to predict the actual tip amount with a correlation of 0.78 (see figure below). Also, the boosted decision tree performed well on the test data with an AUC of 0.98.

If you’re looking for a data set for exploration, this is certainly a good one.

Related Posts

Exploratory Data Analysis with inspectdf

Laura Ellis continues a dive into Exploratory Data Analysis, this time using the inspectdf package: I like this package because it’s got a lot of functionality and it’s incredibly straightforward to use. In short, it allows you to understand and visualize column types, sizes, values, value imbalance & distributions as well as correlations. Better yet, […]

Read More

MRAN Changes and a Survey

David Smith discusses potential changes to MRAN: As CRAN has grown and changes to packages have become more frequent, maintaining MRAN is an increasingly resource-intensive process. We’re contemplating changes, like changing the frequency of snapshots, or thinning the archive of snapshots that haven’t been used. But before we do that we’d  like to hear from […]

Read More


April 2016
« Mar May »