Exploring Taxi Data

Kevin Feasel


Hadoop, R

David Smith ties together two of my favorite technologies in R and Hadoop to analyze New York City taxi data:

Debraj GuhaThakurta, Senior Data Scientist, and Shauheen Zahirazami, Senior Machine Learning Engineer at Microsoft, demonstrate some of these capabilities in their analysis of 170M taxi trips in New York City in 2013 (about 40 Gb). Their goal was to show the use of Microsoft R Server on an HDInsight Hadoop cluster, and to that end, they created machine learning models using distributed R functions to predict (1) whether a tip was given for a taxi ride (binary classification problem), and (2) the amount of tip given (regression problem). The analyses involved building and testing different kinds of predictive models. Debraj and Shauheen uploaded the NYC Taxi data to HDFS on Azure blob storage, provisioned an HDInsight Hadoop Cluster with 2 head nodes (D12), 4 worker nodes (D12), and 1 R-server node (D4), and installed R Studio Server on the HDInsight cluster to conveniently communicate with the cluster and drive the computations from R.

To predict the tip amount, Debraj and Shauheen used linear regression on the training set (75% of the full dataset, about 127M rows). Boosted Decision Trees were used to predict whether or not a tip was paid. On the held-out test data, both models did fairly well. The linear regression model was able to predict the actual tip amount with a correlation of 0.78 (see figure below). Also, the boosted decision tree performed well on the test data with an AUC of 0.98.

If you’re looking for a data set for exploration, this is certainly a good one.

Related Posts

Erasure Coding In Hadoop

Guy Shilo explains erasure coding, a new feature in Hadoop 3: The benefits are, of course, space-saving, and for large files also improved performance (blocks striped across datanodes can be read in parallel, and less blocks are written because there is no x3 replication). The larger the file the more notable is the performance gain. […]

Read More

Converting CSV To ORC

Mark Litwintschik investigates whether Spark is faster at converting CSV files to ORC format than Hive or Presto: Spark, Hive and Presto are all very different code bases. Spark is made up of 500K lines of Scala, 110K lines of Java and 40K lines of Python. Presto is made up of 600K lines of Java. […]

Read More


April 2016
« Mar May »