Exploring Taxi Data

Kevin Feasel

2016-04-20

Hadoop, R

David Smith ties together two of my favorite technologies in R and Hadoop to analyze New York City taxi data:

Debraj GuhaThakurta, Senior Data Scientist, and Shauheen Zahirazami, Senior Machine Learning Engineer at Microsoft, demonstrate some of these capabilities in their analysis of 170M taxi trips in New York City in 2013 (about 40 Gb). Their goal was to show the use of Microsoft R Server on an HDInsight Hadoop cluster, and to that end, they created machine learning models using distributed R functions to predict (1) whether a tip was given for a taxi ride (binary classification problem), and (2) the amount of tip given (regression problem). The analyses involved building and testing different kinds of predictive models. Debraj and Shauheen uploaded the NYC Taxi data to HDFS on Azure blob storage, provisioned an HDInsight Hadoop Cluster with 2 head nodes (D12), 4 worker nodes (D12), and 1 R-server node (D4), and installed R Studio Server on the HDInsight cluster to conveniently communicate with the cluster and drive the computations from R.

To predict the tip amount, Debraj and Shauheen used linear regression on the training set (75% of the full dataset, about 127M rows). Boosted Decision Trees were used to predict whether or not a tip was paid. On the held-out test data, both models did fairly well. The linear regression model was able to predict the actual tip amount with a correlation of 0.78 (see figure below). Also, the boosted decision tree performed well on the test data with an AUC of 0.98.

If you’re looking for a data set for exploration, this is certainly a good one.

Related Posts

Using wrapr For A Consistent Pipe With ggplot2

John Mount shows how you can use the wrapr pipe to perform data processing and building a ggplot2 visual: Now we can run a single pipeline that combines data processing steps and ggplot plot construction. data.frame(x = 1:20) %.>% mutate(., y = cos(3*x)) %.>% ggplot(., aes(x = x, y = y)) %.>% geom_point() %.>% geom_line() %.>% ggtitle("piped ggplot2") Check […]

Read More

Using R To Hit Azure ML From Power BI

Leila Etaati shows how you can use R to hit an Azure ML endpoint to populate a data set in Power BI: You need to create a model in Azure ML Studio and create a web service for it. The traditional example in Predict a passenger on Titanic ship is going to survived or not? […]

Read More

Categories

April 2016
MTWTFSS
« Mar May »
 123
45678910
11121314151617
18192021222324
252627282930