Analyzing Taxi Data With Microsoft R Server

Kevin Feasel

2016-12-15

R, Spark

Ali Zaidi builds a Spark cluster to analyze 1.1 billion taxi cab rides using Microsoft R Server:

In a similar spirit to how sparklyr allowed us to reuse our functions from the dplyr package to manipulate Spark DataFrames, the RxSpark API allows a data scientist to develop code that can be deployed in a multitude of environments. This allows the developer to shift their focus from writing code that’s specific to a certain environment, and instead focus on the complex analysis of their data science problem. We call this flexibility Write Once, Deploy Anywhere, or WODA for the acronym lovers.

For a deeper dive into the RevoScaleR package, I recommend you take a look at the online course, Analyzing Big Data with Microsoft R Server. Much of this blogpost follows along the last section of the course, on deployment to Spark.

R isn’t just for small, one-off jobs anymore.

