Analyzing Taxi Data With Microsoft R Server

Kevin Feasel


R, Spark

Ali Zaidi builds a Spark cluster to analyze 1.1 billion taxi cab rides using Microsoft R Server:

In a similar spirit to how sparklyr allowed us to reuse our functions from the dplyr package to manipulate Spark DataFrames, the RxSpark API allows a data scientist to develop code that can be deployed in a multitude of environments. This allows the developer to shift their focus from writing code that’s specific to a certain environment, and instead focus on the complex analysis of their data science problem. We call this flexibility Write Once, Deploy Anywhere, or WODA for the acronym lovers.

For a deeper dive into the RevoScaleR package, I recommend you take a look at the online course, Analyzing Big Data with Microsoft R Server. Much of this blogpost follows along the last section of the course, on deployment to Spark.

R isn’t just for small, one-off jobs anymore.

Related Posts

Using cdata To Created Faceted Plots

Nina Zumel shows how to use the cdata package to create faceted ggplot2 plots: First, load the packages and data: library("ggplot2") library("cdata") iris <- data.frame(iris) Now define the data-shaping transform, or control table. The control table is basically a picture that sketches out the final data shape that I want. I want to specify the x and y columns of the plot […]

Read More

Looking At Databricks Cluster Pricing

Tristan Robinson takes a look at Azure Databricks pricing: The use of databricks for data engineering or data analytics workloads is becoming more prevalent as the platform grows, and has made its way into most of our recent modern data architecture proposals – whether that be PaaS warehouses, or data science platforms. To run any […]

Read More


December 2016
« Nov Jan »