Using Sparklyr To Analyze Flight Data

Aki Ariga uses sparklyr on Apache Spark 2.0 to analyze flight data living in S3:

Using sparklyr enables you to analyze big data on Amazon S3 with R smoothly. You can build a Spark cluster easily with Cloudera Director. sparklyr makes Spark as a backend database of dplyr. You can create tidy data from huge messy data, plot complex maps from this big data the same way as small data, and build a predictive model from big data with MLlib. I believe sparklyr helps all R users perform exploratory data analysis faster and easier on large-scale data. Let’s try!

You can see the Rmarkdown of this analysis on RPubs. With RStudio, you can share Rmarkdown easily on RPubs.

Sparklyr is an exciting technology for distributed data analysis: it lets R users keep writing familiar dplyr code while Spark does the heavy lifting.
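To make the "same way as small data" point concrete, here is a minimal sketch, not code from Aki's analysis. It assumes a flights table with `carrier` and `dep_delay` columns; the helper names `mean_delay_by_carrier` and `run_on_spark`, the toy data, and the local-mode connection are all made up for illustration. The original post reads its data from S3 on a cluster instead.

```r
library(dplyr)

# Hypothetical helper: mean departure delay per carrier.
# It uses only dplyr verbs, so it accepts either a local data frame
# or a Spark table reference created through sparklyr.
mean_delay_by_carrier <- function(flights_tbl) {
  flights_tbl %>%
    filter(!is.na(dep_delay)) %>%
    group_by(carrier) %>%
    summarise(mean_delay = mean(dep_delay), flights = n()) %>%
    arrange(desc(mean_delay))
}

# Small local stand-in data (invented values, not real flight records).
toy <- data.frame(
  carrier   = c("AA", "AA", "UA", "UA", "UA"),
  dep_delay = c(10, 20, 5, NA, 15)
)
local_result <- mean_delay_by_carrier(toy)

# The same helper against Spark. This needs a working Spark install,
# so it is wrapped in a function and not called here.
run_on_spark <- function() {
  sc <- sparklyr::spark_connect(master = "local")  # assumption: local mode
  on.exit(sparklyr::spark_disconnect(sc))
  flights_spark <- copy_to(sc, toy, "flights_toy", overwrite = TRUE)
  # dplyr translates the pipeline to Spark SQL; collect() pulls the
  # (small) aggregated result back into R.
  collect(mean_delay_by_carrier(flights_spark))
}
```

Because the aggregation runs inside Spark and only the summary comes back through `collect()`, the same pipeline should scale from a toy data frame to a large table without changes to the analysis code.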
