In this post, we will show you a visualization and build a predictive model of US flights with sparklyr. Flight visualization code is based on this article.
This post assumes you already have the following tables:
- Airlines data as airlines_bi_pq. It is assumed to be on S3, but you can put it into HDFS. See also the Ibis project.
- Airports data converted into Parquet format as airports_new_pq. See also the 2009 ASA Data Expo.
You should make these tables available through Apache Hive or Apache Impala (incubating) with Hue.
There’s some setup work to get this going, but getting a handle on sparklyr looks to be a good idea if you’re in the analytics space.
Maps are great for practicing data visualization. First of all, there’s a lot of data available on places like Wikipedia that you can map.
Moreover, creating maps typically requires several essential skills in combination. Specifically, you commonly need to be able to retrieve the data (e.g., scrape it), mold it into shape, perform a join, and visualize it. Because creating maps requires several skills from data manipulation and data visualization, creating them will be great practice for you.
And if that’s not enough, a good map just looks great. They’re visually compelling.
With that in mind, I want to walk you through the logic of building one step by step.
Read on for a step by step process.
Radiohead is known for having some fairly maudlin songs, but of all of their tracks, which is the most depressing? Data scientist and R enthusiast Charlie Thompson ranked all of their tracks according to a “gloom index”, and created the following chart of gloominess for each of the band’s nine studio albums. (Click for the interactive version, created with the highcharter package for R, which allows you to explore individual tracks.)
Do click through for Charlie’s explanation, including where the numbers come from.
In this module you will learn how to use the Waffle Chart Power BI Custom Visual. The Waffle Chart visual is most useful for presenting a percentage of data. This chart is a great option to choose over other visuals like Pie Charts, which are not great at showing proportions of data.
Waffle charts are infographic-friendly visuals; they’re easy to read and as long as you don’t have too many categories, easy to compare.
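To make the idea concrete, here is a minimal sketch of the arithmetic behind a generic waffle chart: turning category values into whole cells on a 10x10 grid. This is not how the Power BI custom visual is implemented. The function names (waffle_cells, render_waffle) and the sample data are hypothetical, and the largest-remainder rounding rule is one assumption among several reasonable choices.

```python
def waffle_cells(values, total_cells=100):
    """Allocate grid cells to categories with the largest-remainder method.

    `values` maps a category name to a non-negative number. The returned
    counts always sum to `total_cells`.
    """
    total = sum(values.values())
    if total <= 0:
        raise ValueError("values must sum to a positive number")

    cells = {}
    remainders = {}
    for name, value in values.items():
        # Whole cells this category has earned, plus what is left over.
        quotient, remainder = divmod(value * total_cells, total)
        cells[name] = int(quotient)
        remainders[name] = remainder

    # Hand the unassigned cells to the categories with the largest remainders.
    leftover = total_cells - sum(cells.values())
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        cells[name] += 1
    return cells


def render_waffle(cells, columns=10):
    """Draw the grid as text, using the first letter of each category."""
    symbols = "".join(str(name)[0] * count for name, count in cells.items())
    rows = [symbols[i:i + columns] for i in range(0, len(symbols), columns)]
    return "\n".join(rows)


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sample = {"Bikes": 50, "Clothing": 30, "Accessories": 20}
    print(render_waffle(waffle_cells(sample)))
```

Rounding each share independently can leave the grid a cell short or a cell over, which is why the sketch assigns the leftover cells explicitly.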
This visual is a mixture between a 100% stacked column chart and a 100% stacked bar chart.
The width of a column is proportional to the total value of the column.
With a relatively small number of groups for both columns and rows, this is a good way of getting a feel for relative weights across two dimensions.
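The layout described above comes down to two proportions: each column's width is its share of the grand total, and each segment's height is its share of its own column. The following is a rough sketch of that calculation, not the visual's actual code. The function name marimekko_layout and the sample table are hypothetical.

```python
def marimekko_layout(table):
    """Compute rectangles for a Marimekko-style chart on a unit square.

    `table` maps column name -> {row name -> value}. Each rectangle is a
    dict with x, y, width and height expressed as fractions of the plot.
    """
    grand_total = sum(sum(rows.values()) for rows in table.values())
    if grand_total <= 0:
        raise ValueError("table must contain a positive total")

    rectangles = []
    x = 0.0
    for column, rows in table.items():
        column_total = sum(rows.values())
        if column_total <= 0:
            continue  # An empty column takes up no width.
        # Column width is proportional to the column's total value.
        width = column_total / grand_total
        y = 0.0
        for row, value in rows.items():
            # Within a column, segments stack to 100% of the height.
            height = value / column_total
            rectangles.append({
                "column": column, "row": row,
                "x": x, "y": y, "width": width, "height": height,
            })
            y += height
        x += width
    return rectangles


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sales = {
        "East": {"Bikes": 30, "Clothing": 10},
        "West": {"Bikes": 20, "Clothing": 40},
    }
    for rect in marimekko_layout(sales):
        print(rect)
```

Because width times height equals each cell's share of the grand total, the area of every rectangle reflects its relative weight across both dimensions.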
This release also adds support for Spark 2, including Spark 2.1. Zeppelin now also links to the Spark History Server UI so users can more easily track Spark jobs. The Livy interpreter now supports specifying packages with the job.
The major security improvement in Zeppelin 0.7.0 is using Apache Knox’s LDAP Realm to connect to LDAP. The Zeppelin home page now lists only the nodes that the user is authorized to access. Zeppelin now also supports PAM-based authentication.
The full list of improvements is available here.
This visualization platform is growing up nicely.
The Dual KPI efficiently visualizes two measures over time. It shows their trend based on a joint timeline, while absolute values may use different scales, for example Profit and Market share or Sales and Profit.
Each KPI can be visualized as a line chart or an area chart. The visual has dynamic behavior and can show historical value and the change from the latest value when you hover over it. It also has small icons and labels to convey KPI definitions and alerts about data freshness.
It looks cool, but I dunno; my philosophy is that man cannot serve two KPIs.
In this module you will learn how to use the Gap Analysis Power BI Custom Visual. The Gap Analysis visual is used to analyze the difference between two different groups of data you have. For example, you might use it to analyze the gap between two answers people gave in survey response data.
I like the gap analysis visual; it works well as a cross-category comparison visual, giving you an idea of the relative importance of each category as well as the change from one time period to the next. It’s a good way of fitting two useful pieces of information into the same visual.
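The calculation underneath a gap analysis is simple: pair up the categories the two groups share and subtract. Here is a small sketch, assuming each group is a mapping from category to score. The function name gap_analysis and the survey numbers are hypothetical, and sorting by the size of the gap is my own choice rather than something the visual is documented to do.

```python
def gap_analysis(group_a, group_b):
    """Return (category, a, b, gap) tuples, largest absolute gap first.

    Only categories present in both groups are compared. The gap is
    computed as group_b minus group_a.
    """
    shared = [name for name in group_a if name in group_b]
    gaps = [
        (name, group_a[name], group_b[name], group_b[name] - group_a[name])
        for name in shared
    ]
    return sorted(gaps, key=lambda item: abs(item[3]), reverse=True)


if __name__ == "__main__":
    # Hypothetical survey scores for two periods, for illustration only.
    last_year = {"Price": 62, "Support": 48, "Quality": 75}
    this_year = {"Price": 55, "Support": 70, "Quality": 78}
    for name, before, after, gap in gap_analysis(last_year, this_year):
        print(f"{name}: {before} -> {after} ({gap:+d})")
```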
If you’ve learned the basics of data visualization in R (namely, ggplot2) and you’re interested in geospatial visualization, use this as a small, narrowly-defined exercise to practice some intermediate skills.
There are at least three things that you can learn and practice with this visualization:
Learn about color: Part of what makes this visualization compelling is the colors. Notice that in the area surrounding the US, we’re not using pure black, but a dark grey. For the title, we’re not using white, but a medium grey. Also, notice that for the rivers, we’re not using “blue” but a very specific hexadecimal color. These are all deliberate choices. As an exercise, I highly recommend modifying the colors. Play around a bit and see how changing the colors changes the “feel” of the visualization.
Learn to build visualizations in layers: I’ve emphasized this several times recently, but layering is an important principle of data visualization. Notice that we’re layering the river data over the USA country map. As an exercise, you could also layer in the state boundaries between the country map and the rivers. To do this, you can use map_data().
Learn about ‘Spatial’ data: R has several classes for dealing with ‘geospatial’ data, such as ‘SpatialLines’, ‘SpatialPoints’, and others. Spatial data is a whole different animal, so you’ll have to learn its structure. This example will give you a little experience dealing with it.
I also like the iterative approach they discuss. You’ll almost never get it right the first go-around, but one of the nice things about ggplot2 is that it’s designed to be iterative: you layer your changes on, making it a bit easier to fiddle with them to get the visualization just right.
Using sparklyr enables you to analyze big data on Amazon S3 with R smoothly. You can build a Spark cluster easily with Cloudera Director. sparklyr makes Spark a backend for dplyr. You can create tidy data from huge messy data, plot complex maps from this big data the same way as small data, and build a predictive model from big data with MLlib. I believe sparklyr helps all R users perform exploratory data analysis faster and more easily on large-scale data. Let’s try!
You can see the R Markdown for this analysis on RPubs. With RStudio, you can share R Markdown documents easily on RPubs.
Sparklyr is an exciting technology for distributed data analysis.