In the world of data analysis, there are few things more reviled than the pie chart. Among “serious” data people, it is at best trivial and naive, and at worst downright evil.
I do not agree with this. The pie chart is simple, but that is its beauty. It does exactly one thing and it does it well: it shows you how much different parts contribute to a whole. This isn’t the only question you ever have about your data, but when it’s the question you do have, the pie chart is perfect. That is not evil and it is not naive. It is data visualization doing what it should: taking something large and abstract and saying something simple about it that your brain can easily internalize.
I strongly disagree with the arguments in the article, but do respect the attempt. In each of the cases, at least one of a bar chart, a 100% stacked bar chart, or a dot plot could give at least the same amount of information with lower mental overhead.
We have in the middle an open source time series database called InfluxDB, which is designed for collecting timestamped data such as performance metrics. Into that, we feed data from an open source project called Telegraf, which can feed in more than just SQL Server statistics. And to be able to show us the data in nice pretty graphs that we can manipulate, drill down on, and even set up alerts on, we display it using Grafana. You will find links to all of these products as we go through the setup of the solution.
Tracy’s post is dedicated to installation and configuration more than defining metrics, but it does get you on the road to custom metrics visualization.
First, we need to install ggradar and load our relevant libraries. Then, I create a quick standardization function which divides our variable by the max value of that variable in the vector. It doesn’t handle niceties like divide by 0, but we won’t have any zero values in our data frames.
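The standardization function described above might look something like the following. This is a minimal sketch: the name standardize comes from the post, but the body is an assumption based on the description, not the author’s exact code.

```r
# Scale a numeric vector so that its largest value becomes 1.
# As in the description above, there is no guard against a maximum of 0.
standardize <- function(x) {
  x / max(x)
}

standardize(c(10, 25, 50))
```

Dividing by the maximum keeps every variable on a comparable 0-to-1 scale, which is what a radar chart needs when the underlying units differ.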
The radar_data data frame starts out simple: build up some stats by continent. Then I call the mutate_each_ function to apply standardize to each variable. mutate_each_ is deprecated and I should use something different like mutate_at, but this does work in the current version of dplyr at least.
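For comparison, here is a sketch of a non-deprecated form, assuming dplyr 1.0 or later, where across() inside mutate() covers what mutate_each_ and mutate_at used to do. The stats data frame and its column names are hypothetical and are not the post’s data.

```r
library(dplyr)

standardize <- function(x) x / max(x)

# Hypothetical per-continent stats, made up for illustration.
stats <- data.frame(
  continent = c("A", "B", "C"),
  avg_life  = c(50, 75, 100),
  avg_gdp   = c(1000, 4000, 8000)
)

# Apply standardize to the listed columns and leave the rest untouched.
scaled <- stats %>%
  mutate(across(c(avg_life, avg_gdp), standardize))
```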
Finally, I call the ggradar() function. This function has a large number of parameters, but the only one you absolutely need is plot.data. I decided to change the sizes because by default, it doesn’t display well at all on Windows.
It was a lot of fun putting this series together. I think the most important part of the series was learning just how easy ggplot2 is once you sit down and think about it in a systematic manner.
Call shinyalert() with the desired arguments, such as a title and text, and a modal will show up. In order to be able to call shinyalert() in a Shiny app, you must first call useShinyalert() anywhere in the app’s UI.
From the plots above I find that regardless of the different levels of diastolic and systolic blood pressure there is no substantial correlation between cholesterol and blood pressure. However, it is better to build the correlation line with geom_smooth or to calculate the Spearman correlation, although in this post we focus only on the visualization.
Let’s build the correlation line.
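As a rough illustration of both options mentioned above, the sketch below fits a trend line with geom_smooth and computes a Spearman correlation with base R’s cor(). The data is synthetic and the column names are assumptions; none of it comes from the original post.

```r
library(ggplot2)

# Synthetic stand-in data; the column names are assumptions.
set.seed(42)
bp <- data.frame(
  systolic    = rnorm(200, mean = 125, sd = 15),
  cholesterol = rnorm(200, mean = 200, sd = 30)
)

# Scatter plot with a fitted linear trend line and its confidence band.
p <- ggplot(bp, aes(x = systolic, y = cholesterol)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x)

# Rank-based correlation, which does not assume a linear relationship.
rho <- cor(bp$systolic, bp$cholesterol, method = "spearman")
```

A nearly flat trend line together with a rho close to zero would support the "no substantial correlation" reading.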
Click through for several examples of visuals.
Notice that I used geom_path(). This is a geom I did not cover earlier in the series. It’s not a common geom, though it does show up in charts like this where we want to display data for three variables. The geom_line() geom follows the basic rules for a line: that the variable on the y axis is a function of the variable on the x axis, which means that for each element of the domain, there is one and only one corresponding element of the range (and I have a middle school algebra teacher who would be very happy right now that I still remember the definition she drilled into our heads all those years ago).
But when you have two variables which change over time, there’s no guarantee that this will be the case, and that’s where geom_path() comes in. The geom_path() geom does not plot y based on sequential x values, but instead plots values according to a third variable. The trick is, though, that we don’t define this third variable; it’s implicit in the data set order. In our case, our data frame comes in ordered by year, but we could decide to order by, for example, life expectancy by setting data = arrange(global_avg, m_lifeExp). Note that in a scenario like these global numbers, geom_line() and geom_path() produce the same output because we’ve seen consistent improvements in both GDP per capita and life expectancy over the 55-year data set. So let’s look at a place where that’s not true.
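To see the ordering difference on a small scale, here is a sketch with a hypothetical five-row data frame in which x rises and then falls. The data and names are made up for illustration and are not the post’s global_avg.

```r
library(ggplot2)

# Hypothetical yearly values where x rises and then falls again.
df <- data.frame(
  year = 2001:2005,
  x    = c(1, 3, 5, 4, 2),
  y    = c(10, 20, 30, 40, 50)
)

# geom_line() connects points in order of x, ignoring row order.
p_line <- ggplot(df, aes(x = x, y = y)) + geom_line()

# geom_path() connects points in row order, here ordered by year.
p_path <- ggplot(df, aes(x = x, y = y)) + geom_path()

# Reordering the rows changes the path, but would not change the line.
p_path_by_y <- ggplot(df[order(-df$y), ], aes(x = x, y = y)) + geom_path()
```

The line zigzags because it visits the points from left to right, while the path traces them in the order the rows arrive.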
The cowplot library lets you link together plots of different sizes in a couple of lines of code, which is much easier than doing the same with ggplot2 by itself.
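As a hedged sketch of what that looks like, the example below uses cowplot’s plot_grid() to place two plots side by side at different widths. The mpg data set ships with ggplot2 and is only a stand-in; the plots themselves are illustrative.

```r
library(ggplot2)
library(cowplot)

# mpg ships with ggplot2 and stands in for real data here.
p_scatter <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
p_bars    <- ggplot(mpg, aes(x = class)) + geom_bar()

# Place the plots side by side, giving the first twice the width of the second.
combined <- plot_grid(p_scatter, p_bars, ncol = 2, rel_widths = c(2, 1))
```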
Notice that we create a graph per continent by setting facets = ~continent. The tilde there is important: it’s a one-sided formula. You could also write c("continent") if that’s clearer to you.
I also set the number of columns, guaranteeing that we see no more than 3 columns of grids. I could alternatively set nrow, which would guarantee we see no more than a certain number of rows.
There are a couple other interesting features in facet_wrap. First, we can set scales = "free" if we want to draw each grid as if the others did not exist. By default, we use a scale of “fixed” to ensure that everything plots on the same scale. I prefer that for this exercise because it lets us more easily see those continental clusters.
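The facet_wrap options above can be sketched as follows. The mpg data set that ships with ggplot2 is used as a stand-in, with its class column playing the role of continent; that substitution is an assumption for illustration only.

```r
library(ggplot2)

# mpg ships with ggplot2; "class" stands in for the post's continent column.
base <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# One-sided formula, at most three columns of panels, shared axis scales.
p_fixed <- base + facet_wrap(facets = ~class, ncol = 3)

# The same facets given as a character vector, with each panel scaled on its own.
p_free <- base + facet_wrap(facets = c("class"), ncol = 3, scales = "free")
```

With fixed scales, positions are comparable across panels; with free scales, each panel zooms to its own data, so the same position means different values from one panel to the next.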
Facets let you compare multiple graphs quickly. They’re great for fast comparison, but as I show in the post, you can distort the way the data looks by lining it up horizontally or vertically.
You are not limited to using defaults in your graphs. Let’s go back to the minimal theme but change the fonts a bit. I want to make the following changes:
Use Gill Sans fonts instead of the default
Increase the title font size a little bit
Decrease the X axis font size a little bit
Remove the Y axis; the subtitle makes it clear what the Y axis contains
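A sketch of how those four changes might be expressed with theme() follows. The font family name, the sizes, and the mpg stand-in data are all assumptions; the family name in particular depends on what is installed and registered on your system. "Remove the Y axis" is read here as dropping the axis title.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot() +
  labs(
    title    = "Highway mileage by class",
    subtitle = "Miles per gallon",
    x        = "Vehicle class"
  ) +
  theme_minimal() +
  theme(
    text         = element_text(family = "Gill Sans MT"), # assumed font name
    plot.title   = element_text(size = 16),               # slightly larger title
    axis.text.x  = element_text(size = 8),                # slightly smaller X axis text
    axis.title.y = element_blank()                        # subtitle describes the Y axis
  )
```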
By the time we’re through this, we have publication-quality visuals in a few dozen lines of code. I also have provided a bonus rant on Windows and R and fonts because that’s a nasty experience.
Annotations are useful for marking out important comments in your visual. For example, going back to our wealth and longevity chart, there was a group of Asian countries with extremely high GDP but relatively low average life expectancy. I’d like to call out that section of the visual and will use an annotation to do so. To do this, I use the annotate() function. In this case, I’m going to create a text annotation as well as a rectangle annotation so you can see exactly the points I mean.
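Here is a minimal sketch of the text-plus-rectangle pattern using annotate(). It uses the mpg data set from ggplot2 rather than the post’s data, and the coordinates and label are hypothetical values chosen to box a visible cluster.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  # Shaded rectangle around the region of interest.
  annotate("rect", xmin = 5.5, xmax = 7.2, ymin = 22, ymax = 28,
           alpha = 0.2, fill = "red") +
  # Text label placed just above the rectangle.
  annotate("text", x = 6.35, y = 29.5, label = "Large engines, high mileage")
```

Because annotate() takes fixed values instead of aesthetic mappings, each call draws exactly one shape or label regardless of how many rows the data has.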
By this point, we’re getting closer and closer to high-quality graphics.
The other thing I want to cover today is coordinate systems. The ggplot2 documentation shows seven coordinate functions. There are good reasons to use each, but I’m only going to demonstrate one. By default, we use the Cartesian coordinate system and ggplot2 sets the viewing space. This viewing space covers the fullness of your data set and generally is reasonable, though you can change the viewing area using the xlim and ylim parameters.
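As a small sketch of changing the viewing area, coord_cartesian() accepts xlim and ylim and zooms the view without dropping any rows. The mpg data and the limits below are stand-ins chosen for illustration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# Zoom in on part of the plot; points outside the window are kept in the data.
zoomed <- p + coord_cartesian(xlim = c(2, 5), ylim = c(15, 35))
```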
The special coordinate system I want to point out is coord_flip, which flips the X and Y axes. This allows us, for example, to turn a column chart into a bar chart. Taking our life expectancy by continent data, I can create a bar chart whereas before, we’ve been looking at column charts.
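The column-to-bar flip can be sketched like this, again with the mpg data set from ggplot2 standing in for the life expectancy data (an assumption for illustration).

```r
library(ggplot2)

# Column chart: categories on the X axis, bars rising vertically.
columns <- ggplot(mpg, aes(x = class)) + geom_bar()

# coord_flip() swaps the axes, so the same layer draws horizontal bars.
bars <- columns + coord_flip()
```

Nothing about the aesthetics or the geom changes; only the coordinate system does, which is why long category labels often read better after the flip.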
There are a lot of pictures and more step-by-step work. Most of these are still 3-4 lines of code, so again, pretty simple.