Press "Enter" to skip to content

Author: Kevin Feasel

A Layered Grammar Of Graphics

Hadley Wickham describes some of the decisions he made when putting together ggplot2:

In the examples above, we have seen some of the components that make up a plot:
• data and aesthetic mappings,
• geometric objects,
• scales, and
• facet specification.
We have also touched on two other components:
• statistical transformations, and
• the coordinate system.
Together, the data, mappings, statistical transformation, and geometric object form a layer. A plot may have multiple layers, for example, when we overlay a scatterplot with a smoothed line.

This isn’t an article about how to use ggplot2; rather, it’s about the implementation decisions behind the package. To that end, I think it’s useful to see some of the logic that went into ggplot2’s design.
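As a minimal sketch of the layering idea the paper describes (using the built-in mtcars dataset; the variable choices are mine, purely for illustration), here is a ggplot2 call that overlays a scatterplot layer with a smoothed-line layer, both sharing one set of data and aesthetic mappings:

```r
library(ggplot2)

# One set of data and aesthetic mappings shared by two layers:
# a point geom (the scatterplot) and a smoother, which pairs a
# statistical transformation with a line/ribbon geom.
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "loess")
```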


The Grammar Of Graphics

Leland Wilkinson has written the book on how we should write systems which visualize data:

This book was written for statisticians, computer scientists, geographers, research and applied scientists, and others interested in visualizing data. It presents a unique foundation for producing almost every quantitative graphic found in scientific journals, newspapers, statistical packages, and data visualization systems. This foundation was designed for a distributed computing environment (Internet, Intranet, client-server), with special attention given to conserving computer code and system resources.

There’s no free copy of this book, and it’s quite expensive as textbooks go. Most people will get more out of derivative works, but if you’ve thought about putting together a graphics library, this is a must-read.


Data Visualization For Social Science

I’ve started reading Kieran Healy’s book, Data Visualization For Social Science.  He has a free draft available online, and it automatically builds nightly so you’re seeing the latest version.  From the preface:

This book is a hands-on introduction to the principles and practice of looking at and presenting data using R and ggplot. R is a powerful, widely used, and freely available programming language for data analysis. You may be interested in exploring ggplot after having used R before, or be entirely new to both R and ggplot and just want to graph your data. I do not assume you have any prior knowledge of R.

After installing the software we need, we begin with an overview of some basic principles of visualization. We focus not just on the aesthetic aspects of good plots, but on how their effectiveness is rooted in the way we perceive properties like length, absolute and relative size, orientation, shape, and color. We then learn how to produce and refine plots using ggplot2, a powerful, versatile, and widely-used visualization library for R (Wickham 2016a). The ggplot2 library implements a “grammar of graphics” (Wilkinson 2005). This approach gives us a coherent way to produce visualizations by expressing relationships between the attributes of data and their graphical representation.

Through a series of worked examples, you will learn how to build plots piece by piece, beginning with scatterplots and summaries of single variables, then moving on to more complex graphics. Topics covered include plotting continuous and categorical variables, layering information on graphics; faceting grouped data to produce effective “small multiple” plots; transforming data to easily produce visual summaries on the graph such as trend lines, linear fits, error ranges, and boxplots; creating maps, and also some alternatives to maps worth considering when presenting country- or state-level data. We will also cover cases where we are not working directly with a dataset, but rather with estimates from a statistical model. From there, we will explore the process of refining plots to accomplish common tasks such as highlighting key features of the data, labeling particular items of interest, annotating plots, and changing their overall appearance. Finally we will examine some strategies for presenting graphical results in different formats, and to different sorts of audiences.

I’m less than halfway through the book so far, but it is quite an approachable look at the ggplot2 library with a bit of discussion on what makes for quality graphics.
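As a small taste of the “small multiple” technique the preface mentions, here is a hedged sketch using ggplot2’s faceting on the built-in mpg dataset; the variables chosen are mine, not the book’s examples:

```r
library(ggplot2)

# Facet a scatterplot of engine displacement vs. highway mileage by
# vehicle class, producing one small panel per group ("small multiples").
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class)
```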


Fill And Highlight Countries On A Map

The folks at Sharp Sight Labs show how to create a professional-looking filled map in R, including highlighting specific countries:

Let’s point out a few things.

First, the fill color scale has been carefully crafted to optimally show differences between countries.

Second, we are simultaneously using the highlighting technique to highlight the OPEC countries.

Finally, notice that we’re using the title to “tell a story” about the highlighted data.

All told, there is a lot going on in this example.

It’s a very interesting example of building higher-quality visuals in R.  I also give them kudos for picking five colors that work for people with every kind of color vision deficiency.  H/T R-Bloggers
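For a rough idea of the technique (this is not the Sharp Sight Labs code itself), here’s a minimal sketch with ggplot2 and the maps package; the per-country values and the set of highlighted countries are hypothetical stand-ins:

```r
library(ggplot2)
library(dplyr)

# Hypothetical per-country values and a flag for countries to highlight
country_values <- tibble::tibble(
  region    = c("Saudi Arabia", "Iran", "Venezuela", "Canada", "Norway"),
  value     = c(10.5, 4.7, 2.1, 4.6, 1.9),
  highlight = c(TRUE, TRUE, TRUE, FALSE, FALSE)
)

# World polygons from the maps package, joined to the values by country name
world <- map_data("world") %>%
  left_join(country_values, by = "region")

ggplot(world, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = value), color = "grey70", size = 0.1) +
  # Outline the highlighted countries on top of the fill layer
  geom_polygon(data = filter(world, highlight),
               fill = NA, color = "black", size = 0.4) +
  scale_fill_viridis_c(na.value = "grey95") +  # colorblind-friendly fill scale
  coord_quickmap() +
  labs(title = "Hypothetical values by country, highlighted countries outlined",
       fill = NULL) +
  theme_void()
```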


Customer Retention Analysis With SQL

Luba Belokon walks through some sample customer retention analysis queries written in SQL:

Customer retention curves are essential to any business looking to understand its clients and will go a long way towards explaining other things like sales figures or the impact of marketing initiatives. They are an easy way to visualize a key interaction between customers and the business, which is to say, whether or not customers return — and at what rate — after the first visit.

The first step to building a customer retention curve is to identify those who visited your business during the reference period, what I will call p1. It is important that the length of the period chosen is a reasonable one, and reflects the expected frequency of visits.

Different types of businesses are going to expect their customers to return at different rates:

  • A coffee shop may choose to use an expected frequency of visits of once a week.

  • A supermarket may choose a longer period, perhaps two weeks or one month.

In this case, I think the motivation portion is better than the queries themselves, but the article definitely works as an inspiration for building out good measures of frequency of occurrence.
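The original post builds this in SQL; as a rough sketch of the same cohort logic translated to R with dplyr (the table, column names, and sample data here are hypothetical), a weekly retention curve might look like this:

```r
library(dplyr)

# Hypothetical visits table, one row per customer visit; real data would
# come from your transactions system.
visits <- tibble::tibble(
  customer_id = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
  visit_date  = as.Date(c("2018-01-02", "2018-01-09", "2018-01-23",
                          "2018-01-03", "2018-01-17",
                          "2018-01-05", "2018-01-10", "2018-01-31",
                          "2018-01-04"))
)

reference_start <- as.Date("2018-01-01")  # start of the reference period p1
period_length   <- 7                      # weekly, per the coffee-shop example

# Customers who visited during the reference period p1
cohort <- visits %>%
  filter(visit_date >= reference_start,
         visit_date <  reference_start + period_length) %>%
  distinct(customer_id)

# Share of the p1 cohort that returns in each subsequent period
retention <- visits %>%
  semi_join(cohort, by = "customer_id") %>%
  mutate(period = floor(as.numeric(visit_date - reference_start) / period_length)) %>%
  distinct(customer_id, period) %>%
  count(period, name = "customers") %>%
  mutate(retention_rate = customers / nrow(cohort))

retention
```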


What’s New With KSQL

Hojjat Jafarpour announces version 0.3 of KSQL, Confluent’s streaming SQL engine for Apache Kafka:

Additionally, we have taken the first steps to provide metrics and observability in KSQL. This greatly enhances the operability of KSQL, like in cases where you’re monitoring KSQL capacity or when diagnosing issues. You can now see different metrics for streams, tables, and queries for every KSQL server instance.

For streams and tables, we now have DESCRIBE EXTENDED <stream/table name> statement to show statistics, such as number of messages processed per second, total messages, the time when the last message was received, as well as corresponding failure metrics.

Looks like they’re building it out a piece at a time.


Running R Scripts In Power BI

Mark Vaillancourt shows how to run an R script inside Power BI Desktop:

All of the options I will show require you to have R installed on your machine. I am using R version 3.4.3, which I got here, as well as R Studio (an IDE: Integrated Development Environment) version 1.1.383, which I obtained here. You can also use Microsoft R Open, which you can get here. All are free. I am choosing base R and R Studio because I want to play with/show the use of non-Microsoft tools in conjunction with Microsoft tools. I am using version 2.53.4954.481 64-bit (December 2017) of Power BI Desktop. Note that things could look/behave differently in other versions of Power BI Desktop.

For this post, I am using a well-known dataset known as the Iris dataset, which you can read about here. I downloaded the zip file from here to obtain a csv file of the data set for one of my examples. The Iris dataset is also included in the “datasets” package in R Studio, which I will use as well.

Note: A key R concept to understand is that of a data frame, which is essentially just data in a tabular format. In a data frame, the “columns” are actually called “variables.”

Once you have R and an R IDE installed, Power BI Desktop will detect them. You can see this in the Power BI Desktop Options.

Mark shows you step by step using some snazzy SnagIt imagery.
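As a hedged sketch (not Mark’s exact example) of the kind of script you could paste into Power BI Desktop’s R script data source, any data frame the script produces becomes a table you can load; here I use the built-in iris dataset mentioned above:

```r
# Script for Power BI Desktop's "R script" data source: every data frame
# created here is offered as a table to import.
library(datasets)

iris_raw   <- iris                                              # the full Iris dataset
iris_means <- aggregate(. ~ Species, data = iris, FUN = mean)   # mean of each measurement per species
```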


Speed Up Your Spark Queries

John Mount has some good advice for R users running Spark queries:

For some time we have been teaching R users “when working with wide tables on Spark or on databases: narrow to the columns you really want to work with early in your analysis.”

The idea behind the advice is: working with fewer columns makes for quicker queries.

The issue arises because wide tables (200 to 1000 columns) are quite common in big-data analytics projects. Often these are “denormalized marts” that are used to drive many different projects. For any one project only a small subset of the columns may be relevant in a calculation.

Some wonder: is this really an issue, or is it something one can ignore in the hope the downstream query optimizer fixes the problem? In this note we will show the effect is real.

This is good advice for more than just dealing with R on Spark.
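For a concrete illustration of the “narrow early” advice in sparklyr terms (the connection, path, and column names below are hypothetical placeholders, not from the original post):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# A wide "denormalized mart" table; the path and schema are hypothetical.
wide_tbl <- spark_read_parquet(sc, name = "wide_mart",
                               path = "hdfs:///data/wide_mart")

# Project down to just the columns this analysis needs before anything else,
# so later stages of the query move far less data around.
result <- wide_tbl %>%
  select(customer_id, order_date, revenue) %>%
  group_by(customer_id) %>%
  summarise(total_revenue = sum(revenue, na.rm = TRUE)) %>%
  collect()

spark_disconnect(sc)
```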
