Category: R

I suggest organizing each data analysis into a project: a folder on your computer that holds all the files relevant to that particular piece of work. I’m not assuming this is an RStudio Project, though this is a nice implementation discussed below.

Any resident R script is written assuming that it will be run from a fresh R process with working directory set to the project directory. It creates everything it needs, in its own workspace or folder, and it touches nothing it did not create. For example, it does not install additional packages (another pet peeve of mine).

This convention guarantees that the project can be moved around on your computer or onto other computers and will still “just work”. I argue that this is the only practical convention that creates reliable, polite behavior across different computers or users and over time. This convention is neither new, nor unique to R.

I admit that I’m just now getting into using projects regularly for my one-off stuff. This is very good advice. H/T David Smith

Comments closed

Scraping The PASS Budget

Published 2018-01-02 by Kevin Feasel

Steph Locke shows us how to scrape a PDF, specifically, the PASS operating budget:

With tabulizer, if the data is relatively well formatted in a PDF you can use tabulizer::extract_tables(). This gives you a bunch of data.frames which you can process. Unfortunately, in the case of the PASS budget with 22 pages of tables, including tables that span multiple pages, we’re not so lucky!

We need to fall back to tabulizer::extract_text() and do a lot of wrangling to reconstruct the tables.

Steph shows her work, so click through to see the scripts.

Comments closed

Outlier Detection With dplyr And ruler

Published 2017-12-29 by Kevin Feasel

Evgeni Chasnovski shows how to use a couple R packages in concert to find outliers:

During the process of data analysis one of the most crucial steps is to identify and account for outliers, observations that have essentially different nature than most other observations. Their presence can lead to untrustworthy conclusions. The most complicated part of this task is to define a notion of “outlier”. After that, it is straightforward to identify them based on given data.

There are many techniques developed for outlier detection. Majority of them deal with numerical data. This post will describe the most basic ones with their application using dplyrand ruler packages.

After reading this post you will know:

Most basic outlier detection techniques.
A way to implement them using dplyr and ruler.
A way to combine their results in order to obtain a new outlier detection method.
A way to discover notion of “diamond quality” without prior knowledge of this topic (as a happy consequence of previous point).

Read the whole thing. H/T R-Bloggers

Comments closed

rquery: Relational Algebra In R

Published 2017-12-29 by Kevin Feasel

John Mount announces rquery:

rquery is Win-Vector LLC‘s currently in development big data query tool for R.

rquery supplies set of operators inspired by Edgar F. Codd‘s relational algebra (updated to reflect lessons learned from working with R, SQL, and dplyr at big data scale in production).

As an example: rquery operators allow us to write our earlier “treatment and control” example as follows.
dQ <- d %.>%
  extend_se(.,
            if_else_block(
              testexpr =
                "rand()>=0.5",
              thenexprs = qae(
                a_1 := 'treatment',
                a_2 := 'control'),
              elseexprs = qae(
                a_1 := 'control',
                a_2 := 'treatment'))) %.>%
  select_columns(., c("rowNum", "a_1", "a_2"))

It’s an interesting idea.

Comments closed

Getting Started With dplyr

Published 2017-12-29 by Kevin Feasel

Abdul Majed Raja has a dplyr tutorial:

dplyr is one of the most popular r-packages and also part of tidyverse that’s been developed by Hadley Wickham. The mere fact that dplyr package is very famous means, it’s one of the most frequently used. Being a data scientist is not always about creating sophisticated models but Data Analysis (Manipulation) and Data Visualization play a very important role in BAU of many us – in fact, a very important part before any modeling exercise since Feature Engineering and EDA are the most important differentiating factors of your model and someone else’s.
Hence, this post aims to bring out some well-known and not-so-well-known applications of dplyr so that any data analyst could leverage its potential using a much familiar – Titanic Dataset.

This covers the main pieces of dplyr, including its pipeline. dplyr is a key part of the tidyverse, and knowing it well makes R so much easier. H/T R-Bloggers

Comments closed

The ggplot2 Books

Published 2017-12-26 by Kevin Feasel

Hadley Wickham has a couple of books which teach a lot about ggplot2. The first book I’d recommend is his and Garrett Grolemund’s R For Data Science book, which is available for free online:

To map an aesthetic to a variable, associate the name of the aesthetic to the name of the variable inside aes(). ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as scaling. ggplot2 will also add a legend that explains which levels correspond to which values.

The colors reveal that many of the unusual points are two-seater cars. These cars don’t seem like hybrids, and are, in fact, sports cars! Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have large engines.

Wickham also has the source to build his ggplot2 book online. If you don’t want to build the source, you also have the option of buying the book.

Comments closed

A Layered Grammar Of Graphics

Published 2017-12-26 by Kevin Feasel

Hadley Wickham describes some of the decisions he made when putting together ggplot2:

In the examples above, we have seen some of the components that make up a plot:
• data and aesthetic mappings,
• geometric objects,
• scales, and
• facet specification.
We have also touched on two other components:
• statistical transformations, and
• the coordinate system.
Together, the data, mappings, statistical transformation, and geometric object form a layer. A plot may have multiple layers, for example, when we overlay a scatterplot with a smoothed line.

This isn’t an article about how to use ggplot2; rather, it’s an article about implementation decisions. To that end, I think it’s useful to see some of the logic behind ggplot2’s decisions.

Comments closed

Data Visualization For Social Science

Published 2017-12-26 by Kevin Feasel

I’ve started reading Kieran Healy’s book, Data Visualization For Social Science. He has a free draft available online, and it automatically builds nightly so you’re seeing the latest version. From the preface:

This book is a hands-on introduction to the principles and practice of looking at and presenting data using R and ggplot. R is a powerful, widely used, and freely available programming language for data analysis. You may be interested in exploring ggplot after having used R before, or be entirely new to both R and ggplot and just want to graph your data. I do not assume you have any prior knowledge of R.

After installing the software we need, we begin with an overview of some basic principles of visualization. We focus not just on the aesthetic aspects of good plots, but on how their effectiveness is rooted in the way we perceive properties like length, absolute and relative size, orientation, shape, and color. We then learn how to produce and refine plots using ggplot2, a powerful, versatile, and widely-used visualization library for R (Wickham 2016a). The ggplot2 library implements a “grammar of graphics” (Wilkinson 2005). This approach gives us a coherent way to produce visualizations by expressing relationships between the attributes of data and their graphical representation.

Through a series of worked examples, you will learn how to build plots piece by piece, beginning with scatterplots and summaries of single variables, then moving on to more complex graphics. Topics covered include plotting continuous and categorical variables, layering information on graphics; faceting grouped data to produce effective “small multiple” plots; transforming data to easily produce visual summaries on the graph such as trend lines, linear fits, error ranges, and boxplots; creating maps, and also some alternatives to maps worth considering when presenting country- or state-level data. We will also cover cases where we are not working directly with a dataset, but rather with estimates from a statistical model. From there, we will explore the process of refining plots to accomplish common tasks such as highlighting key features of the data, labeling particular items of interest, annotating plots, and changing their overall appearance. Finally we will examine some strategies for presenting graphical results in different formats, and to different sorts of audiences.

I’m less than halfway through the book so far, but it is quite an approachable look at the ggplot2 library with a bit of discussion on what makes for quality graphics.

Comments closed

Fill And Highlight Countries On A Map

Published 2017-12-22 by Kevin Feasel

The folks at Sharp Sight Labs show how to create a professional-looking filled map in R, including highlighting specific countries:

Let’s point out a few things.

First, the fill color scale has been carefully crafted to optimally show differences between countries.

Second, we are simultaneously using the highlighting technique to highlight the OPEC countries.

Finally, notice that we’re using the title to “tell a story” about the highlighted data.

All told, there is a lot going on in this example.

It’s a very interesting example of building higher-quality visuals in R. I also give them kudos for picking five colors which work for people with every kind of Color Vision Deficiency. H/T R-Bloggers

Comments closed

Running R Scripts In Power BI

Published 2017-12-22 by Kevin Feasel

Mark Vaillancourt shows how to run an R script inside Power BI Desktop:

All of the options I will show require you to have R installed on your machine. I am using R version 3.4.3 I got here as well as R Studio (an IDE: Integrated Scripting Environment) version 1.1.383 I obtained here. You can also use Microsoft R Open, which you can get here. All are free. I am choosing base R and R Studio because I want to play with/show the use of non-Microsoft tools in conjunction with Microsoft tools. I am using 2.53.4954.481 64-bit (December 2017) of Power BI Desktop. Note that things could look/behave differently in other version of Power BI Desktop.

For this post, I am using a well-known dataset known as the Iris dataset, which you can read about here. I downloaded the zip file from here to obtain a csv file of the data set for one of my examples. The Iris dataset is also included in the “datasets” package in R Studio, which I will use as well.

Note: A key R concept to understand is that of a data frame, which is essentially just data in a tabular format. In a data frame, the “columns” are actually called “variables.”

Once you have R and an R IDE installed, Power BI Desktop will detect them. You can see this in the Power BI Desktop Options.

Mark shows you step by step using some snazzy SnagIt imagery.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31