Press "Enter" to skip to content

Category: R

Outlier Detection With dplyr And ruler

Evgeni Chasnovski shows how to use a couple R packages in concert to find outliers:

During the process of data analysis one of the most crucial steps is to identify and account for outliers, observations that have essentially different nature than most other observations. Their presence can lead to untrustworthy conclusions. The most complicated part of this task is to define a notion of “outlier”. After that, it is straightforward to identify them based on given data.

There are many techniques developed for outlier detection. Majority of them deal with numerical data. This post will describe the most basic ones with their application using dplyrand ruler packages.

After reading this post you will know:

  • Most basic outlier detection techniques.

  • A way to implement them using dplyr and ruler.

  • A way to combine their results in order to obtain a new outlier detection method.

  • A way to discover notion of “diamond quality” without prior knowledge of this topic (as a happy consequence of previous point).

Read the whole thing.  H/T R-Bloggers

Comments closed

rquery: Relational Algebra In R

John Mount announces rquery:

rquery is Win-Vector LLC‘s currently in development big data query tool for R.

rquery supplies set of operators inspired by Edgar F. Codd‘s relational algebra (updated to reflect lessons learned from working with RSQL, and dplyr at big data scale in production).

As an example: rquery operators allow us to write our earlier “treatment and control” example as follows.

dQ <- d %.>%
  extend_se(.,
            if_else_block(
              testexpr =
                "rand()>=0.5",
              thenexprs = qae(
                a_1 := 'treatment',
                a_2 := 'control'),
              elseexprs = qae(
                a_1 := 'control',
                a_2 := 'treatment'))) %.>%
  select_columns(., c("rowNum", "a_1", "a_2"))

It’s an interesting idea.

Comments closed

Getting Started With dplyr

Abdul Majed Raja has a dplyr tutorial:

dplyr is one of the most popular r-packages and also part of tidyverse that’s been developed by Hadley Wickham. The mere fact that dplyr package is very famous means, it’s one of the most frequently used. Being a data scientist is not always about creating sophisticated models but Data Analysis (Manipulation) and Data Visualization play a very important role in BAU of many us – in fact, a very important part before any modeling exercise since Feature Engineering and EDA are the most important differentiating factors of your model and someone else’s.
Hence, this post aims to bring out some well-known and not-so-well-known applications of dplyr so that any data analyst could leverage its potential using a much familiar – Titanic Dataset.

This covers the main pieces  of dplyr, including its pipeline.  dplyr is a key part of the tidyverse, and knowing it well makes R so much easier.  H/T R-Bloggers

Comments closed

The ggplot2 Books

Hadley Wickham has a couple of books which teach a lot about ggplot2.  The first book I’d recommend is his and Garrett Grolemund’s R For Data Science book, which is available for free online:

To map an aesthetic to a variable, associate the name of the aesthetic to the name of the variable inside aes(). ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as scaling. ggplot2 will also add a legend that explains which levels correspond to which values.

The colors reveal that many of the unusual points are two-seater cars. These cars don’t seem like hybrids, and are, in fact, sports cars! Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have large engines.

Wickham also has the source to build his ggplot2 book online.  If you don’t want to build the source, you also have the option of buying the book.

Comments closed

A Layered Grammar Of Graphics

Hadley Wickham describes some of the decisions he made when putting together ggplot2:

In the examples above, we have seen some of the components that make up a plot:
• data and aesthetic mappings,
• geometric objects,
• scales, and
• facet specification.
We have also touched on two other components:
• statistical transformations, and
• the coordinate system.
Together, the data, mappings, statistical transformation, and geometric object form a layer. A plot may have multiple layers, for example, when we overlay a scatterplot with a smoothed line.

This isn’t an article about how to use ggplot2; rather, it’s an article about implementation decisions.  To that end, I think it’s useful to see some of the logic behind ggplot2’s decisions.

Comments closed

Data Visualization For Social Science

I’ve started reading Kieran Healy’s book, Data Visualization For Social Science.  He has a free draft available online, and it automatically builds nightly so you’re seeing the latest version.  From the preface:

This book is a hands-on introduction to the principles and practice of looking at and presenting data using R and ggplot. R is a powerful, widely used, and freely available programming language for data analysis. You may be interested in exploring ggplot after having used R before, or be entirely new to both R and ggplot and just want to graph your data. I do not assume you have any prior knowledge of R.

After installing the software we need, we begin with an overview of some basic principles of visualization. We focus not just on the aesthetic aspects of good plots, but on how their effectiveness is rooted in the way we perceive properties like length, absolute and relative size, orientation, shape, and color. We then learn how to produce and refine plots using ggplot2, a powerful, versatile, and widely-used visualization library for R (Wickham 2016a). The ggplot2 library implements a “grammar of graphics” (Wilkinson 2005). This approach gives us a coherent way to produce visualizations by expressing relationships between the attributes of data and their graphical representation.

Through a series of worked examples, you will learn how to build plots piece by piece, beginning with scatterplots and summaries of single variables, then moving on to more complex graphics. Topics covered include plotting continuous and categorical variables, layering information on graphics; faceting grouped data to produce effective “small multiple” plots; transforming data to easily produce visual summaries on the graph such as trend lines, linear fits, error ranges, and boxplots; creating maps, and also some alternatives to maps worth considering when presenting country- or state-level data. We will also cover cases where we are not working directly with a dataset, but rather with estimates from a statistical model. From there, we will explore the process of refining plots to accomplish common tasks such as highlighting key features of the data, labeling particular items of interest, annotating plots, and changing their overall appearance. Finally we will examine some strategies for presenting graphical results in different formats, and to different sorts of audiences.

I’m less than halfway through the book so far, but it is quite an approachable look at the ggplot2 library with a bit of discussion on what makes for quality graphics.

Comments closed

Fill And Highlight Countries On A Map

The folks at Sharp Sight Labs show how to create a professional-looking filled map in R, including highlighting specific countries:

Let’s point out a few things.

First, the fill color scale has been carefully crafted to optimally show differences between countries.

Second, we are simultaneously using the highlighting technique to highlight the OPEC countries.

Finally, notice that we’re using the title to “tell a story” about the highlighted data.

All told, there is a lot going on in this example.

It’s a very interesting example of building higher-quality visuals in R.  I also give them kudos for picking five colors which work for people with every kind of Color Vision Deficiency.  H/T R-Bloggers

Comments closed

Running R Scripts In Power BI

Mark Vaillancourt shows how to run an R script inside Power BI Desktop:

All of the options I will show require you to have R installed on your machine. I am using R version 3.4.3 I got here as well as R Studio (an IDE: Integrated Scripting Environment) version 1.1.383 I obtained here. You can also use Microsoft R Open, which you can get here. All are free. I am choosing base R and R Studio because I want to play with/show the use of non-Microsoft tools in conjunction with Microsoft tools. I am using 2.53.4954.481 64-bit (December 2017) of Power BI Desktop. Note that things could look/behave differently in other version of Power BI Desktop.

For this post, I am using a well-known dataset known as the Iris dataset, which you can read about here. I downloaded the zip file from here to obtain a csv file of the data set for one of my examples. The Iris dataset is also included in the “datasets” package in R Studio, which I will use as well.

Note: A key R concept to understand is that of a data frame, which is essentially just data in a tabular format. In a data frame, the “columns” are actually called “variables.”

Once you have R and an R IDE installed, Power BI Desktop will detect them. You can see this in the Power BI Desktop Options.

Mark shows you step by step using some snazzy SnagIt imagery.

Comments closed

Speed Up Your Spark Queries

John Mount has some good advice for R users running Spark queries:

For some time we have been teaching R users “when working with wide tables on Spark or on databases: narrow to the columns you really want to work with early in your analysis.”

The idea behind the advice is: working with fewer columns makes for quicker queries.

The issue arises because wide tables (200 to 1000 columns) are quite common in big-data analytics projects. Often these are “denormalized marts” that are used to drive many different projects. For any one project only a small subset of the columns may be relevant in a calculation.

Some wonder is this really an issue or is it something one can ignore in the hope the downstream query optimizer fixes the problem. In this note we will show the effect is real.

This is good advice for more than just dealing with R on Spark.

Comments closed

Animated Dot Plots In R

John MacKintosh shows how to create dot plots in ggplot2, and then he uses gganimate to turn it into an animated plot:

If you want to follow along (go on) then you should head over to Neil’s site, download the excel file and take a look at the “how to” guide on the same page. Existing R users are already likely to be shuddering at all the manual manipulation required.

For the first attempt, I followed Neil’s approach pretty closely, resulting in a lot of code to sort and group, although ggplot2 made the actual plotting much simpler. I shared my very first attempt ( produce with barely any ggplot2 code) which was quite good, but there were a few issues – the ins/ outs being coloured blue instead of grey, and overplotting of several points.

Click through for code and explanation.  H/T R-Bloggers

Comments closed