
Category: R

Exploratory Data Analysis In R

Laura Ellis walks us through some easy techniques for learning about our data using R:

DIM AND GLIMPSE

Next, we will run the dim function which displays the dimensions of the table. The output takes the form of row, column.

And then we run the glimpse function from the dplyr package. This will display a vertical preview of the dataset. It allows us to easily preview data type and sample data.
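As a minimal sketch of the two calls described above, the snippet below uses the built-in mtcars data frame as a stand-in for the dataset in the post (an assumption; the original dataset is not shown here). It falls back to base R's str() if dplyr is not installed.

```r
# mtcars is a stand-in dataset that ships with base R; it is not the
# dataset used in the original post.
table_dims <- dim(mtcars)   # c(number of rows, number of columns)
print(table_dims)

# glimpse() prints one line per column: name, type, and the first few values.
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::glimpse(mtcars)
} else {
  str(mtcars)               # base R alternative with a similar layout
}
```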

Spending some quality time doing EDA can save you in the long run, as it can help you get a feel for things like data quality, the distributions of variables, and completeness of data.


Azure ML Studio Supports R 3.4

David Smith notes that Azure ML Studio now supports R version 3.4:

With the Execute R Script module you can immediately use more than 650 R packages which come preinstalled in the Azure ML Studio environment. You can also use other R packages (including packages not on CRAN) and source in R scripts you develop elsewhere (as shown above), although this does require the time to install them in the Studio environment. You can even create custom ML Studio models encapsulating R code for others to use in the drag-and-drop environment.

If you’re new to Azure ML Studio, check out the Quickstart Tutorial for R to learn how to use the Execute R Script module, and to check out what’s new in the latest update follow the link below.

Click through for more details.


Investigating UK Traffic With Principal Component Analysis

Michael Grogan shows us how to use Principal Component Analysis (PCA) to classify route segments in UK transportation data:

Specifically, let us assume that we wish to analyze traffic density for buses and coaches. The main thing we are interested in is the frequency of traffic across a particular route.

Let’s take an example. If buses cover 100 miles on a route that is 5 miles long within a certain timeframe, then the frequency will be greater than if they cover 100 miles on a route that is 10 miles long over the same time period.
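The arithmetic behind that example can be sketched in a few lines, assuming that "frequency" means miles covered divided by route length (one reading of the excerpt, not necessarily the author's exact definition). The variable names are hypothetical.

```r
# Figures from the example: 100 miles covered on each route.
vehicle_miles <- c(short_route = 100, long_route = 100)
route_length  <- c(short_route = 5,   long_route = 10)

# Assumed definition: how many times the route is traversed in the period.
frequency <- vehicle_miles / route_length
print(frequency)
```

Under this reading the 5-mile route is traversed 20 times and the 10-mile route 10 times.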

Read on for an interesting example.


Checking Functional Dependencies In R Data Frames

John Mount shows us how to use the psagg function in wrapr to ensure that functional dependencies are valid:

Notice only grouping columns and columns passed through an aggregating calculation (such as max()) are passed through (the column z is not in the result). Now, because y is a function of x, no substantial aggregation is going on; we call this situation a “pseudo aggregation” and we have taught this before. This is also why we made the seemingly strange choice of keeping the variable name y (instead of picking a new name such as max_y): we expect the y values coming out to be the same as the ones coming in, just with changes of length. Pseudo aggregation (using the projection y[[1]]) was also used in the solutions of the column indexing problem.

Our wrapr package now supplies a special case pseudo-aggregator (or in a mathematical sense: projection): psagg(). It works as follows.

In this post, John calls the act of grouping functional dependencies (where we can determine the value of y based on the value of x, for any number of columns in y or x) pseudo-aggregation.
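To make the idea concrete, here is a base R sketch of a pseudo-aggregator. The helper pseudo_agg() and the example data frame are hypothetical and written for illustration; this is not the wrapr implementation of psagg(), only an approximation of the behaviour the excerpt describes.

```r
# Hypothetical helper: return the single value shared by a group, and fail
# loudly if the group holds more than one distinct value (which would mean
# y is not actually a function of x).
pseudo_agg <- function(values) {
  if (length(unique(values)) != 1L) {
    stop("values are not constant within the group")
  }
  values[[1]]
}

# Illustrative data: y is determined by x, z is not.
d <- data.frame(
  x = c("a", "a", "b", "b", "b"),
  y = c(1, 1, 2, 2, 2),
  z = 1:5
)

# One row per value of x; y keeps its name because its values are unchanged.
result <- aggregate(y ~ x, data = d, FUN = pseudo_agg)
print(result)
```

Calling pseudo_agg(d$z) raises an error, which is the point: the check surfaces a broken functional dependency instead of silently picking a value.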


Using datapasta To Paste Spreadsheet Data In R

Mara Averick shows us how we can use datapasta with RStudio to create good representative examples when asking questions:

So, you’ve been asked to make a reprex and you want to include a bit of data that you have in a spreadsheet. Meet {datapasta}, a package by Miles McBain that can make your life a whole lot easier. Once you’ve installed datapasta, you simply copy a selected number of rows from your spreadsheet (remember, this is a minimal reproducible example), and click the Paste as tribble option from the DATAPASTA section of the Addins dropdown.

Click through for a demo.


Building Custom R Visuals In Power BI

Brad Lewellyn shows us how to create custom R visuals within Power BI:

Over the last few posts, we’ve shown how to use custom R visuals built by others.  Today, we’re going to build our own using the Custom R Visual available in Power BI Desktop.  If you haven’t read the second post in this series, Getting Started with R Scripts, it is highly recommended you do so now, as it provides necessary context for how to link Power BI to your local R ISE.

In the previous post, we created a bunch of log-transformed measures to find good predictors for Revenue.  We’re going to use these same measures today to create a basic linear regression model to predict Revenue.  If you want to follow along, the dataset can be found here.  Here’s the custom DAX we used to create the necessary measures.
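A rough sketch of what such an R visual script might look like follows. Power BI hands the selected fields to the script as a data frame named dataset; everything else here is an assumption for illustration: the column names LogCost and LogRevenue are hypothetical, and the data is simulated so the snippet can run outside Power BI.

```r
# Simulated stand-in for the `dataset` data frame that Power BI supplies.
# Column names and values are made up for illustration only.
set.seed(42)
dataset <- data.frame(LogCost = rnorm(50, mean = 5, sd = 1))
dataset$LogRevenue <- 1.5 + 0.8 * dataset$LogCost + rnorm(50, sd = 0.2)

# Basic linear regression on the log-transformed measures.
fit <- lm(LogRevenue ~ LogCost, data = dataset)

# An R visual renders whatever the script plots.
plot(dataset$LogCost, dataset$LogRevenue,
     xlab = "Log of cost (hypothetical measure)",
     ylab = "Log of revenue")
abline(fit, col = "red")
```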

Click through for the example.


Taking Advantage Of Vectorization In R

John Mount explains, using Conway’s Game of Life, the importance of using vectors in R over scalars:

R is an interpreted programming language with vectorized data structures. This means a single R command can ask for very many arithmetic operations to be performed. This also means R computation can be fast. We will show an example of this using Conway’s Game of Life.

Conway’s Game of Life is one of the most interesting examples of cellular automata. It is traditionally simulated on a rectangular grid (like a chessboard) and each cell is considered either live or dead. The rules of evolution are simple: the next life grid is computed as follows:

  • To compute the state of a cell on the next grid sum the number of live cells in the eight neighboring cells on the current grid.

  • If this sum is 3 or if the current cell is live and the sum is 2 or 3, then the cell in the next grid will be live.
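The two rules above can be sketched in vectorized base R as follows. This is an illustration written for this summary, not the code from John's post; the function name life_step() is hypothetical, and it assumes cells beyond the edge of the grid are dead.

```r
# Compute the next grid from a 0/1 integer matrix without looping over cells.
life_step <- function(grid) {
  nr <- nrow(grid)
  nc <- ncol(grid)

  # Pad with a border of dead cells so every shifted view stays in bounds.
  padded <- matrix(0L, nr + 2L, nc + 2L)
  padded[2:(nr + 1L), 2:(nc + 1L)] <- grid

  # Sum eight shifted copies of the grid: one whole-matrix addition per
  # neighbour direction instead of one addition per cell.
  neighbours <- matrix(0L, nr, nc)
  for (dr in -1:1) {
    for (dc in -1:1) {
      if (dr != 0L || dc != 0L) {
        neighbours <- neighbours +
          padded[(2:(nr + 1L)) + dr, (2:(nc + 1L)) + dc, drop = FALSE]
      }
    }
  }

  # Live next turn: exactly 3 live neighbours, or currently live with 2.
  next_grid <- (neighbours == 3L) | (grid == 1L & neighbours == 2L)
  storage.mode(next_grid) <- "integer"
  next_grid
}

# A "blinker": three live cells in a row flip between horizontal and vertical.
blinker <- matrix(0L, 5, 5)
blinker[3, 2:4] <- 1L
print(life_step(blinker))
```

The only explicit loop runs over the eight neighbour directions, so the cost of interpretation stays fixed no matter how large the grid is.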

Not only is the vectorized R code faster than the scalar version, but it’s also terser.


Visualizing A Correlation Matrix With corrplot

Kristian Larsen demonstrates the corrplot package in R:

First we need to read the packages into the R library. For descriptive statistics of the dataset we use the skimr package and for visualization of the correlation matrix we use the corrplot package. We will work with the windspeed dataset from the bReeze package:

# Read packages into R library
library(bReeze)
library(corrplot)
library(skimr)
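For a sense of what comes next, here is a minimal sketch that builds a correlation matrix and plots it. It uses a few columns of the built-in mtcars data as a stand-in (an assumption; the post works with wind data from bReeze), and it prints the rounded matrix instead if corrplot is not installed.

```r
# Stand-in data: four numeric columns from mtcars, not the bReeze wind data.
corr_matrix <- cor(mtcars[, c("mpg", "hp", "wt", "qsec")])

if (requireNamespace("corrplot", quietly = TRUE)) {
  # Circles sized and coloured by correlation strength and sign.
  corrplot::corrplot(corr_matrix, method = "circle")
} else {
  print(round(corr_matrix, 2))
}
```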

Click through for the demo.


Getting The Right R Version For Packages

Colin Gillespie shows a couple methods for figuring out the minimum version of R needed for a set of packages:

In R, there is a handy function called available.packages() that returns a matrix of details corresponding to packages currently available at one or more repositories. Unfortunately, the format isn’t initially amenable to manipulation. For example, consider the readr package:

# dplyr supplies %>%, as_tibble() and filter()
library(dplyr)
readr_desc = available.packages() %>%
  as_tibble() %>%
  filter(Package == "readr")

I immediately converted the data to a tibble, as that:

  • changed the rownames to a proper column

  • changed the matrix to a data frame/tibble, which made selecting easier
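Once the data is in a data frame, the R requirement sits inside the Depends column as text such as "R (>= 3.1)". The sketch below shows one way to pull it out and find the highest minimum version across a set of packages. The helper required_r() and the sample Depends strings are hypothetical, so the snippet runs without touching a repository; it is not the script from Colin's post.

```r
# Hypothetical Depends values in the format available.packages() reports.
depends <- c(
  pkgA = "R (>= 3.1)",
  pkgB = "methods, R (>= 3.4.0)",
  pkgC = NA                       # package with no Depends entry
)

# Extract the version from the "R (>= x.y.z)" entry, or NA if there is none.
required_r <- function(dep) {
  if (is.na(dep)) return(NA_character_)
  parts <- trimws(strsplit(dep, ",", fixed = TRUE)[[1]])
  r_part <- grep("^R[[:space:]]*\\(", parts, value = TRUE)
  if (length(r_part) == 0L) return(NA_character_)
  gsub("^R[[:space:]]*\\(>=[[:space:]]*|[[:space:]]*\\)$", "", r_part[1])
}

versions <- vapply(depends, required_r, character(1))

# package_version() compares versions component by component, so "3.10"
# sorts after "3.9", unlike a plain string comparison.
min_r <- max(package_version(versions[!is.na(versions)]))
print(min_r)
```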

There’s a good use of R functionality to delve into package requirements, as well as a script to try it out yourself.


Packages For Testing R Packages

Maelle Salmon shows us how to test our R packages within R:

If you’re brand-new to unit testing your R package, I’d recommend reading this chapter from Hadley Wickham’s book about R packages.

There’s an R package called RUnit for unit testing, but in the whole post we’ll mention resources around the testthat package since it’s the one we use in our packages, and arguably the most popular one. testthat is great! Don’t hesitate to read its docs again if you started using it a while ago, since the latest major release added the setup() and teardown() functions to run code before and after all tests, very handy.

To set up testing in an existing package, i.e. creating the test folder and adding testthat as a dependency, run usethis::use_testthat(). In our WIP pRojects package, we set up the tests directory for you so you don’t forget. Then, in any case, add new tests for a function using usethis::use_test().
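For readers who have not seen one, a testthat test file looks roughly like the sketch below. The function add_one() is hypothetical and defined inline so the snippet is self-contained; in a real package it would live under R/ and the test under tests/testthat/.

```r
library(testthat)

# Hypothetical function under test.
add_one <- function(x) x + 1

test_that("add_one increments numeric input", {
  expect_equal(add_one(1), 2)
  expect_equal(add_one(c(1, 2)), c(2, 3))
  # Adding 1 to a character value is an error in R.
  expect_error(add_one("a"))
})
```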

The testthis package might help make your testing workflow even smoother. In particular, test_this() “reloads the package and runs tests associated with the currently open R script file.”, and there’s also a function for opening the test file associated with the current R script.

This is an area where I know I need to get better, and Maelle gives us a plethora of tooling for tests.
