Press "Enter" to skip to content

Category: R

Speed Up Your Spark Queries

John Mount has some good advice for R users running Spark queries:

For some time we have been teaching R users “when working with wide tables on Spark or on databases: narrow to the columns you really want to work with early in your analysis.”

The idea behind the advice is: working with fewer columns makes for quicker queries.

The issue arises because wide tables (200 to 1000 columns) are quite common in big-data analytics projects. Often these are “denormalized marts” that are used to drive many different projects. For any one project only a small subset of the columns may be relevant in a calculation.

Some wonder is this really an issue or is it something one can ignore in the hope the downstream query optimizer fixes the problem. In this note we will show the effect is real.

This is good advice for more than just dealing with R on Spark.

Comments closed

Animated Dot Plots In R

John MacKintosh shows how to create dot plots in ggplot2, and then he uses gganimate to turn it into an animated plot:

If you want to follow along (go on) then you should head over to Neil’s site, download the excel file and take a look at the “how to” guide on the same page. Existing R users are already likely to be shuddering at all the manual manipulation required.

For the first attempt, I followed Neil’s approach pretty closely, resulting in a lot of code to sort and group, although ggplot2 made the actual plotting much simpler. I shared my very first attempt ( produce with barely any ggplot2 code) which was quite good, but there were a few issues – the ins/ outs being coloured blue instead of grey, and overplotting of several points.

Click through for code and explanation.  H/T R-Bloggers

Comments closed

Hierarchical Clustering

Chaitanya Sagar explains hierarchical clustering with examples in R:

Hope now you have a better understanding of clustering algorithms than what you started with. We discussed about Divisive and Agglomerative clustering techniques and four linkage methods namely, Single, Complete, Average and Ward’s method. Next, we implemented the discussed techniques in R using a numeric dataset. Note that we didn’t have any categorical variable in the dataset we used. You need to treat the categorical variables in order to incorporate them into a clustering algorithm. Lastly, we discussed a couple of plots to visualise the clusters/groups formed. Note here that we have assumed value of ‘k’ (number of clusters) is known. However, this is not always the case. There are a number of heuristics and rules-of-thumb for picking number of clusters. A given heuristic will work better on some datasets than others. It’s best to take advantage of domain knowledge to help set the number of clusters, if that’s possible. Otherwise, try a variety of heuristics, and perhaps a few different values of k.

There’s a lot to pick out of this post, but you’re able to walk through it step by step.  H/T R-Bloggers

Comments closed

Pipes And More Pipes In R

Gabriel (de Selding?) has a tutorial on how to use the various pipes in R:

In F#, the pipe-forward operator |> is syntactic sugar for chained method calls. Or, stated more simply, it lets you pass an intermediate result onto the next function.

Remember that “chaining” means that you invoke multiple method calls. As each method returns an object, you can actually allow the calls to be chained together in a single statement, without needing variables to store the intermediate results.

In R, the pipe operator is, as you have already seen, %>%. If you’re not familiar with F#, you can think of this operator as being similar to the +in a ggplot2 statement. Its function is very similar to that one that you have seen of the F# operator: it takes the output of one statement and makes it the input of the next statement. When describing it, you can think of it as a “THEN”.

Auto-recommended for the F# love, and a good tutorial to boot.

John Mount has a few interesting notes on the topic:

Read on for the rest of his notes, too.

Comments closed

Moving From reshape2 To tidyr

Martin Johnsson talks about a couple tricky bits when moving from reshape2 to tidyr:

In practice, I don’t think people always take their data frames all the way to tidy. For example, to make a scatterplot, it is convenient to keep a couple of variables as different columns. The key is that we need to move between different forms rapidly (brain time-rapidly, more than computer time-rapidly, I might add).

And not everything should be organized this way. If you’re a geneticist, genotypes are notoriously inconvenient in normalized form. Better keep that individual by marker matrix.

The first serious piece of R code I wrote for someone else was a function to turn data into long form for plotting. I suspect plotting is often the gateway to tidy data. The function was like what you’d expect from R code written by a beginner who comes from C-style languages: It reinvented the wheel, and I bet it had nested for loops, a bunch of hard bracket indices, and so on. Then I discovered reshape2.

I’d not used reshape2 before, having started with tidyr, so it was interesting to see the contrast.

Comments closed

Data Manipulation In R

Steph Locke has a new book out:

Data Manipulation in R is the second book in my R Fundamentals series that takes folks from no programming knowledge through to an experienced R user. Working with Rfocussed on the very basics and this second book covers data wrangling.

Introducing the core skill of working with tabular data, this book goes from importing data, to analysing it, and then getting it back out for consumption elsewhere. Leaning heavily on the tidyverse, I think it’s an accessible introduction for those new to analysing data.

The book is free if you have Kindle Unlimited; otherwise, it’s not expensive at all.

Comments closed

An Introduction To seplyr

John Mount guest blogs on the Revolutions blog about seplyr:

seplyr is an R package that supplies improved standard evaluation interfaces for many common data wrangling tasks.

The core of seplyr is a re-skinning of dplyr‘s functionality to seplyr conventions (similar to how stringr re-skins the implementing package stringi).

Read on for a couple of examples of where seplyr can make it easier for you to program with than dplyr.

Comments closed

R In Linux For Windows

David Smith shows how to install and use R in the Windows Subsystem for Linux:

R has been available for Windows since the very beginning, but if you have a Windows machine and want to use R within a Linux ecosystem, that’s easy to do with the new Fall Creator’s Update (version 1709). If you need access to the gcc toolchain for building R packages, or simply prefer the bash environment, it’s easy to get things up and running.

Once you have things set up, you can launch a bash shell and run R at the terminal like you would in any Linux system. And that’s because this is a Linux system: the Windows Subsystem for Linux is a complete Linux distribution running within Windows. This page provides the details on installing Linux on Windows, but here are the basic steps you need and how to get the latest version of R up and running within it.

Click through for a quick tutorial.

Comments closed

A Hack For Dynamic ML Services Result Sets

Dave Mason has put together a solution to his dynamic data frame naming problem:

We can take those names and R types, string them together, and “convert” them to SQL data types. (Mapping data types from one language to another is waaaay outside the scope of this post. Lines 11-13 are quick and dirty, just for demonstration purposes. Okie dokie?)

It’s certainly a less than ideal solution, but it does the job.  And I hope as well that someday this functionality ends up being something built into the product.

Comments closed

Taking A Random Walk

Dan Goldstein describes the basics of Brownian motion:

I was sitting in a bagel shop on Saturday with my 9 year old daughter. We had brought along hexagonal graph paper and a six sided die. We decided that we would choose a hexagon in the middle of the page and then roll the die to determine a direction:

1 up (North)
2 diagonal to the upper right (Northeast)
3 diagonal to the lower right (Southeast)
4 down (South)
5 diagonal to the lower left (Southwest)
6 diagonal to the upper left (Northwest)

Our first roll was a six so we drew a line to the hexagon northwest of where we started. That was the first “step.”

After a few rolls we found ourselves coming back along a path we had gone down before. We decided to draw a second line close to the first in those cases.

We did this about 50 times. The results are pictured above, along with kid hands for scale.

Javi Fernandez-Lopez then shows how to generate an animated GIF displaying Brownian motion:

Last Monday we celebrated a “Scientific Marathon” at Royal Botanic Garden in Madrid, a kind of mini-conference to talk about our research. I was talking about the relation between fungal spore size and environmental variables such as temperature and precipitation. To make my presentation more friendly, I created a GIF to explain the Brownian Motion model. In evolutionary biology, we can use this model to simulate the random variation of a continuous trait through time. Under this model, we can notice how closer species tend to maintain closer trait values due to shared evolutionary history. You have a lot of information about Brownian Motion models in evolutionary biology everywhere!

Another place that this is useful is in describing stock market movements in the short run.

Comments closed