Press "Enter" to skip to content

Category: R

Dplyr Tutorial

Deepanshu Bhalla has a nice dplyr tutorial:

What is dplyr?
dplyr is a powerful R-package to manipulate, clean and summarize unstructured data. In short, it makes data exploration and data manipulation easy and fast in R.
 
What’s special about dplyr?

The package “dplyr” comprises many functions that perform mostly used data manipulation operations such as applying filter, selecting specific columns, sorting data, adding or deleting columns and aggregating data. Another most important advantage of this package is that it’s very easy to learn and use dplyr functions. Also easy to recall these functions. For example, filter() is used to filter rows.

dplyr is a core package when it comes to data cleansing in R, and the more of it you can internalize, the faster you’ll move in that language.

Comments closed

Looping In R

Klodian Dhana explains how to build a for loop in R:

I used linear mixed effect model and therefore I loaded the lme4 library. The loop should work with other regression analysis (i.e. linear regression), if you modify it according to your regression model. If you don’t know which part to modify, leave a comment below and I will try to help.

As other loops, this call variables of interest one by one and for each of them extract and store the betas, standard error and p value. Remember, this code is specific for linear mixed effect models.

Read the whole thing.  It’s good to keep in mind, though, that set-based R operations tend to perform best, so save looping for cases in which you can’t build a set-based function.

Comments closed

Mapping Geospatial Data

The folks at Sharp Sight Labs have a great blog post on mapping geospatial data using R:

If you’ve learned the basics of data visualization in R (namely, ggplot2) and you’re interested in geospatial visualization, use this as a small, narrowly-defined exercize to practice some intermediate skills.

There are at least three things that you can learn and practice with this visualization:

  1. Learn about color: Part of what makes this visualization compelling are the colors. Notice that in the area surrounding the US, we’re not using pure black, but a dark grey. For the title, we’re not using white, but a medium grey. Also, notice that for the rivers, we’re not using “blue” but a very specific hexadecimal color. These are all deliberate choices. As an exercise, I highly recommend modifying the colors. Play around a bit and see how changing the colors changes the “feel” of the visualization.

  2. Learn to build visualizations in layers: I’ve emphasized this several times recently, but layering is an important principle of data visualization. Notice that we’re layering the river data over the USA country map. As an exercise, you could also layer in the state boundaries between the country map and the rivers. To do this, you can use map_data().

  3. Learn about ‘Spatial’ data: R has several classes for dealing with ‘geospatial’ data, such as ‘SpatialLines‘, ‘SpatialPoints‘, and others. Spatial data is a whole different animal, so you’ll have to learn its structure. This example will give you a little experience dealing with it.

I also like the iterative approach they discuss.  You’ll almost never get it right the first go-around, but one of the nice things about ggplot2 is that it’s designed to be iterative:  you layer your changes on, making it a bit easier to fiddle with them to get the visualization just right.

Comments closed

Articles On R And SQL Server

Tomas Kastrun links to a series of his published articles on working with SQL Server and R:

In past couple of months, I have prepared several articles on R and SQL Server that have been published on SQL Server Central.

The idea was, to have couple of articles covering the introduction to R, to basics on R Server, to some practical cases on R with SQL Server.

There’s a nice flow here, building up from the basics to practical marginal improvements.

Comments closed

Custom Power BI Shapes Using R

Koen Verbeeck uses R to create dynamically changing images in Power BI:

You can insert images into Power BI Desktop, but these are static images. If you want them to dynamically change, you need the Image Viewer custom visual. Unfortunately, it doesn’t support measures, only columns. Since we want dynamic changes, fixed column values are not going to work. Someone proposed a work around on the Power BI forums, but this only works if you have a fixed set of attributes you want to slice on (for example, 4 categories). I want a totally flexible solution (e.g. each month we have a couple of new weeks to slice on), so again, not possible.

The only solution I could think of that would still work: using R visuals.

Read on for the solution.

Comments closed

Using Sparklyr To Analyze Flight Data

Aki Ariga uses sparklyr on Apache Spark 2.0 to analyze flight data living in S3:

Using sparklyr enables you to analyze big data on Amazon S3 with R smoothly. You can build a Spark cluster easily with Cloudera Director. sparklyr makes Spark as a backend database of dplyr. You can create tidy data from huge messy data, plot complex maps from this big data the same way as small data, and build a predictive model from big data with MLlib. I believe sparklyr helps all R users perform exploratory data analysis faster and easier on large-scale data. Let’s try!

You can see the Rmarkdown of this analysis on RPubs. With RStudio, you can share Rmarkdown easily on RPubs.

Sparklyr is an exciting technology for distributed data analysis.

Comments closed

RevoScaleR

Tomaz Kastrun explains how the RevoScaleR package is useful:

RevoScaleR package and computational function were designed for parallel computation with no memory limitation, mainly because this package introduced it’s own file format, called XDF. eXternal Data Frame was designed for fast processing of smaller chunks of data, and gains it’s efficiency when reading and writing the XDF data by loading chucks of data into RAM one by at a time and only what is needed. The way this is done, means no limitations for the size of RAM, computations run much faster (because it is using C++ to write these algorithms, which is faster than original, which were written in interpretative language). Data scientist still make a single R call, bur R will use distrubuteR component to determine, how many cores, sockets and threads are available and then launch smaller portion of load into each thread, analyze data a bit at a time. With XDF, data is retrieved many times, but since it is 5-10times smaller (as I have already shown in previous blog posts when compared to *.txt or *.csv files), and it is written and stored into XDF file the same way as it was extracted from the memory, it enables faster computations, because no parsing of data chunks is required and because of the way, how data is stored, is maximizes the retrieval time of the data.

If you’re using SQL Server R Services, these rx functions will become very important to you.

Comments closed

Superheat

David Smith shows off a very cool heatmap package called superheat:

While the superheat pacakge uses the ggplot2 package internally, it doesn’t itself follow the grammar of graphics paradigm: the function is more like a traditional base R graphics function with a couple of dozen options, and it creates a plot directly rather than returning a ggplot2 object that can be further customized. But as long as the options cover your heatmap needs (and that’s likely), you should find it a useful tool next time you need to represent data on a grid.

The superheat package apparently works with any R version after 3.1 (and I can confirm it works on the most recent, R 3.3.2). This arXiv paper provides some details and several case studies, and you can find more examples here. Check out the vignette for detailed usage instructions, and download it from its GitHub repository linked below.

Click through for some great-looking examples.

Comments closed

Data Frame Serialization In R

David Smith shows a new contender for serializing data frames in R, fst:

And now there’s a new package to add to the list: the fst package. Like the data.table package (the fast data.frame replacement for R), the primary focus of the fst package is speed. The chart below compares the speed of reading and writing data to/from CSV files (with fwrite/fread), feather, fts, and the native R RDS format. The vertical axis is throughput in megabytes per second — more is better. As you can see, fst outperforms the other options for both reading (orange) and writing (green).

These early numbers look great, so this is a project worth keeping an eye on.

Comments closed

10,000 R Packages

David Smith notes that CRAN is now up to 10,000 packages:

Having so many packages available can be a double-edged sword though: it can take some searching to find the package you need. Luckily, there are some resources available to help you:

  • MRAN (the Microsoft R Application Network) provides a search tool for R packages on CRAN.

  • To find the most popular packages, Rdocumentation.org provides a leaderboard of packages by number of downloads. It also provides lists of newly-released and recently-updated packages.

R is a big language; having good heuristics for figuring out where to find appropriate packages is extremely important.

Comments closed