Press "Enter" to skip to content

Category: R

Microsoft + R

David Smith points out a bunch of the ways that Microsoft integrates R into products:

You can call R from within some data oriented Microsoft products, and apply R functions (from base R, from packages, or R functions you’ve written) to the data they contain.

  • SQL Server (the database) allows you to call R from SQL, or publish R functions to a SQL Server for database adminstrators to use from SQL.

  • Power BI (the reporting and visualization tool) allows you to call R functions to process data, create graphics, or apply statistical models to data.

  • Visual Studio (the integrated development environment) includes R as a fully-supported language with syntax highlighting, debugging, etc.

  • R is supported in various cloud-based services in Azure, including the Data Science Virtual Machine and Azure Machine Learning Studio. You can also publish R functions to Azure with the AzureML package, and then call those R functions from applications like Excel or apps you write yourself.

They’re pretty well invested in both R and Python, which is a good thing.

Comments closed

Using cowplot With ggplot2

I have a post on extending ggplot2’s functionality with cowplot:

Notice that I used geom_path().  This is a geom I did not cover earlier in the series.  It’s not a common geom, though it does show up in charts like this where we want to display data for three variables.  The geom_line() geom follows the basic rules for a line:  that the variable on the y axis is a function of the variable on the x axis, which means that for each element of the domain, there is one and only one corresponding element of the range (and I have a middle school algebra teacher who would be very happy right now that I still remember the definition she drilled into our heads all those years ago).

But when you have two variables which change over time, there’s no guarantee that this will be the case, and that’s where geom_path() comes in.  The geom_path() geom does not plot y based on sequential x values, but instead plots values according to a third variable.  The trick is, though, that we don’t define this third variable—it’s implicit in the data set order.  In our case, our data frame comes in ordered by year, but we could decide to order by, for example, life expectancy by setting data = arrange(global_avg, m_lifeExp).  Note that in a scenario like these global numbers, geom_line() and geom_path() produce the same output because we’ve seen consistent improvements in both GDP per capita and life expectancy over the 55-year data set.  So let’s look at a place where that’s not true.

The cowplot library gives you an easier way of linking together different plots of different sizes in a couple lines of code, which is much easier than using ggplot2 by itself.

Comments closed

Simple Data Transformation Tricks In R

Abdul Majed Raja has a few tidyverse-friendly data transformation tips:

Splitting a column to many columns is a cliched Data Transformation case that’s hardly unseen while performing Data Transformation. While it’s straightforward to do this in Microsoft Excel, it’s slightly tricky using Data analytics languages. That is true until this function separate() from tidyr came.

These are small but helpful tips.

Comments closed

Fun With Random Walks

Emrah Mete simulates a random walk in R:

Let’s consider a game where a gambler is likely to win $1 with a probability of p and lose $1 with a probability of 1-p.

Now, let’s consider a game where a gambler is likely to win $1 and lose $1 with a probability of 1. The player starts the game with X dollars in hand. The player plays the game until the money in his hand reaches N (N> X) or he has no money left. What is the probability that the player will reach the target value? (We know that the player will not leave the game until he reaches the N value he wants to win.)

The problem of the story above is known in literature as Gambler’s Ruin or Random Walk. In this article, I will simulate this problem with R with different settings and examine how the game results change with different settings.

Click through for the script and analysis.  There’s a reason they call this game the gambler’s ruin.

Comments closed

DataExplorer

Boxuan Cui introduces DataExplorer, an R package dedicated to assist with exploratory data analysis:

According to a Forbes article, cleaning and organizing data is the most time-consuming and least enjoyable data science task. Of all the resources out there, DataExplorer is one of them, with its sole mission to minimize the 80%, and make it enjoyable. As a result, one fundamental design principle is to be extremely user-friendly. Most of the time, one function call is all you need.

Data manipulation is powered by data.table, so tasks involving big datasets usually complete in a few seconds. In addition, the package is flexible enough with input data classes, so you should be able to throw in any data.frame-like objects. However, certain functions require a data.table class object as input due to the update-by-reference feature, which I will cover in later part of the post.

For my money, that number is closer to 90%.  I will have to check this package out.

Comments closed

Using Power BI To Pass Parameterized Values To R

Stacia Varga shows how you can parameterize your R scripts within Power BI:

It’s not difficult, but the cool thing about Power BI is that I can use parameters to dynamically change the report visualization without opening up the script. To do this:

  • Open the Query Editor in Power BI

  • Click Manage Parameters, and then click New Parameter.

  • Set the parameter properties – Name, Type, and Current Value.The Name is how I will reference the parameter my R script, the Type is the data type, and Current Value is the initial value that I want to set (if any).

Click through for an example and more details.

Comments closed

Faceted ggplot2

I have another post in my ggplot2 series, this time covering facets:

Notice that we create a graph per continent by setting facets = ~continent.  The tilde there is important—it’s a one-sided formula.  You could also write c("continent") if that’s clearer to you.

I also set the number of columns, guaranteeing that we see no more than 3 columns of grids. I could alternatively set nrow, which would guarantee we see no more than a certain number of rows.

There are a couple other interesting features in facet_wrap. First, we can set scales = "free" if we want to draw each grid as if the others did not exist. By default, we use a scale of “fixed” to ensure that everything plots on the same scale. I prefer that for this exercise because it lets us more easily see those continental clusters.

Facets let you compare multiple graphs quickly.  They’re great for fast comparison, but as I show in the post, you can distort the way the data looks by lining it up horizontally or vertically.

Comments closed

Themes And Legends In ggplot2

I have another part of my ggplot2 series up, this time on themes and legends:

You are not limited to using defaults in your graphs.  Let’s go back to the minimal theme but change the fonts a bit.  I want to make the following changes:

  1. Use Gill Sans fonts instead of the default

  2. Increase the title font size a little bit

  3. Decrease the X axis font size a little bit

  4. Remove the Y axis; the subtitle makes it clear what the Y axis contains

By the time we’re through this, we have publication-quality visuals in a few dozen lines of code.  I also have provided a bonus rant on Windows and R and fonts because that’s a nasty experience.

Comments closed

Labels And Annotations In ggplot2

I have another post in my ggplot2 series:

Annotations are useful for marking out important comments in your visual.  For example, going back to our wealth and longevity chart, there was a group of Asian countries with extremely high GDP but relatively low average life expectancy.  I’d like to call out that section of the visual and will use an annotation to do so.  To do this, I use the annotate() function.  In this case, I’m going to create a text annotation as well as a rectangle annotation so you can see exactly the points I mean.

By this point, we’re getting closer and closer to high-quality graphics.

Comments closed

Library Paths In R

Stacia Varga troubleshoots an issue integrating Power BI with R:

As I was putting together an example of using an R script as a Power BI data source, I ran into some issues on my development machine that was frankly driving me crazy. When I tried to run the query in Power BI with my R script (that ran successfully in the IDE, by the way), I kept getting this message:

DataSource.Error: ADO.NET: R script error.
Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) :

  namespace 'scales' 0.3.0 is being loaded, but >= 0.4.1 is required

Error: package or namespace load failed for 'rnoaa'

Execution halted

Stacia’s answer works as long as the .libPaths() results match expectations.  Another idea would be to set the R_LIBS_USER user-level environment variable to the desired starting directory and that should force the directory in the environment variable to be first when calling .libPaths().

Comments closed