
Category: R

Fun With Random Walks

Emrah Mete simulates a random walk in R:

Let’s consider a game in which a gambler wins $1 with probability p and loses $1 with probability 1-p.

The player starts the game with X dollars in hand and plays until the money in his hand reaches N (N > X) or he has no money left. What is the probability that the player will reach the target value? (We assume the player will not leave the game until he either reaches N or loses everything.)

The problem in the story above is known in the literature as the Gambler’s Ruin, or a random walk. In this article, I will simulate this problem in R under different settings and examine how the game results change.
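
For a flavor of what such a simulation looks like, here is a minimal sketch, not the author’s exact script; the win probability p, starting stake X, and target N below are illustrative values.

# Gambler's ruin: play until the stake hits 0 or the target N
set.seed(42)

simulate_game <- function(p = 0.5, X = 10, N = 20) {
  money <- X
  while (money > 0 && money < N) {
    money <- money + sample(c(1, -1), size = 1, prob = c(p, 1 - p))
  }
  money == N   # TRUE if the gambler reached the target before going broke
}

# Estimate the probability of reaching N by simulating many games
mean(replicate(10000, simulate_game(p = 0.5, X = 10, N = 20)))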

Click through for the script and analysis.  There’s a reason they call this game the gambler’s ruin.


DataExplorer

Boxuan Cui introduces DataExplorer, an R package dedicated to assisting with exploratory data analysis:

According to a Forbes article, cleaning and organizing data is the most time-consuming and least enjoyable data science task, taking up roughly 80% of a project’s time. DataExplorer is one of the resources aimed at that problem, with the sole mission of minimizing that 80% and making it enjoyable. As a result, one fundamental design principle is to be extremely user-friendly. Most of the time, one function call is all you need.

Data manipulation is powered by data.table, so tasks involving big datasets usually complete in a few seconds. In addition, the package is flexible enough with input data classes, so you should be able to throw in any data.frame-like objects. However, certain functions require a data.table class object as input due to the update-by-reference feature, which I will cover in a later part of the post.
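
As a hedged sketch of what a typical session looks like, applied to a built-in data set rather than anything from the post (the function names are the package’s documented one-liners as I understand them):

library(DataExplorer)

introduce(airquality)         # row/column counts, missing values, memory usage
plot_missing(airquality)      # missing-value profile per column
plot_histogram(airquality)    # distributions of the continuous columns
plot_correlation(airquality)  # correlation heatmap
create_report(airquality)     # one call to generate a full HTML EDA report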

For my money, that number is closer to 90%.  I will have to check this package out.


Using Power BI To Pass Parameterized Values To R

Stacia Varga shows how you can parameterize your R scripts within Power BI:

It’s not difficult, but the cool thing about Power BI is that I can use parameters to dynamically change the report visualization without opening up the script. To do this:

  • Open the Query Editor in Power BI

  • Click Manage Parameters, and then click New Parameter.

  • Set the parameter properties: Name, Type, and Current Value. The Name is how I will reference the parameter in my R script, the Type is the data type, and the Current Value is the initial value that I want to set (if any).

Click through for an example and more details.


Faceted ggplot2

I have another post in my ggplot2 series, this time covering facets:

Notice that we create a graph per continent by setting facets = ~continent.  The tilde there is important—it’s a one-sided formula.  You could also write c("continent") if that’s clearer to you.

I also set the number of columns, guaranteeing that we see no more than 3 columns of grids. I could alternatively set nrow, which would guarantee we see no more than a certain number of rows.

There are a couple other interesting features in facet_wrap. First, we can set scales = "free" if we want to draw each grid as if the others did not exist. By default, we use a scale of “fixed” to ensure that everything plots on the same scale. I prefer that for this exercise because it lets us more easily see those continental clusters.
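
A minimal sketch of the facet_wrap() call described above, assuming the gapminder data set that the series’ wealth-and-longevity examples suggest:

library(ggplot2)
library(gapminder)

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.4) +
  scale_x_log10() +
  facet_wrap(facets = ~continent, ncol = 3, scales = "fixed")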

Facets let you compare multiple graphs quickly, but as I show in the post, you can distort the way the data looks depending on whether you line the facets up horizontally or vertically.


Themes And Legends In ggplot2

I have another part of my ggplot2 series up, this time on themes and legends:

You are not limited to using defaults in your graphs.  Let’s go back to the minimal theme but change the fonts a bit.  I want to make the following changes (see the sketch after this list):

  1. Use Gill Sans fonts instead of the default

  2. Increase the title font size a little bit

  3. Decrease the X axis font size a little bit

  4. Remove the Y axis; the subtitle makes it clear what the Y axis contains
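
Here is a sketch of those four changes; the font name and the exact sizes are illustrative assumptions rather than the post’s values, and font availability depends on your operating system:

library(ggplot2)
library(gapminder)

ggplot(gapminder, aes(x = continent, y = lifeExp)) +
  geom_boxplot() +
  labs(title = "Life Expectancy by Continent", subtitle = "Years", x = NULL, y = NULL) +
  theme_minimal() +
  theme(
    text = element_text(family = "Gill Sans MT"),   # 1. Gill Sans instead of the default
    plot.title = element_text(size = rel(1.2)),     # 2. slightly larger title
    axis.text.x = element_text(size = rel(0.9)),    # 3. slightly smaller X axis text
    axis.text.y = element_blank(),                  # 4. drop the Y axis text...
    axis.ticks.y = element_blank()                  #    ...the subtitle covers it
  )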

By the time we’re through, we have publication-quality visuals in a few dozen lines of code.  I’ve also provided a bonus rant on Windows, R, and fonts, because that’s a nasty experience.


Labels And Annotations In ggplot2

I have another post in my ggplot2 series:

Annotations are useful for marking out important comments in your visual.  For example, going back to our wealth and longevity chart, there was a group of Asian countries with extremely high GDP but relatively low average life expectancy.  I’d like to call out that section of the visual and will use an annotation to do so.  To do this, I use the annotate() function.  In this case, I’m going to create a text annotation as well as a rectangle annotation so you can see exactly the points I mean.
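
A minimal sketch of the annotate() calls, again assuming the gapminder data; the coordinates of the box are illustrative guesses rather than the post’s values:

library(ggplot2)
library(gapminder)

ggplot(subset(gapminder, year == 2007), aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) +
  scale_x_log10() +
  annotate("rect", xmin = 20000, xmax = 50000, ymin = 60, ymax = 78,
           alpha = 0.15, fill = "red") +
  annotate("text", x = 31000, y = 58,
           label = "High GDP, relatively low life expectancy")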

By this point, we’re getting closer and closer to high-quality graphics.


Library Paths In R

Stacia Varga troubleshoots an issue integrating Power BI with R:

As I was putting together an example of using an R script as a Power BI data source, I ran into some issues on my development machine that were frankly driving me crazy. When I tried to run the query in Power BI with my R script (which ran successfully in the IDE, by the way), I kept getting this message:

DataSource.Error: ADO.NET: R script error.
Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) :
  namespace 'scales' 0.3.0 is being loaded, but >= 0.4.1 is required
Error: package or namespace load failed for 'rnoaa'
Execution halted

Stacia’s answer works as long as the .libPaths() results match expectations.  Another idea would be to set the R_LIBS_USER user-level environment variable to the desired starting directory; that should force that directory to come first when calling .libPaths().
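
For reference, a quick sketch of inspecting and adjusting the search path; the directory below is a placeholder, not Stacia’s actual path:

.libPaths()   # inspect the current library search order

# Prepend a preferred library for the current session only
.libPaths(c("C:/Users/me/Documents/R/win-library/3.4", .libPaths()))

# Or make it persistent: add this line to ~/.Renviron so the user library
# comes first whenever R (including Power BI's R host) starts
# R_LIBS_USER=C:/Users/me/Documents/R/win-library/3.4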


Dealing With Dates In R

Mathew McLean shows how to convert strings to dates using a couple of well-known packages and introduces flipTime:

The package flipTime provides utilities for working with time series and date-time data. The package can be installed from GitHub using

require(devtools)
install_github("Displayr/flipTime")

I will discuss only two functions from the package in this post, AsDate() and AsDateTime(). These are used for the conversion of date and date-time strings, respectively. These functions build on the convenience and speed of the lubridate package, and they provide additional functionality that makes them easier to use. The functions are smart about identifying the proper format, so the user doesn’t need to specify the format(s) as inputs. At the same time, both AsDate() and AsDateTime() are careful not to convert strings to dates when they are not formatted as dates, and they warn the user when the date format is ambiguous.
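
A hedged example of the two functions, with input strings of my own choosing rather than the article’s:

library(flipTime)

AsDate("29 October 2017")            # parses without an explicit format string
AsDateTime("2017-10-29 15:30:00")    # likewise for date-time strings
AsDate("1/2/2017")                   # an ambiguous day/month order should draw a warning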

Check it out.


ARIMA In R

Subhasree Chatterjee shows us how to use R to implement an ARIMA model:

Once the data is ready and satisfies all the assumptions of modeling, we need three values to determine the order of the model to fit: p, d, and q, non-negative integers that refer to the order of the autoregressive, integrated, and moving average parts of the model, respectively.

To determine which p and q values are appropriate, we need to run the acf() and pacf() functions.

pacf() computes the partial autocorrelation function: at lag k, it describes the correlation between data points that are exactly k steps apart, after accounting for their correlation with the data between those k steps. It helps to identify the number of autoregressive (AR) coefficients (the order p) in an ARIMA model.
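
As a minimal end-to-end sketch using base R and the built-in AirPassengers series (my choice of data, not the article’s):

series <- log(AirPassengers)   # stabilize the variance
d_series <- diff(series)       # difference once, so d = 1

acf(d_series)    # the ACF suggests the MA order q
pacf(d_series)   # the PACF suggests the AR order p

fit <- arima(series, order = c(1, 1, 1))   # fit ARIMA(p = 1, d = 1, q = 1)
predict(fit, n.ahead = 12)                 # forecast the next 12 months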

ARIMA feels like it should be too simple to work, but it does.


ggplot2 Scales And Coordinates

I continue my series on ggplot2:

The other thing I want to cover today is coordinate systems.  The ggplot2 documentation shows seven coordinate functions.  There are good reasons to use each, but I’m only going to demonstrate one.  By default, we use the Cartesian coordinate system and ggplot2 sets the viewing space.  This viewing space covers the fullness of your data set and generally is reasonable, though you can change the viewing area using the xlim and ylim parameters.

The special coordinate system I want to point out is coord_flip, which flips the X and Y axes.  This allows us, for example, to turn a column chart into a bar chart.  Taking our life expectancy by continent data, I can create a bar chart, whereas before we’ve been looking at column charts.
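
A minimal sketch of coord_flip(), again assuming the gapminder data set:

library(ggplot2)
library(gapminder)

avg_life <- aggregate(lifeExp ~ continent, data = gapminder, FUN = mean)

ggplot(avg_life, aes(x = continent, y = lifeExp)) +
  geom_col() +    # a vertical column chart...
  coord_flip()    # ...flipped into a horizontal bar chart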

There are a lot of pictures and more step-by-step work.  Most of these are still 3-4 lines of code, so again, pretty simple.
