Press "Enter" to skip to content

Category: R

Why Does Empirical Variance Use n-1 Instead Of n?

Sebastian Sauer gives us a simulation showing why we use n-1 instead of n as the denominator when calculating the variance of a sample:

Our results show that the variance of the sample is smaller than the empirical variance; however even the empirical variance too is a little too small compared with the population variance (which is 1). Note that sample size was n=10 in each draw of the simulation. With sample size increasing, both should get closer to the “real” (population) sample size (although the bias is negligible for the empirical variance). Let’s check that.

This is an R-heavy post and does a great job of showing that it’s necessary, and ends with  recommended reading if you want to understand the why.

Comments closed

ggplot2 Geoms And Aesthetics

Tyler Rinker digs into ggplot2’s geoms and aesthetics:

I thought it my be fun to use the geoms aesthetics to see if we could cluster aesthetically similar geoms closer together. The heatmap below uses cosine similarity and heirarchical clustering to reorder the matrix that will allow for like geoms to be found closer to one another (note that today I learned from “R for Data Science” about the seriation package [https://cran.r-project.org/web/packages/seriation/index.html] that may make this matrix reordering task much easier).

It’s an interesting analysis of what’s available within ggplot2 and a detailed look at how different geoms fit together with respect to aesthetic options.

Comments closed

Legible Function Chaining In R

John Mount shows a few techniques for legible function chaining with R:

The dot intermediate convention is very succinct, and we can use it with base R transforms to get a correct (and performant) result. Like all conventions: it is just a matter of teaching, learning, and repetition to make this seem natural, familiar and legible.

My preference is to use dplyr + magrittr because I really do like that pipe operator.  John’s point is well-taken, however:  you don’t need to use the tidyverse to write clean R code, and there can be value in using the base functionality.

Comments closed

vtreat

John Mount explains the vtreat package that he and Nina Zumel have put together:

When attempting predictive modeling with real-world data you quicklyrun into difficulties beyond what is typically emphasized in machine learning coursework:

  • Missing, invalid, or out of range values.
  • Categorical variables with large sets of possible levels.
  • Novel categorical levels discovered during test, cross-validation, or model application/deployment.
  • Large numbers of columns to consider as potential modeling variables (both statistically hazardous and time consuming).
  • Nested model bias poisoning results in non-trivial data processing pipelines.

Any one of these issues can add to project time and decrease the predictive power and reliability of a machine learning project. Many real world projects encounter all of these issues, which are often ignored leading to degraded performance in production.

vtreat systematically and correctly deals with all of the above issues in a documented, automated, parallel, and statistically sound manner.

That’s immediately going onto my learn-more list.

Comments closed

R 3.4.4 Now Available

David Smith notes that R 3.4.4 is now generally available:

R 3.4.4 has been released, and binaries for Windows, Mac, Linux and now available for download on CRAN. This update (codenamed “Someone to Lean On” — likely a Peanuts reference, though I couldn’t find which one with a quick search) is a minor bugfix release, and shouldn’t cause any compatibility issues with scripts or packages written for prior versions of R in the 3.4.x series.

Read on to see the change list.

Comments closed

Levels And Unique In R

Eric Cai demonstrates the difference between levels() and unique() when dealing with factors in R:

The new data set “iris2” does not have any rows containing “setosa” as a possible value of “Species”, yet the levels() function still shows “setosa” in its output.

According to the user G5W in Stack Overflow, this is a desirable behaviour for the levels() function.  Here is my interpretation of the intent behind the creators of base R: The possible values of a factor are fundamental attributes of that variable, which should not be altered because of changes in the data.

There’s some back-and-forth in the comments; my takeaway is that both are useful functions depending upon what, exactly, you want to learn.

Comments closed

For Loops And R

John Mount has a couple of tips around using for loops in R.  First up, pre-allocate lists to make certain types of iterative processing faster:

Another R tip. Use vector(mode = "list") to pre-allocate lists.

result <- vector(mode = "list", 3)
print(result)
#> [[1]]
#> NULL
#> 
#> [[2]]
#> NULL
#> 
#> [[3]]
#> NULL

The above used to be critical for writing performant R code (R seems to have greatly improved incremental list growth over the years). It remains a convenient thing to know.

Also, use loop indices when iterating through for loops:

Below is an R annoyance that occurs again and again: vectors lose class attributes when you iterate over them in a for()-loop.

d <- c(Sys.time(), Sys.time())
print(d)
#> [1] "2018-02-18 10:16:16 PST" "2018-02-18 10:16:16 PST"

for(di in d) {
  print(di)
}
#> [1] 1518977777
#> [1] 1518977777

Notice we printed numbers, not dates/times.

Very useful information.

Comments closed

Rolling Calculations In R

Steph Locke explains some tricky behavior with window functions in R:

So looking at the code I wrote, you may have expectedc2 to hold NA, 3, 5, ... where it’s taking the current value and the prior value to make a window of width 2. Another reasonable alternative is that you may have expected c2 to hold NA, NA, 3, ... where it’s summing up the prior two values. But hey, it’s kinda working like cumsum() right so that’s ok! But wait, check out c3. I gave c3 a window of width 3 and it gave me NA, 6, 9, ... which looks like it’s summing the prior value, the current value, and the next value. …. That’s weird right?

It turns out the default behaviour for these rolling calculations is to center align the window, which means the window sits over the current value and tries it’s best to fit over the prior and next values equally. In the case of us giving it an even number it decided to put the window over the next values more than the prior values.

Knowing your window is critical when using a window function, and knowing that some functions have different default windows than others helps you be prepared.

Comments closed

Disagreement On Outliers

Antony Unwin reviews how various packages track outliers using the Overview of Outliers plot in R:

The starting point was a recent proposal of Wilkinson’s, his HDoutliers algorithm. The plot above shows the default O3 plot for this method applied to the stackloss dataset. (Detailed explanations of O3 plots are in the OutliersO3 vignettes.) The stackloss dataset is a small example (21 cases and 4 variables) and there is an illuminating and entertaining article (Dodge, 1996) that tells you a lot about it.

Wilkinson’s algorithm finds 6 outliers for the whole dataset (the bottom row of the plot). Overall, for various combinations of variables, 14 of the cases are found to be potential outliers (out of 21!). There are no rows for 11 of the possible 15 combinations of variables because no outliers are found with them. If using a tolerance level of 0.05 seems a little bit lax, using 0.01 finds no outliers at all for any variable combination.

Interesting reading.

Comments closed