Category: R

Donating To The R Foundation

Published 2018-12-12 by Kevin Feasel

Mark Niemann-Ross explains how you can donate to the R Foundation:

I benefit from the work of the R Foundation. They oversee the language, but also encourage a healthy ecosystem. CRAN happens because of them. Updates to R happen because of them. useR! happens because of them. Every day, you and I are the recipients of some part of their time.
The least we can do is show them some appreciation. If you point your web browser at https://www.r-project.org/foundation/donations.html you’ll find a convenient (and surprisingly inexpensive) place to express your appreciation. As an individual, you can send these kind folks twenty-five euros to tell them you’re in favor of what they do.

But be sure to read the whole thing, especially if you are an American who wants the donation to be tax-deductible. I believe that earmarking in this case is adding special instructions on SIAA’s PayPal page.

Comments closed

Timing Means Of Groups With R

Published 2018-12-12 by Kevin Feasel

John Mount shares some performance measures pitting data.table against various dplyr methods for calculating grouped means:

In this reproduction attempt we see:
– The dplyr time being around 0.05 seconds. This is about 5 times slower than claimed.
– The dplyr sum()/n() time is about 0.2 seconds, about 5 times faster than claimed.
– The data.table time being around 0.004 seconds. This is about three times as fast as the dplyr claims, and over ten times as fast as the actual observed dplyr behavior.

Read the whole thing. If you want to replicate it yourself, check out the RMarkdown file.

Comments closed

ggmap Tutorial

Published 2018-12-12 by Kevin Feasel

Laura Ellis has an updated ggmap tutorial:

For those of you who have been following along with issue #51 in the ggmap repo, you’ll notice that there have been a number of changes in the Google Maps Static API service. Unfortunately these have caused some breakage in previous ggmap functionality.
If you used this package prior to July 2018, you may were likely able to do so without signing up for the Google Static Map API service yourself. As indicated on the the ggmap github repo – “Google has recently changed its API requirements, and ggmap users are now required to provide an API key and enable billing. The billing enablement especially is a bit of a downer, but you can use the free tier without incurring charges. Also, the service being exposed through an easy to use r package that extends ggplot2 is pretty great so I’ll allow it.

This recent API change hurts. But click through for the tutorial, which doesn’t hurt.

Comments closed

Reporting On Unit Tests In R With covrpage

Published 2018-12-11 by Kevin Feasel

Maelle Salmon recaps Locke Data’s involvement with the covrpage package:

To read more about getting started with covrpage in your own package in a few lines of code only, we recommend checking out the “get started” vignette. It explains more how to setup the Travis deploy, mentions which functions power the covrpage report, and gives more motivation for using covrpage.
And to learn how the information provided by covrpage should be read, read the “How to read the covrpage report” vignette.

Check it out.

Comments closed

The Intuition Behind Principal Component Analysis

Published 2018-12-07 by Kevin Feasel

Holger von Jouanne-Diedrich gives us an intuition behind how principal component analysis (PCA) works:

Principal component analysis (PCA) is a dimension-reduction method that can be used to reduce a large set of (often correlated) variables into a smaller set of (uncorrelated) variables, called principal components, which still contain most of the information.
PCA is a concept that is traditionally hard to grasp so instead of giving you the n’th mathematical derivation I will provide you with some intuition.
Basically PCA is nothing else but a projection of some higher dimensional object into a lower dimension. What sounds complicated is really something we encounter every day: when we watch TV we see a 2D-projection of 3D-objects!

Click through for the rest of the story.

Comments closed

Plotting Diagrams In R With nest() And map()

Published 2018-12-06 by Kevin Feasel

Sebastian Sauer shows how to display multiple ggplot2 diagrams together using facets as well as a combination of the nest() and map() functions:

One simple way is to plot several facets according to the grouping variable:
d %>% 
  ggplot() +
  aes(x = hp, y = mpg) +
  geom_point() +
  facet_wrap(~ cyl)

Faceting is great, but it’s good to know the other technique as well.

Comments closed

Working With Missing Values In R

Published 2018-12-05 by Kevin Feasel

Anisa Dhana has a few examples of ways we can work with data containing missing values in R:

Imputation is a complex process that requires a good knowledge of your data. For example, it is crucial to know whether the missing is at random or not before you impute the data. I have read a nice tutorial which visualize the missing data and help to understand the type of missing, and another post showing how to impute the data with MICE package.

In this short post, I will focus on management of the missing data using the tidyverse package. Specifically, I will show how to manage missings in the long data format (i.e., more than one observation for id).

Anisa shows a few different techniques, depending upon what you need to do with the data. I’d caution about using mean in the second example and instead typically prefer median, as replacing missing values with the median won’t alter the distribution in the way that it can with mean.

Comments closed

Gradient Boosting And XGBoost

Published 2018-11-30 by Kevin Feasel

Shirin Glander has another English-language transcript from a German video, this time covering gradient boosting techniques:

Let’s look at how Gradient Boosting works. Most of the magic is described in the name: “Gradient” plus “Boosting”.

Boosting builds models from individual so called “weak learners” in an iterative way. In the Random Forests part, I had already discussed the differences between Bagging and Boostingas tree ensemble methods. In boosting, the individual models are not built on completely random subsets of data and features but sequentially by putting more weight on instances with wrong predictions and high errors. The general idea behind this is that instances, which are hard to predict correctly (“difficult” cases) will be focused on during learning, so that the model learns from past mistakes. When we train each ensemble on a subset of the training set, we also call this Stochastic Gradient Boosting, which can help improve generalizability of our model.

The gradient is used to minimize a loss function, similar to how Neural Nets utilize gradient descent to optimize (“learn”) weights. In each round of training, the weak learner is built and its predictions are compared to the correct outcome that we expect. The distance between prediction and truth represents the error rate of our model. These errors can now be used to calculate the gradient. The gradient is nothing fancy, it is basically the partial derivative of our loss function – so it describes the steepness of our error function. The gradient can be used to find the direction in which to change the model parameters in order to (maximally) reduce the error in the next round of training by “descending the gradient”.

Along with neural networks, gradient boosting has become one of the dominant algorithms for machine learning, and is well worth learning about.

Comments closed

Visualizing Traditional Japanese Color Palettes

Published 2018-11-30 by Kevin Feasel

Chisato den Engelsen looks at 465 traditional color palettes used in Japan:

Since each of colours had name, I also was curious if there are some characters that are used more often than other. Colour name was written in two ways in this website. One in Kanji and other in Hiragana.

I love wordcloud2 to visualize the wordcloud, so I can see which characters appears more often the others.

It’s an interesting exercise and all of the R code is included. Be sure to check out the list of colors with a character representing “rat” or “mouse” in the name. H/T R-Bloggers

Comments closed

Azure SQL Database Supports R Integration

Published 2018-11-29 by Kevin Feasel

David Smith notes that Azure SQL Database now has (in preview) support for R:

Azure SQL Database, the database-as-a-service based on Microsoft SQL Server, now offers R integration. (The service is currently in preview; details on how to sign up for the preview are provided in that link.) While you’ve been able to run R in SQL Server in the cloud since the release of SQL Server 2016 by running a virtual machine, Azure SQL Database is a fully-managed instance that doesn’t require you to set up and maintain the underlying infrastructure. You just choose the size and scale of the database you want to manage, and then connect to it like any other SQL Server instance. (If you want to learn how to set up an Azure SQL database, this Microsoft Learn module is a good place to start.)

Python and Java are not yet supported, but I’d imagine that they’ll be on the way too.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31