Category: R

Nested Resampling In R

Published 2017-09-07 by Kevin Feasel

Max Kuhn describes how nested resampling works:

A common method for tuning models is grid search where a candidate set of tuning parameters is created. The full set of models for every combination of the tuning parameter grid and the resamples is created. Each time, the assessment data are used to measure performance and the average value is determined for each tuning parameter.

The potential problem is, once we pick the tuning parameter associated with the best performance, this value is usually quoted as the performance of the model. There is serious potential for optimization bias since we uses the same data to tune the model and quote performance. This can result in an optimistic estimate of performance.

Nested resampling does an additional layer of resampling that separates the tuning activities from the process used to estimate the efficacy of the model. An outer resampling scheme is used and, for every split in the outer resample, another full set of resampling splits are created on the original analysis set. For example, if 10-fold cross-validation is used on the outside and 5-fold cross-validation on the inside, a total of 500 models will be fit. The parameter tuning will be conducted 10 times and the best parameters are determined from the average of the 5 assessment sets.

Definitely worth the read. H/T R-Bloggers

Comments closed

Parallelism In R

Published 2017-09-06 by Kevin Feasel

Florian Prive shows off a few methods for parallelizing code in R:

Parallelize with foreach

You need to do at least two things:

replace %do% by %dopar%. Basically, always use %dopar% because you can use registerDoSEQ() is you really want to run the foreach sequentially.
register a parallel backend using one of the packages that begin with do (such as doParallel, doMC, doMPI and more). I will list only the two main parallel backends because there are too many of them.

Check it out. Florian spends a lot of time with foreach and doParallel, a little bit of time with flock, and mentions Microsoft R Open. H/T R-Bloggers

Comments closed

Counting Rows In Spark With Dplyr

Published 2017-09-04 by Kevin Feasel

John Mount discusses the difficulty of using dplyr to count rows in Spark:

That doesn’t work (apparently by choice!). And I find myself in the odd position of having to defend expecting nrow() to return the number of rows.

There are a number of common legitimate uses of nrow() in user code and package code including:

Checking if a table is empty.
Checking the relative sizes of tables to re-order or optimize complicated joins (something our join planner might add one day).
Confirming data size is the same as reported in other sources (Spark, database, and so on).
Reporting amount of work performed or rows-per-second processed.

Read the whole thing; this seems unnecessarily complicated.

Comments closed

Text Featurizing With Microsoft R Server

Published 2017-09-01 by Kevin Feasel

David Smith has a post summarizing sentiment analysis with Microsoft R Server:

Tsuyoshi Matsuzaki demonstrates the process in a post at the MSDN Blog. The post explores the Multi-Domain Sentiment Dataset, a collection of product reviews from Amazon.com. The dataset includes reviews from 975,194 products on Amazon.com from a variety of domains, and for each product there is a text review and a star rating of 1, 2, 4, or 5. (There are no 3-star rated reviews in the data set.) Here’s one example, selected at random:

What a useful reference! I bought this book hoping to brush up on my French after a few years of absence, and found it to be indispensable. It’s great for quickly looking up grammatical rules and structures as well as vocabulary-building using the helpful vocabulary lists throughout the book. My personal favorite feature of this text is Part V, Idiomatic Usage. This section contains extensive lists of idioms, grouped by their root nouns or verbs. Memorizing one or two of these a day will do wonders for your confidence in French. This book is highly recommended either as a standalone text, or, preferably, as a supplement to a more traditional textbook. In either case, it will serve you well in your continuing education in the French language.

The review contains many positive terms (“useful”, “indespensable”, “highly recommended”), and in fact is associated with a 5-star rating for this book. The goal of the blog post was to find the terms most associated with positive (or negative) reviews. One way to do this is to use the featurizeText function in thje Microsoft ML package included with Microsoft R Client and Microsoft R Server. Among other things, this function can be used to extract ngrams (sequences of one, two, or more words) from arbitrary text. In this example, we extract all of the one and two-word sequences represented at least 500 times in the reviews. Then, to assess which have the most impact on ratings, we use their presence or absence as predictors in a linear model:

If you’re thinking about sentiment analysis, read the whole thing.

Comments closed

Getting Distinct Rows In R

Published 2017-09-01 by Kevin Feasel

Rob J. Hyndman shows four different techniques (one “classic” and three tidyverse) for getting a distinct subset of a data set in R:

So that looks much better — clean, short, and easy to understand. But is it fast? Rather than grabbing the first lines of each group, it has to go searching for duplicates. But avoiding grouping and ungrouping must save some time.

So I ran some microbenchmark timings:

Click through for techniques and timings. I’m not surprised that the “classic” method won out in terms of time, but for explanatory value, I’d definitely prefer trying to explain the tidyverse distinct version. H/T R-Bloggers

Comments closed

R Versus Python

Published 2017-08-31 by Kevin Feasel

Vincent Granville believes that Python is overtaking R in the realm of data science:

We use the app in question to compare search interest for R data Science versus Python Data Science, see above chart. It looks like until December 2016, R dominated, but fell below Python by early 2017. The above chart displays an interest index, 100 being maximum and 0 being minimum. Click here to access this interactive chart on Google, and check the results for countries other than US, or even for specific regions such as California or New York.

Note that Python always dominated R by a long shot, because it is a general-purpose language, while R is a specialized language. But here, we compare R and Python in the niche context of data science. The map below shows interest for Python (general purpose) per region, using the same Google index in question.

It’s an interesting look at the relative shift between R and Python as a primary language for statistical analysis.

Comments closed

Tokenizing Text With R

Published 2017-08-30 by Kevin Feasel

Rachael Tatman shows how to tokenize a set of text as the first step in a natural language processing experiment:

In this tutorial you’ll learn how to:

Read text into R

Select only certain lines

Tokenize text using the tidytext package

Calculate token frequency (how often each token shows up in the dataset)

Write reusable functions to do all of the above and make your work reproducible

For this tutorial we’ll be using a corpus of transcribed speech from bilingual children speaking in English. You can find more information on this dataset and download it here.

It’s a nice tutorial, especially because the data set is a bit of a mess.

Comments closed

One-Way ANOVA Testing With R

Published 2017-08-30 by Kevin Feasel

Bidyut Ghosh shows how to perform a one-way ANOVA test in R:

From the above results, it is observed that the F-statistic value is 17.94 and it is highly significant as the corresponding p-value is much less than the level of significance (1% or 0.01). Thus, it is wise to reject the null hypothesis of equal mean value of mileage run across all the tyre brands. In other words, the average mileage of the four tyre brands are not equal.
Now you have to find out the pair of brands which differ. For this you may use the Tukey’s HSD test.

ANOVA is a fairly simple test, but it can be quite useful to know.

Comments closed

The Value Of Tidyeval

Published 2017-08-30 by Kevin Feasel

Bruno Rodrigues explains why he likes tidyeval:

Last year, this column, let’s call it spam, had values 1 for good and 0 for bad. This year the column is called Spam and the values are 1 and 2. When I found out that this was the source of the problem, I just had to change the arguments of my functions from
generate_spam_plot(dataset = data2016, column = spam, value = 1)
generate_spam_plot(dataset = data2016, column = spam, value = 0)
to
generate_spam_plot(dataset = data2017, column = Spam, value = 1)
generate_spam_plot(dataset = data2017, column = Spam, value = 2)
without needing to change anything else. This is why I use tidyeval; without it, writing a function such as genereta_spam_plot would not be easy. It would be possible, but not easy.

Read the whole thing.

Comments closed

R Services Packet Captures

Published 2017-08-30 by Kevin Feasel

Niels Berglund continues his R Services internals series:

In Figure 15, I set the filter to be tcp.srcport==50755, and then I applied the filter by clicking the arrow. To start using this:

Clear the Process Monitor display, and make sure you are capturing events.

Start WireShark capturing (Ctrl+E). If you get a question whether you want to save the captured packets, just click “Continue without Saving”.

Execute the code in Code Snippet 3.

The Process Monitor output looks almost the same as in Figure 9, whereas the WireShark output looks like so:

Niels also includes a recap to help people who haven’t been following along get up to speed.

Comments closed

M	T	W	T	F	S	S
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28