Press "Enter" to skip to content

Category: R

R Versus Python

Vincent Granville believes that Python is overtaking R in the realm of data science:

We use the app in question to compare search interest for R data Science versus Python Data Science, see above chart.  It looks like until December 2016, R dominated, but fell below Python by early 2017. The above chart displays an interest index, 100 being maximum and 0 being minimum. Click here to access this interactive chart on Google, and check the results for countries other than US, or even for specific regions such as California or New York.

Note that Python always dominated R by a long shot, because it is a general-purpose language, while R is a specialized language. But here, we compare R and Python in the niche context of data science. The map below shows interest for Python (general purpose) per region, using the same Google index in question.

It’s an interesting look at the relative shift between R and Python as a primary language for statistical analysis.

Comments closed

Tokenizing Text With R

Rachael Tatman shows how to tokenize a set of text as the first step in a natural language processing experiment:

In this tutorial you’ll learn how to:

  • Read text into R
  • Select only certain lines
  • Tokenize text using the tidytext package
  • Calculate token frequency (how often each token shows up in the dataset)
  • Write reusable functions to do all of the above and make your work reproducible

For this tutorial we’ll be using a corpus of transcribed speech from bilingual children speaking in English.  You can find more information on this dataset and download it here.

It’s a nice tutorial, especially because the data set is a bit of a mess.

Comments closed

One-Way ANOVA Testing With R

Bidyut Ghosh shows how to perform a one-way ANOVA test in R:

From the above results, it is observed that the F-statistic value is 17.94 and it is highly significant as the corresponding p-value is much less than the level of significance (1% or 0.01). Thus, it is wise to reject the null hypothesis of equal mean value of mileage run across all the tyre brands. In other words, the average mileage of the four tyre brands are not equal.
Now you have to find out the pair of brands which differ. For this you may use the Tukey’s HSD test.

ANOVA is a fairly simple test, but it can be quite useful to know.

Comments closed

The Value Of Tidyeval

Bruno Rodrigues explains why he likes tidyeval:

Last year, this column, let’s call it spam, had values 1 for good and 0 for bad. This year the column is called Spam and the values are 1 and 2. When I found out that this was the source of the problem, I just had to change the arguments of my functions from

generate_spam_plot(dataset = data2016, column = spam, value = 1)
generate_spam_plot(dataset = data2016, column = spam, value = 0)

to

generate_spam_plot(dataset = data2017, column = Spam, value = 1)
generate_spam_plot(dataset = data2017, column = Spam, value = 2)

without needing to change anything else. This is why I use tidyeval; without it, writing a function such as genereta_spam_plot would not be easy. It would be possible, but not easy.

Read the whole thing.

Comments closed

R Services Packet Captures

Niels Berglund continues his R Services internals series:

In Figure 15, I set the filter to be tcp.srcport==50755, and then I applied the filter by clicking the arrow. To start using this:

  • Clear the Process Monitor display, and make sure you are capturing events.
  • Start WireShark capturing (Ctrl+E). If you get a question whether you want to save the captured packets, just click “Continue without Saving”.
  • Execute the code in Code Snippet 3.

The Process Monitor output looks almost the same as in Figure 9, whereas the WireShark output looks like so:

Niels also includes a recap to help people who haven’t been following along get up to speed.

Comments closed

Gradient Boosting In R

Anish Sing Walia walks us through a gradient boosting exercise using R:

An important thing to remember in boosting is that the base learner which is being boosted should not be a complex and complicated learner which has high variance for e.g a neural network with lots of nodes and high weight values.For such learners boosting will have inverse effects.

So I will explain Boosting with respect to decision trees in this tutorial because they can be regarded as weak learners most of the times.We will generate a gradient boosting model.

Click through for more details.  H/T R-Bloggers

Comments closed

Regression Trees And Double Seasonal Time Series Trends

Peter Laurinec walks us through an example of using regression trees to solve a problem with double-seasonal time series data in R:

Classification and regression tree (or decision tree) is broadly used machine learning method for modeling. They are favorite because of these factors:

  • simple to understand (white box)
  • from a tree we can extract interpretable results and make simple decisions
  • they are helpful for exploratory analysis as binary structure of tree is simple to visualize
  • very good prediction accuracy performance
  • very fast
  • they can be simply tuned by ensemble learning techniques

But! There is always some “but”, they poorly adapt when new unexpected situations (values) appears. In other words, they can not detect and adapt to change or concept drift well (absolutely not). This is due to the fact that tree creates during learning just simple rules based on training data. Simple decision tree does not compute any regression coefficients like linear regression, so trend modeling is not possible. You would ask now, so why we are talking about time series forecasting with regression tree together, right? I will explain how to deal with it in more detail further in this post.

This was a very interesting article.  Absolutely worth reading.  H/T R-Bloggers

Comments closed

R Services Internal Communication Mechanisms

Niels Berglund continues his R Services internals series:

When browsing for the symbols, you can use this command: x /1 *!TCP*. By using the option /1 you’ll only see the names, and no addresses. On my machine that gives me quite a lot, but there are two entries that catch my eye: sqllang!Tcp::AcceptConnection and sqllang!Tcp::Close. So let us set breakpoints at those two symbols, and see what happens when we execute our code.

The result when executing the code is that we initially break at sqllang!Tcp::AcceptConnection. Followed somewhat later by breaking at sqllang!Tcp::Close. Cool, this seems to work – let us set some more breakpoints and try to figure out the flow of events.

The first half recapitulates his previous findings, and then he incorporates new information in the second half.

Comments closed

Compacting Shared Libraries In R

Dirk Eddelbuettel compacts the tidyverse:

Of course, there is a third way: just run strip --strip-debug over all the shared libraries after the build. As the path is standardized, and the shell does proper globbing, we can just do

$ strip --strip-debug /usr/local/lib/R/site-library/*/libs/*.so

using a double-wildcard to get all packages (in that R package directory) and all their shared libraries. Users on macOS probably want .dylibon the end, users on Windows want another computer as usual (just kidding: use .dll). Either may have to adjust the path which is left as an exercise to the reader.

When running this against the tidyverse library, shared library sizes dropped from 78 MB down to 6.6 MB.  Not bad for a single command. H/T R-Bloggers

Comments closed

A dplyr Quiz

John Mount wants to know how well you understand dplyr:

dplyr is one of the most popular R packages. It is powerful and important. But is it in fact easily comprehensible?

dplyr makes sense to those of us who use it a lot. And we can teach part time R users a lot of the common good use patterns.

But, is it an easy task to study and characterize dplyr itself?

Take John’s quiz and find out.  He wasn’t kidding about it being an advanced quiz.

Comments closed