Press "Enter" to skip to content

Category: R

Microsoft R Open 3.5.0 Released

David Smith announces that Microsoft R Open 3.5.0 is now available:

Microsoft R Open 3.5.0 is now available for download for Windows, Mac and Linux. This update includes the open-source R 3.5.0 engine, which is a major update with many new capabilities and improvements to R. In particular, it includes a major new framework for handling data in R, with some major behind-the-scenes performance and memory-use benefits (and with further improvements expected in the future).

Microsoft R Open 3.5.0 points to a fixed CRAN snapshot taken on June 1 2018. This provides a reproducible experience when installing CRAN packages by default, but you always change the default CRAN repository or the built-in checkpoint package to access snapshots of packages from an earlier or later date.

It’s nice to see Microsoft keeping pace with R changes; they look like they’re averaging about 6-8 weeks from an R point release to an MRO release.

Comments closed

rqdatatable — Wrangling Lots Of Data, Fast

John Mount explains the motivation behind rqdatatable and puts together a performance test:

rquery is already one of the fastest and most teachable (due to deliberate conformity to Codd’s influential work) tools to wrangle data on databases and big data systems. And now rquery is also one of the fastest methods to wrangle data in-memory in R (thanks to data.table, via a thin adaption supplied by rqdatatable).

Teaching rquery and fully benchmarking it is a big task, so in this note we will limit ourselves to a single example and benchmark. Our intent is to use this example to promote rquery and rqdatatable, but frankly the biggest result of the benchmarking is how far out of the pack data.tableitself stands at small through large problem sizes. This is already known, but it is a much larger difference and at more scales than the typical non-data.table user may be aware of.

Click through for the benchmark and information on how to grab the package before it goes into CRAN.

Comments closed

Taking Screenshots With R

Abdul Majed Raja shows us how to take screenshots of webpages using R:

webshot package provides one simple function webshot() that takes a webpage url as its first argument and saves it in the given file name that is its second argument. It is important to note that the filename includes the file extensions like ‘.jpg’, ‘.png’, ‘.pdf’ based on which the output file is rendered. Below is the basic structure of how the function goes:

library(webshot)

#webshot(url, filename.extension)
webshot(“https://www.listendata.com/”, “listendata.png”)

If no folder path is specified along with the filename, the file is downloaded in the current working directory which can be checked with getwd().

Now that we understood the basics of the webshot() function, It is time for us to begin with our cases – starting with downloading/converting a webpage as a PDFcopy.

This isn’t something I’d expect to do every day, but I could see it being useful as part of a notebook to give the user a sanity check, like if a webpage or data set has a last updated timestamp that you want to check.  H/T R-Bloggers

Comments closed

Native Scoring With SQL Server 2017 R Services

Tomaz Kastrun gives us an example using native scoring in SQL Server 2017 Machine Learning Services:

Native scoring in SQL Server 2017 comes with couple of limitations, but also with a lot of benefits. Limitations are:

  • currently supports only SQL server 2017 and Windows platform

  • trained model should not exceed 100 MiB in size

  • Native scoring with PREDICT function supports only following algorithms from RevoScaleR library:

    • rxLinMod (linear model as linear regression)

    • rxLogit (logistic regression)

    • rxBTrees (Parallel external memory algorithm for Stochastic Gradient Boosted Decision Trees)

    • rxDtree (External memory algorithm for Classification and Regression Trees

    • rxDForest (External memory algorithm for Classification and Regression Decision Trees)

Read on for an example.  If you’re using one of these methods, then native scoring is extremely fast and a bit more flexible than I originally anticipated.  The problem is that you have to use one of those methods.

Comments closed

WVPlots 1.0.0

John Mount announces WVPlots 1.0.0:

Nina Zumel and I have been working on packaging our favorite graphing techniques in a more reusable way that emphasizes the analysis task at hand over the steps needed to produce a good visualization. We are excited to announce the WVPlots is now at version 1.0.0 on CRAN!

The idea is: we sacrifice some of the flexibility and composability inherent to ggplot2 in R for a menu of prescribed presentation solutions. This is a package to produce plots while you are in the middle of another task.

I like this idea:  I know the kind of plot I need and just want to throw something together for myself to give me an idea of the underlying data.

Comments closed

Probabilities And Poker

Steve Miller has a notebook on 5-card draw probabilities:

The population of 5 card draw hands, consisting of 52 choose 5 or 2598960 elements, is pretty straightforward both mathematically and statistically.

So of course ever the geek, I just had to attempt to show her how probability and statistics converge. In addition to explaining the “combinatorics” of the counts and probabilities, I undertook two computational exercises. The first was to delineate all possible combinations of 5 card draws from a 52 card deck, counting occurrences of relevant combinations such as 2 pair, a straight, or nothing in a cell loop.

Steve has made his notebook available for us.

Comments closed

Principal Component Analysis With Stack Overflow Data

Julia Silge explains Principal Component Analysis and shows us an example using Stack Overflow data:

We have tidy data, both because that’s what I get when querying our databases and because it is useful for exploratory data analysis when preparing for a machine learning algorithm like PCA. To implement PCA, we need a matrix, and in this case a sparse matrix makes most sense. Most developers do not visit most technologies so there are lots of zeroes in our matrix. The tidytext package has a function cast_sparse() that takes tidy data and casts it to a sparse matrix.

sparse_tag_matrix <- tag_percents %>% tidytext::cast_sparse(User, Tag, Value)

Several of the implementations for PCA in R are not sparse matrix aware, such as prcomp(); the first thing it will do is coerce the BEAUTIFUL SPARSE MATRIX you just made into a regular matrix, and then you will be sitting there for one zillion years with no RAM left. (That is a precise and accurate estimate from my benchmarking, obviously.) One option that does take advantage of sparse matrices is the irlba package.

This is a great walkthrough of an important topic.

Comments closed

Using xplain To Interpret Model Results

Joachim Zuckarelli walks us through the xplain package in R:

The above XML produces the following output (don’t worry too much about the call of xplain(), we will discuss later on in more detail how to work with the xplain() function):

library(car)
library(xplain)
xplain(call="lm(education ~ young + income + urban, data=Anscombe)",
xml="http://www.zuckarelli.de/xplain/example_lm_foreach.xml")

##
## Call:
## lm(formula = education ~ young + income + urban, data = Anscombe)
##
## Coefficients:
## (Intercept) young income urban
## -286.83876 0.81734 0.08065 -0.10581
##
##
## Interpreting the coefficients
## —————————–
## Your coefficient ‘(Intercept)’ is smaller than zero.
##
## Your coefficient ‘young’ is larger than zero. This means that the
## value of your dependent variable ‘education’ changes by 0.82 for
## any increase of 1 in your independent variable ‘young’.
##
## Your coefficient ‘income’ is larger than zero. This means that the
## value of your dependent variable ‘education’ changes by 0.081 for
## any increase of 1 in your independent variable ‘income’.
##
## Your coefficient ‘urban’ is smaller than zero. This means that the
## value of your dependent variable ‘education’ changes by -0.11 for
## any increase of 1 in your independent variable ‘urban’.

I’ll be interested in looking at this in more detail, though my first glance indication is that it’ll be useful mostly in large shops with different teams creating and using models.

Comments closed

Sentiment Analysis Of Hotel California

Sara Locatelli analyzes the lyrics to Hotel California using tidytext:

Sentiment analysis is a method of natural language processing that involves classifying words in a document based on whether a word is positive or negative, or whether it is related to a set of basic human emotions; the exact results differ based on the sentiment analysis method selected. The tidytext R package has 4 different sentiment analysis methods:

  • “AFINN” for Finn Årup Nielsen – which classifies words from -5 to +5 in terms of negative or positive valence
  • “bing” for Bing Liu and colleagues – which classifies words as either positive or negative
  • “loughran” for Loughran-McDonald – mostly for financial and nonfiction works, which classifies as positive or negative, as well as topics of uncertainty, litigious, modal, and constraining
  • “nrc” for the NRC lexicon – which classifies words into eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) as well as positive or negative sentiment

Sentiment analysis works on unigrams – single words – but you can aggregate across multiple words to look at sentiment across a text.

To demonstrate sentiment analysis, I’ll use one of my favorite songs: “Hotel California” by the Eagles.

Read the whole thing, though you can’t check out afterward.

Comments closed

Digging Into The SQL Compute Context With R Services

Niels Berglund dives into how the SQL Compute Context works with R Services:

In the code above we use the RxInSqlServer() function to indicate we want to execute in a SQL context. The connectionString property defines where we execute, and the numTasks property sets the number of tasks (processes) to run for each computation, in Code Snippet 4 it is set to 1 which from a processing perspective should match what we do in Code Snippet 3. Before we execute the code in Code Snippet 4 we do what we did before we ran the code in Code Snippet 3:

  • Run Process Explorer as admin.
  • Navigate to the devenv.exe process in Process Explorer.
  • In addition, also look at the Launchpad.exe process in Process Explorer.

When we execute we see that the BxlServer.exe processes under the Microsoft.R.Host.exe processes are idling, but when we look at the Launchpad.exe process we see this:

This is a bit deep but interesting reading.

Comments closed