
Category: R

Parallelizing Linear Regression With MapReduce

Arthur Charpentier shows us the math behind using MapReduce to parallelize a linear regression:

Sometimes, with big data, matrices are too big to handle, and it is possible to use tricks to numerically still do the map. Map-Reduce is one of those. With several cores, it is possible to split the problem, to map on each machine, and then to aggregate it back at the end.

Arthur gives us an interesting example in R to boot.
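The split-and-aggregate idea can be sketched in a few lines of base R. This is a minimal illustration rather than Arthur's own code: the simulated data, the four-way split, and the names `chunks`, `mapped`, `beta_mr`, and `beta_lm` are all assumptions made for the example. It relies on the fact that the cross-products X'X and X'y for the full data are just the sums of the per-chunk cross-products.

```r
# Simulated data (an assumption for illustration): intercept plus two predictors
set.seed(1)
n <- 1000
X <- cbind(1, rnorm(n), rnorm(n))
y <- as.vector(X %*% c(2, -1, 0.5)) + rnorm(n)

# Split the row indices into four chunks, as if each lived on its own worker
chunks <- split(seq_len(n), rep(1:4, length.out = n))

# Map: each chunk computes its own X'X and X'y
mapped <- lapply(chunks, function(idx) {
  Xi <- X[idx, , drop = FALSE]
  list(xtx = crossprod(Xi), xty = crossprod(Xi, y[idx]))
})

# Reduce: sum the per-chunk pieces, then solve the normal equations once
xtx <- Reduce(`+`, lapply(mapped, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(mapped, `[[`, "xty"))
beta_mr <- solve(xtx, xty)

# Reference fit on the full data for comparison
beta_lm <- lm.fit(X, y)$coefficients
print(cbind(map_reduce = as.vector(beta_mr), full_fit = unname(beta_lm)))
```

Only the small k-by-k and k-by-1 summaries travel between workers, which is why the full design matrix never has to fit in one place. Swapping `lapply` for `parallel::mclapply` would be one way to spread the map step across cores.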

Comments closed

Granting Non-Admin Users Access To Run ML Services

Niels Berglund walks through the rights needed for a non-administrative user to execute an external script using SQL Server Machine Learning Services:

Oops, something did go wrong, as it turns out that if you try to grant permissions on extended stored procedures, which SPEES is, you need to do it from the master database. Cool, let us switch to master and do it there. Well, if you try to do that – then you get another error: the user does not exist in master, sigh!

At this stage you have a couple of options:

  • Add the login for the user to the sysadmin role, or the user to the db_owner role in the actual database. No do not do that, I am only kidding! Do.Not.Do.That!

  • Create the user in master and grant the permission. That would work.

  • Grant the permission to public.

Check it out, as there are two parts to the process.


Using DALEX To Explain Black-Box Models

Przemyslaw Biecek explains that there’s more than LIME for explaining black-box models:

I’ve heard about a number of consulting companies that decided to use a simple linear model instead of a black-box model with higher performance, because “client wants to understand factors that drive the prediction”.
And usually the discussion goes as follows: “We have tried LIME for our black-box model, it is great, but it is not working in our case”, “Have you tried other explainers?”, “What other explainers?”

So here you have a map of different visual explanations for black-box models.

Check out DALEX, which includes a Jupyter notebook example.  H/T R-Bloggers


Comparing Keras In Python Versus R

Dmitry Kisler performs image classification using Keras in both Python and R:

From the plots above, one can see that:

  • the accuracy of your model doesn’t depend on the language you use to build and train it (the plot shows only train accuracy, but the model doesn’t have high variance and the bias accuracy is around 99% as well).

  • even though 10 measurements may not be convincing, Python would reduce (by up to 15%) the time required to train your CNN model. This is somewhat expected because R uses Python under the hood when it executes Keras functions.

This is just one example, but the results are about what I’d expect.


The Dangers Of The Ellipsis In R

John Mount shows us an example where ... (the ellipsis) can come back to hurt us:

The following code example contains an easy error in using the R function unique().

vec1 <- c("a", "b", "c")
vec2 <- c("c", "d")
unique(vec1, vec2)
# [1] "a" "b" "c"

Notice none of the novel values from vec2 are present in the result. Our mistake was: we (improperly) tried to use unique() with multiple value arguments, as one would use union(). Also notice no error or warning was signaled. We used unique() incorrectly and nothing pointed this out to us. What compounded our error was R’s “...” function signature feature.

John makes it clear that ... is not itself a bad thing, just that there is a time and a place for it and misusing it can lead to hard-to-understand bugs.
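As a small base-R sketch of the point above: the first two calls show what was probably intended, and `strict_unique()` is a hypothetical wrapper (not something from John's post or from any package) that refuses extra arguments instead of silently accepting them.

```r
vec1 <- c("a", "b", "c")
vec2 <- c("c", "d")

# What was most likely intended: combine first, or use union()
union(vec1, vec2)
unique(c(vec1, vec2))

# Hypothetical guard: fail loudly if anything lands in "..."
strict_unique <- function(x, ...) {
  if (length(list(...)) > 0) {
    stop("strict_unique() takes a single vector; unexpected extra arguments")
  }
  unique(x)
}

# The mistaken call now signals an error instead of quietly dropping vec2
result <- tryCatch(
  strict_unique(vec1, vec2),
  error = function(e) conditionMessage(e)
)
print(result)
```

The trade-off is that a guard like this gives up the flexibility that `...` exists to provide, so it suits user-facing helpers better than generics that need to pass arguments through.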


wrapr 1.5.0 Now On CRAN

John Mount announces wrapr 1.5.0:

wrapr includes a lot of tools for writing better R code:

John also includes an example using the coalesce operator %?%.
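For readers who have not met a coalesce operator before, here is a base-R sketch of the general idea. The operator name `%coalesce%` and its exact behavior are assumptions made for illustration; this is not wrapr's implementation of %?%, which may handle more cases.

```r
# Hypothetical operator: keep values from the left, fill its gaps from the right
`%coalesce%` <- function(lhs, rhs) {
  if (is.null(lhs)) {
    return(rhs)
  }
  missing_idx <- is.na(lhs)
  # Recycle the right-hand side to the length of the left-hand side
  lhs[missing_idx] <- rep_len(rhs, length(lhs))[missing_idx]
  lhs
}

filled <- c(1, NA, 3) %coalesce% 0
print(filled)

default_used <- NULL %coalesce% 5
print(default_used)
```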


Methods For Detecting Anomalies In Business Metrics

Sergey Bryl’ gives us four methods for detecting anomalies in business data:

In this article, by business metrics, we mean numerical indicators we regularly measure and use to track and assess the performance of a specific business process. There is a huge variety of business metrics in the industry: from conventional to unique ones. The latter are specifically developed for and used in one company or even just by one of its teams. I want to note that usually, business metrics have dimensions, which imply the possibility of drilling down the structure of the metric. For instance, the number of sessions on the website can have dimensions: types of browsers, channels, countries, advertising campaigns, etc. where the sessions took place. The presence of a large number of dimensions per metric, on the one hand, provides a comprehensive detailed analysis, and, on the other, makes its conduct more complex.

Anomalies are abnormal values of business indicators. We cannot claim anomalies are something bad or good for business. Rather, we should see them as a signal that there have been some events that significantly influenced a business process and our goal is to determine the causes and potential consequences of such events and react immediately. Of course, from the business point of view, it is better to find such events than ignore them.

It was interesting to compare the results of the four methods.  H/T R-Bloggers


Microsoft R Open 3.5.0 Released

David Smith announces that Microsoft R Open 3.5.0 is now available:

Microsoft R Open 3.5.0 is now available for download for Windows, Mac and Linux. This update includes the open-source R 3.5.0 engine, which is a major update with many new capabilities and improvements to R. In particular, it includes a major new framework for handling data in R, with some major behind-the-scenes performance and memory-use benefits (and with further improvements expected in the future).

Microsoft R Open 3.5.0 points to a fixed CRAN snapshot taken on June 1 2018. This provides a reproducible experience when installing CRAN packages by default, but you can always change the default CRAN repository or use the built-in checkpoint package to access snapshots of packages from an earlier or later date.

It’s nice to see Microsoft keeping pace with R changes; they look like they’re averaging about 6-8 weeks from an R point release to an MRO release.
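Changing the default repository for a session is a one-liner with base R's `options()`. The sketch below assumes the public mirror at https://cloud.r-project.org; the commented-out checkpoint() call is an assumption about typical usage and needs the checkpoint package plus a snapshot server that still hosts the requested date.

```r
# Remember the current setting so it can be restored afterwards
old_repos <- getOption("repos")

# Point this session at a live CRAN mirror instead of the fixed snapshot
options(repos = c(CRAN = "https://cloud.r-project.org"))
current_cran <- getOption("repos")[["CRAN"]]
print(current_cran)

# Alternatively, pin to a dated snapshot (assumed usage, not run here):
# library(checkpoint)
# checkpoint("2018-06-01")

# Restore the original repositories
options(repos = old_repos)
```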


rqdatatable — Wrangling Lots Of Data, Fast

John Mount explains the motivation behind rqdatatable and puts together a performance test:

rquery is already one of the fastest and most teachable (due to deliberate conformity to Codd’s influential work) tools to wrangle data on databases and big data systems. And now rquery is also one of the fastest methods to wrangle data in-memory in R (thanks to data.table, via a thin adaption supplied by rqdatatable).

Teaching rquery and fully benchmarking it is a big task, so in this note we will limit ourselves to a single example and benchmark. Our intent is to use this example to promote rquery and rqdatatable, but frankly the biggest result of the benchmarking is how far out of the pack data.table itself stands at small through large problem sizes. This is already known, but it is a much larger difference and at more scales than the typical non-data.table user may be aware of.

Click through for the benchmark and information on how to grab the package before it goes into CRAN.


Taking Screenshots With R

Abdul Majed Raja shows us how to take screenshots of webpages using R:

webshot package provides one simple function webshot() that takes a webpage url as its first argument and saves it in the given file name that is its second argument. It is important to note that the filename includes the file extensions like ‘.jpg’, ‘.png’, ‘.pdf’ based on which the output file is rendered. Below is the basic structure of how the function goes:

library(webshot)

#webshot(url, filename.extension)
webshot("https://www.listendata.com/", "listendata.png")

If no folder path is specified along with the filename, the file is downloaded in the current working directory which can be checked with getwd().

Now that we understand the basics of the webshot() function, it is time for us to begin with our cases – starting with downloading/converting a webpage as a PDF copy.

This isn’t something I’d expect to do every day, but I could see it being useful as part of a notebook to give the user a sanity check, like if a webpage or data set has a last updated timestamp that you want to check.  H/T R-Bloggers
