Category: R

Oh no! That’s not what we wanted! R has messed the thing up (?). The reason is that R sees the first factor level internally as the number 1 . The second level as number two. What’s the first factor level in our case? Let’s see:
factor(tips$sex) %>% head()
#> [1] Female Male   Male   Male   Female Male  
#> Levels: Female Male
factor(tips$sex_r) %>% head()
#> [1] 1 0 0 0 1 0
#> Levels: 0 1
That’s confusing: “0” is the first level of sex_r – internally for R represented by “1”. The second level of sex_r is “1” – internally represented by “2”.

Fortunately, we get the easy answer at the end of the post.

Comments closed

Parallelizing Linear Regression With MapReduce

Published 2018-06-25 by Kevin Feasel

Arthur Charpentier shows us the math behind using MapReduce to parallelize a linear regression:

Sometimes, with big data, matrices are too big to handle, and it is possible to use tricks to numerically still do the map. Map-Reduce is one of those. With several cores, it is possible to split the problem, to map on each machine, and then to aggregate it back at the end.

Arthur gives us an interesting example in R to boot.

Comments closed

Granting Non-Admin Users Access To Run ML Services

Published 2018-06-25 by Kevin Feasel

Niels Berglund walks through the rights needed for a non-administrative user to execute an external script using SQL Server Machine Learning Services:

Oops, something did go wrong, as it turns out that if you try to grant permissions on extended stored procedures, which SPEES is, you need to do it from the master database. Cool, let us switch to master and do it there. Well, if you try to do that – then you get another error: the user does not exist in master, sigh!

At this stage you have a couple of options:

~~Add the login for the user to the sysadmin role, or the user to the db_owner role in the actual database.~~ No do not do that, I am only kidding! Do.Not.Do.That!
Create the user in master and grant the permission. That would work.
Grant the permission to public.

Check it out, as there are two parts to the process.

Comments closed

Using DALEX To Explain Black-Box Models

Published 2018-06-20 by Kevin Feasel

Przemyslaw Biecek explains that there’s more than LIME for explaining black-box models:

I’ve heard about a number of consulting companies, that decided to use simple linear model instead of a black box model with higher performance, because ,,client wants to understand factors that drive the prediction’’.
And usually the discussion goes as following: ,,We have tried LIME for our black-box model, it is great, but it is not working in our case’’, ,,Have you tried other explainers?’’, ,,What other explainers’’?

So here you have a map of different visual explanations for black-box models.

Check out DALEX, which includes a Jupyter notebook example. H/T R-Bloggers

Comments closed

Comparing Keras In Python Versus R

Published 2018-06-20 by Kevin Feasel

Dmitry Kisler performs image classification using Keras in both Python and R:

From the plots above, one can see that:

the accuracy of your model doesn’t depend on the language you use to build and train it (the plot shows only train accuracy, but the model doesn’t have high variance and the bias accuracy is around 99% as well).
even though 10 measurements may be not convincing, but Python would reduce (by up to 15%) the time required to train your CNN model. This is somewhat expected because R uses Python under the hood when executes Keras functions.

This is just one example, but the results are about what I’d expect.

Comments closed

The Dangers Of The Ellipsis In R

Published 2018-06-19 by Kevin Feasel

John Mount shows us an example where ... (the ellipsis) can come back to hurt us:

The following code example contains an easy error in using the Rfunction unique().
vec1 <- c("a", "b", "c")
vec2 <- c("c", "d")
unique(vec1, vec2)
# [1] "a" "b" "c"
Notice none of the novel values from vec2 are present in the result. Our mistake was: we (improperly) tried to use unique() with multiple value arguments, as one would use union(). Also notice no error or warning was signaled. We used unique() incorrectly and nothing pointed this out to us. What compounded our error was R‘s “...” function signature feature.

John makes it clear that ... is not itself a bad thing, just that there is a time and a place for it and misusing it can lead to hard-to-understand bugs.

Comments closed

wrapr 1.5.0 Now On CRAN

Published 2018-06-15 by Kevin Feasel

John Mount announces wrapr 1.5.0:

wrapr includes a lot of tools for writing better R code:

let() (let block)
%.>% (dot arrow pipe)
build_frame() / draw_frame() ( data.frame builders and formatters )
qc() (quoting concatenate)
:= (named map builder)
%?% (coalesce) NEW!
%.|% (reduce/expand args) NEW!
uniques() (safe unique() replacement) NEW!
partition_tables() / execute_parallel() NEW!
DebugFnW() (function debug wrappers)
λ() (anonymous function builder)

John also includes an example using the coalesce operator %?%.

Comments closed

Methods For Detecting Anomalies In Business Metrics

Published 2018-06-14 by Kevin Feasel

Sergey Bryl’ gives us four methods for detecting anomalies in business data:

In this article, by business metrics, we mean numerical indicators we regularly measure and use to track and assess the performance of a specific business process. There is a huge variety of business metrics in the industry: from conventional to unique ones. The latter are specifically developed for and used in one company or even just by one of its teams. I want to note that usually, a business metrics have dimensions, which imply the possibility of drilling down the structure of the metric. For instance, the number of sessions on the website can have dimensions: types of browsers, channels, countries, advertising campaigns, etc. where the sessions took place. The presence of a large number of dimensions per metric, on the one hand, provides a comprehensive detailed analysis, and, on the other, makes its conduct more complex.

Anomalies are abnormal values of business indicators. We cannot claim anomalies are something bad or good for business. Rather, we should see them as a signal that there have been some events that significantly influenced a business process and our goal is to determine the causes and potential consequences of such events and react immediately. Of course, from the business point of view, it is better to find such events than ignore them.

It was interesting comparing the results of the four methods. H/T R-bloggers

Comments closed

Microsoft R Open 3.5.0 Released

Published 2018-06-11 by Kevin Feasel

David Smith announces that Microsoft R Open 3.5.0 is now available:

Microsoft R Open 3.5.0 is now available for download for Windows, Mac and Linux. This update includes the open-source R 3.5.0 engine, which is a major update with many new capabilities and improvements to R. In particular, it includes a major new framework for handling data in R, with some major behind-the-scenes performance and memory-use benefits (and with further improvements expected in the future).

Microsoft R Open 3.5.0 points to a fixed CRAN snapshot taken on June 1 2018. This provides a reproducible experience when installing CRAN packages by default, but you always change the default CRAN repository or the built-in checkpoint package to access snapshots of packages from an earlier or later date.

It’s nice to see Microsoft keeping pace with R changes; they look like they’re averaging about 6-8 weeks from an R point release to an MRO release.

Comments closed

rqdatatable — Wrangling Lots Of Data, Fast

Published 2018-06-04 by Kevin Feasel

John Mount explains the motivation behind rqdatatable and puts together a performance test:

rquery is already one of the fastest and most teachable (due to deliberate conformity to Codd’s influential work) tools to wrangle data on databases and big data systems. And now rquery is also one of the fastest methods to wrangle data in-memory in R (thanks to data.table, via a thin adaption supplied by rqdatatable).

Teaching rquery and fully benchmarking it is a big task, so in this note we will limit ourselves to a single example and benchmark. Our intent is to use this example to promote rquery and rqdatatable, but frankly the biggest result of the benchmarking is how far out of the pack data.tableitself stands at small through large problem sizes. This is already known, but it is a much larger difference and at more scales than the typical non-data.table user may be aware of.

Click through for the benchmark and information on how to grab the package before it goes into CRAN.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31