Category: R

xgboost and Small Numbers of Subtrees

Published 2019-07-19 by Kevin Feasel

John Mount covers an interesting issue you can run into when using xgboost:

While reading Dr. Nina Zumel’s excellent note on bias in common ensemble methods, I ran the examples to see the effects she described (and I think it is very important that she is establishing the issue, prior to discussing mitigation).
In doing that I ran into one more avoidable but strange issue in using xgboost: when run for a small number of rounds it at first appears that xgboost doesn’t get the unconditional average or grand average right (let alone the conditional averages Nina was working with)!

It’s not something you’ll hit very often, but if you’re trying xgboost against a small enough data set with few enough rounds, it is something to keep in mind.

Comments closed

Reinforcement Learning with R

Published 2019-07-18 by Kevin Feasel

Holger von Jouanne-Diedrich takes us through concepts in reinforcement learning:

At the core this can be stated as the problem a gambler has who wants to play a one-armed bandit: if there are several machines with different winning probabilities (a so-called multi-armed bandit problem) the question the gambler faces is: which machine to play? He could “exploit” one machine or “explore” different machines. So what is the best strategy given a limited amount of time… and money?
There are two extreme cases: no exploration, i.e. playing only one randomly chosen bandit, or no exploitation, i.e. playing all bandits randomly – so obviously we need some middle ground between those two extremes. We have to start with one randomly chosen bandit, try different ones after that and compare the results. So in the simplest case the first variable e=0.1 is the probability rate with which to switch to a random bandit – or to stick with the best bandit found so far.

Click through for various cases and a pathfinding example in R. H/T R-Bloggers

Comments closed

Biases in Tree-Based Models

Published 2019-07-12 by Kevin Feasel

Nina Zumel looks at tree-based ensembling models like random forest and gradient boost and shows that they can be biased:

In our previous article , we showed that generalized linear models are unbiased, or calibrated: they preserve the conditional expectations and rollups of the training data. A calibrated model is important in many applications, particularly when financial data is involved.
However, when making predictions on individuals, a biased model may be preferable; biased models may be more accurate, or make predictions with lower relative error than an unbiased model. For example, tree-based ensemble models tend to be highly accurate, and are often the modeling approach of choice for many machine learning applications. In this note, we will show that tree-based models are biased, or uncalibrated. This means they may not always represent the best bias/variance trade-off.

Read on for an example.

Comments closed

R 3.6.1 Available

Published 2019-07-10 by Kevin Feasel

David Smith notes a new version of R is available:

On July 5, the R Core Group released the source code for the latest update to R, R 3.6.1, and binaries are now available to download for Windows, Linux and Mac from your local CRAN mirror.
R 3.6.1 is a minor update to R that fixes a few bugs. As usual with a minor release, this version is backwards-compatible with R 3.6.0 and remains compatible with your installed packages.

Click through for the changes. There is one nice addition around writeClipboard but otherwise it’s a release where you probably update if you’re bothered by a bug it fixes and otherwise skip.

Comments closed

Comparing Poisson Regression to Regressing Against Logs

Published 2019-07-09 by Kevin Feasel

Nina Zumel compares a pair of methods for performing regression when income is the dependent variable:

Regressing against the log of the outcome will not be calibrated; however it has the advantage that the resulting model will have lower relative error than a Poisson regression against income. Minimizing relative error is appropriate in situations when differences are naturally expressed in percentages rather than in absolute amounts. Again, this is common when financial data is involved: raises in salary tend to be in terms of percentage of income, not in absolute dollar increments.
Unfortunately, a full discussion of the differences between Poisson regression and regressing against log amounts was outside of the scope of our book, so we will discuss it in this note.

This is an interesting post with a great teaser for the next post in the series.

Comments closed

tidylo: Calculating Log Odds in R

Published 2019-07-09 by Kevin Feasel

Julia Silge announces a new package, tidylo:

The package contains examples in the README and vignette, but let’s walk though another, different example here. This weighted log odds approach is useful for text analysis, but not only for text analysis. In the weeks since we’ve had this package up and running, I’ve found myself reaching for it in multiple situations, both text and not, in my real-life day job. For this example, let’s look at the same data as my last post, names given to children in the US.
Which names were most common in the 1950s, 1960s, 1970s, and 1980?

This package looks like it’s worth checking out if you deal with frequency-based problems.

Comments closed

ML Services and Injectable Code

Published 2019-07-09 by Kevin Feasel

Grant Fritchey looks at sp_execute_external_script for potential SQL injection vulnerabilities:

The sharp eyed will see that the data set is defined by SQL. So, does that suffer from injection attacks? Short answer is no. If there was more than one result set within the Python code, it’s going to error out. So you’re protected there.
This is important, because the data set query can be defined with parameters. You can pass values to those parameters, heck, you’re likely to pass values to those parameters, from the external query or procedure. So, is that an attack vector?
No.

Another factor is that you need explicitly to grant EXECUTE ANY EXTERNAL SCRIPT rights to non-sysadmin, non-db_owner users, meaning a non-privileged user can’t execute external scripts at all. You can also limit the executing service account

Comments closed

Replicating Linear Models

Published 2019-07-05 by Kevin Feasel

John Mount has an interesting post looking at replicating linear models without training data:

Let’s work an example in R. Suppose we are working with a linear regression model and from our donor system we have extracted the following representation of the model as “intercept” and “betas”.
intercept <- 3 betas <- c(weight = 2, height = 4)
Our goal is to build a linear regression model that has the above coefficients. The way we are going to do this is by building our own synthetic data set such that the regression fit through this data set yields these coefficients.

It’s fairly straightforward to do this for linear models; as things get more complicated, however, the difficulty level spikes.

Comments closed

Microsoft’s R Roadmap

Published 2019-07-05 by Kevin Feasel

David Smith has a review of Microsoft’s R roadmap, focusing on Azure:

The post references this guide to the machine learning services in Azure, along with their supported languages. Services that currently support R include Azure Machine Learning Studio, SQL Server Microsoft Machine Learning Service, Microsoft Machine Learning Server, Azure Data Science Virtual Machine, Azure Databricks, and more.

David links to this strategy post:

The R and Python programming languages are primary citizens for data science on the Azure AI Platform. These are the most common languages for performing data preparation, transformation, training and operationalization of machine learning models; the core components for one’s digital transformation leveraging AI. Yet they are fundamentally different in many aspects, directly affecting not only deployed solutions IT architectures but also but also corporate strategies for developer skills and product supportability.

This series of articles is designed help you understand the options your company and customers have to support and evolve their R strategy.

It’s good to see some of this out in the open for planning purposes.

Comments closed

Using data.table to Add Aggregate Values to Data Frames

Published 2019-07-03 by Kevin Feasel

John Mount shows how you can combine := and by in the data.table package to add a new column with the results of an aggregation in R:

The “by” signals we are doing a per-group calculation, and the “:=” signals to land the results in the original data.table. This sort of window function is incredibly useful in computing things such as what fraction of a group’s mass is in each row.

It’s worth reading up on data.table if you aren’t familiar with the great things it can do.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31