Biases in Tree-Based Models

Nina Zumel looks at tree-based ensembling models like random forest and gradient boost and shows that they can be biased:

In our previous article , we showed that generalized linear models are unbiased, or calibrated: they preserve the conditional expectations and rollups of the training data. A calibrated model is important in many applications, particularly when financial data is involved.

However, when making predictions on individuals, a biased model may be preferable; biased models may be more accurate, or make predictions with lower relative error than an unbiased model. For example, tree-based ensemble models tend to be highly accurate, and are often the modeling approach of choice for many machine learning applications. In this note, we will show that tree-based models are biased, or uncalibrated. This means they may not always represent the best bias/variance trade-off.

Read on for an example.

R 3.6.1 Available

David Smith notes a new version of R is available:

On July 5, the R Core Group released the source code for the latest update to R, R 3.6.1, and binaries are now available to download for Windows, Linux and Mac from your local CRAN mirror.

R 3.6.1 is a minor update to R that fixes a few bugs. As usual with a minor release, this version is backwards-compatible with R 3.6.0 and remains compatible with your installed packages. 

Click through for the changes. There is one nice addition around writeClipboard but otherwise it’s a release where you probably update if you’re bothered by a bug it fixes and otherwise skip.

Comparing Poisson Regression to Regressing Against Logs

Nina Zumel compares a pair of methods for performing regression when income is the dependent variable:

Regressing against the log of the outcome will not be calibrated; however it has the advantage that the resulting model will have lower relative error than a Poisson regression against income. Minimizing relative error is appropriate in situations when differences are naturally expressed in percentages rather than in absolute amounts. Again, this is common when financial data is involved: raises in salary tend to be in terms of percentage of income, not in absolute dollar increments.

Unfortunately, a full discussion of the differences between Poisson regression and regressing against log amounts was outside of the scope of our book, so we will discuss it in this note.

This is an interesting post with a great teaser for the next post in the series.

tidylo: Calculating Log Odds in R

Julia Silge announces a new package, tidylo:

The package contains examples in the README and vignette, but let’s walk though another, different example here. This weighted log odds approach is useful for text analysis, but not only for text analysis. In the weeks since we’ve had this package up and running, I’ve found myself reaching for it in multiple situations, both text and not, in my real-life day job. For this example, let’s look at the same data as my last post, names given to children in the US.

Which names were most common in the 1950s, 1960s, 1970s, and 1980?

This package looks like it’s worth checking out if you deal with frequency-based problems.

ML Services and Injectable Code

Grant Fritchey looks at sp_execute_external_script for potential SQL injection vulnerabilities:

The sharp eyed will see that the data set is defined by SQL. So, does that suffer from injection attacks? Short answer is no. If there was more than one result set within the Python code, it’s going to error out. So you’re protected there.

This is important, because the data set query can be defined with parameters. You can pass values to those parameters, heck, you’re likely to pass values to those parameters, from the external query or procedure. So, is that an attack vector?


Another factor is that you need explicitly to grant EXECUTE ANY EXTERNAL SCRIPT rights to non-sysadmin, non-db_owner users, meaning a non-privileged user can’t execute external scripts at all. You can also limit the executing service account

Replicating Linear Models

John Mount has an interesting post looking at replicating linear models without training data:

Let’s work an example in R. Suppose we are working with a linear regression model and from our donor system we have extracted the following representation of the model as “intercept” and “betas”.

intercept <- 3 betas <- c(weight = 2, height = 4)

Our goal is to build a linear regression model that has the above coefficients. The way we are going to do this is by building our own synthetic data set such that the regression fit through this data set yields these coefficients.

It’s fairly straightforward to do this for linear models; as things get more complicated, however, the difficulty level spikes.

Microsoft’s R Roadmap

David Smith has a review of Microsoft’s R roadmap, focusing on Azure:

The post references this guide to the machine learning services in Azure, along with their supported languages. Services that currently support R include Azure Machine Learning StudioSQL Server Microsoft Machine Learning ServiceMicrosoft Machine Learning ServerAzure Data Science Virtual MachineAzure Databricks, and more.

David links to this strategy post:

The R and Python programming languages are primary citizens for data science on the Azure AI Platform. These are the most common languages for performing data preparation, transformation, training and operationalization of machine learning models; the core components for one’s digital transformation leveraging AI. Yet they are fundamentally different in many aspects, directly affecting not only deployed solutions IT architectures but also but also corporate strategies for developer skills and product supportability.
This series of articles is designed help you understand the options your company and customers have to support and evolve their R strategy.

It’s good to see some of this out in the open for planning purposes.

Using data.table to Add Aggregate Values to Data Frames

John Mount shows how you can combine := and by in the data.table package to add a new column with the results of an aggregation in R:

The “by” signals we are doing a per-group calculation, and the “:=” signals to land the results in the original data.table. This sort of window function is incredibly useful in computing things such as what fraction of a group’s mass is in each row.

It’s worth reading up on data.table if you aren’t familiar with the great things it can do.

Performing Row-Wise Operations with pmap

Sebastian Sauer shows how you can use pmap in the purrr library to perform row-wise aggregations:

Rowwwise operations are a quite frequent operations in data analysis. The R language environment is particularly strong in column wise operations. This is due to technical reasons, as data frames are internally built as column-by-column structures, hence column wise operations are simple, rowwise more difficult.

This post looks at some rather general way to comput rowwise statistics. Of course, numerous ways exist and there are quite a few tutorials around, notably by Jenny Bryant, and by Emil Hvitfeldt to name a few.

The ideal solution is to have your data be properly columnar, but if you’re in a pinch, it’s good to know that you can do this.

Random Forest on Small Numbers of Observations

Neil Saunders takes us through an interesting problem:

A recent question on Stack Overflow [r] asked why a random forest model was not working as expected. The questioner was working with data from an experiment in which yeast was grown under conditions where (a) the growth rate could be controlled and (b) one of 6 nutrients was limited. Their dataset consisted of 6 rows – one per nutrient – and several thousand columns, with values representing the activity (expression) of yeast genes. Could the expression values be used to predict the limiting nutrient?

The random forest was not working as expected: not one of the nutrients was correctly classified. I pointed out that with only one case for each outcome, this was to be expected – as the random forest algorithm samples a proportion of the rows, no correct predictions are likely in this case. As sometimes happens the question was promptly deleted, which was unfortunate as we could have further explored the problem.

Neil decided to explore the problem further regardless and came to some interesting conclusions.


July 2019
« Jun