In our previous article, we showed that generalized linear models are unbiased, or calibrated: they preserve the conditional expectations and rollups of the training data. A calibrated model is important in many applications, particularly when financial data is involved.
However, when making predictions on individuals, a biased model may be preferable; biased models may be more accurate, or make predictions with lower relative error than an unbiased model. For example, tree-based ensemble models tend to be highly accurate, and are often the modeling approach of choice for many machine learning applications. In this note, we will show that tree-based models are biased, or uncalibrated. This means they may not always represent the best bias/variance trade-off.
Read on for an example.
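To make the rollup claim concrete, here is a minimal sketch of the simplest case: ordinary least squares with an intercept, fit in closed form. The data is made up, and this only illustrates the calibrated side of the comparison; it is not the code from the linked article.

```python
# Ordinary least squares with an intercept, fit in closed form.
# The data below is made up for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.5, 6.9]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
intercept = mean_y - slope * mean_x

predictions = [intercept + slope * xi for xi in x]

# Rollup check: total predicted equals total observed on the training data,
# up to floating-point error.
total_gap = sum(predictions) - sum(y)
print(abs(total_gap) < 1e-9)
```

The totals match because the intercept forces the residuals to sum to zero. A tree ensemble has no such constraint built in, which is the gap the article explores.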
R 3.6.1 is a minor update to R that fixes a few bugs. As usual with a minor release, this version is backwards-compatible with R 3.6.0 and remains compatible with your installed packages.
Click through for the changes. There is one nice addition around writeClipboard, but otherwise it’s a release where you probably update if you’re bothered by a bug it fixes and otherwise skip.
Regressing against the log of the outcome will not be calibrated; however, it has the advantage that the resulting model will have lower relative error than a Poisson regression against income. Minimizing relative error is appropriate in situations when differences are naturally expressed in percentages rather than in absolute amounts. Again, this is common when financial data is involved: raises in salary tend to be in terms of percentage of income, not in absolute dollar increments.
Unfortunately, a full discussion of the differences between Poisson regression and regressing against log amounts was outside of the scope of our book, so we will discuss it in this note.
This is an interesting post with a great teaser for the next post in the series.
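A small sketch of why the log model is not calibrated, using the intercept-only case and made-up incomes. An intercept-only Poisson regression with a log link predicts the arithmetic mean, while an intercept-only regression on the log of income, exponentiated back, predicts the geometric mean. This is an illustration under those assumptions, not the authors' code.

```python
import math

# Made-up incomes, skewed the way income data usually is.
incomes = [20_000, 30_000, 45_000, 60_000, 250_000]

# Intercept-only Poisson regression (log link) predicts the arithmetic mean.
poisson_style = sum(incomes) / len(incomes)

# Intercept-only regression on log(income), exponentiated back,
# predicts the geometric mean.
log_style = math.exp(sum(math.log(v) for v in incomes) / len(incomes))

# Rollup: total predicted income versus total observed income.
total_observed = sum(incomes)
total_poisson = poisson_style * len(incomes)
total_log = log_style * len(incomes)

print(total_observed, round(total_poisson), round(total_log))
```

The Poisson-style total reproduces the observed total, while the log-style total falls short of it, since the geometric mean is never larger than the arithmetic mean.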
The package contains examples in the README and vignette, but let’s walk through another, different example here. This weighted log odds approach is useful for text analysis, but not only for text analysis. In the weeks since we’ve had this package up and running, I’ve found myself reaching for it in multiple situations, both text and not, in my real-life day job. For this example, let’s look at the same data as my last post, names given to children in the US.
Which names were most common in the 1950s, 1960s, 1970s, and 1980s?
This package looks like it’s worth checking out if you deal with frequency-based problems.
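For a feel of what a weighted log odds score does, here is a simplified two-group sketch: a log odds ratio with a uniform pseudo-count, scaled by an approximate standard error. The function name, the uniform prior, and the counts are all assumptions made for illustration; this is not the package's implementation.

```python
import math

def weighted_log_odds(counts_a, counts_b, alpha=1.0):
    """z-scored log odds of each feature in group A versus group B.

    alpha is a uniform pseudo-count (an assumption for this sketch).
    """
    features = sorted(set(counts_a) | set(counts_b))
    alpha_total = alpha * len(features)
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())
    scores = {}
    for f in features:
        y_a = counts_a.get(f, 0) + alpha
        y_b = counts_b.get(f, 0) + alpha
        log_odds_a = math.log(y_a / (n_a + alpha_total - y_a))
        log_odds_b = math.log(y_b / (n_b + alpha_total - y_b))
        # Approximate variance of the difference in log odds.
        variance = 1.0 / y_a + 1.0 / y_b
        scores[f] = (log_odds_a - log_odds_b) / math.sqrt(variance)
    return scores

# Made-up counts of names in two decades.
names_1950s = {"Mary": 900, "Linda": 800, "Jennifer": 10}
names_1980s = {"Mary": 200, "Linda": 50, "Jennifer": 850}

scores = weighted_log_odds(names_1950s, names_1980s)
for name, z in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(name, round(z, 2))
```

Positive scores mark names that are more characteristic of the first group, negative scores the second. Dividing by the standard error keeps rare features with noisy ratios from dominating the ranking.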
The sharp eyed will see that the data set is defined by SQL. So, does that suffer from injection attacks? Short answer is no. If there was more than one result set within the Python code, it’s going to error out. So you’re protected there.
This is important, because the data set query can be defined with parameters. You can pass values to those parameters, heck, you’re likely to pass values to those parameters, from the external query or procedure. So, is that an attack vector?
Another factor is that you need to explicitly grant EXECUTE ANY EXTERNAL SCRIPT rights to non-sysadmin, non-db_owner users, meaning a non-privileged user can’t execute external scripts at all. You can also limit the executing service account
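The general reason parameters are a safer way to pass values can be sketched with Python's built-in sqlite3 module. This uses SQLite rather than SQL Server, and the table and values are made up, so treat it as an illustration of parameter binding in general rather than of external script execution specifically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, income INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Ann", 50000), ("Bob", 60000)],
)

# A hostile value that would add a second statement if spliced into the SQL text.
hostile = "Ann'; DROP TABLE people; --"

# Bound as a parameter, the value is only ever compared against the name column.
rows = conn.execute(
    "SELECT name FROM people WHERE name = ?", (hostile,)
).fetchall()

# The table is still there, with both rows intact.
remaining = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
print(rows, remaining)
```

The hostile string matches no rows and the table survives, because the value never becomes part of the query text.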
intercept <- 3
betas <- c(weight = 2, height = 4)
Our goal is to build a linear regression model that has the above coefficients. The way we are going to do this is by building our own synthetic data set such that the regression fit through this data set yields these coefficients.
It’s fairly straightforward to do this for linear models; as things get more complicated, however, the difficulty level spikes.
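One simple way to get a data set whose fit returns those coefficients is to draw arbitrary predictor values and construct the outcome from the target coefficients with no noise. The sketch below does that and checks the result with a small least-squares solver. The noise-free construction, the value ranges, and the solve helper are assumptions for illustration; the linked post may take a different route.

```python
import random

random.seed(0)

intercept = 3.0
betas = {"weight": 2.0, "height": 4.0}

# Draw arbitrary predictor values, then build the outcome from the target coefficients.
rows = [(random.uniform(50, 100), random.uniform(150, 200)) for _ in range(20)]
y = [intercept + betas["weight"] * w + betas["height"] * h for w, h in rows]

def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

# Least squares via the normal equations (X'X) beta = X'y, with a column of ones.
design = [(1.0, w, h) for w, h in rows]
xtx = [[sum(r[i] * r[j] for r in design) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(3)]
fitted = solve(xtx, xty)
print([round(v, 6) for v in fitted])
```

With no noise the fit recovers the intercept and both slopes up to floating-point error. Adding noise while still hitting the coefficients exactly takes more care, which is where the difficulty starts to climb.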
The post references this guide to the machine learning services in Azure, along with their supported languages. Services that currently support R include Azure Machine Learning Studio, SQL Server Microsoft Machine Learning Service, Microsoft Machine Learning Server, Azure Data Science Virtual Machine, Azure Databricks, and more.
The R and Python programming languages are primary citizens for data science on the Azure AI Platform. These are the most common languages for performing data preparation, transformation, training and operationalization of machine learning models; the core components for one’s digital transformation leveraging AI. Yet they are fundamentally different in many aspects, directly affecting not only deployed solutions’ IT architectures but also corporate strategies for developer skills and product supportability.
This series of articles is designed to help you understand the options your company and customers have to support and evolve their R strategy.
It’s good to see some of this out in the open for planning purposes.
The “by” signals we are doing a per-group calculation, and the “:=” signals to land the results in the original data.table. This sort of window function is incredibly useful in computing things such as what fraction of a group’s mass is in each row.
It’s worth reading up on data.table if you aren’t familiar with the great things it can do.
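The same per-group window calculation can be sketched in plain Python for readers who want to see the two steps spelled out. The rows are made up, and unlike ":=" this builds a new list rather than modifying the original in place.

```python
from collections import defaultdict

# Made-up rows: (group, value).
rows = [("a", 1.0), ("a", 3.0), ("b", 2.0), ("b", 2.0), ("b", 4.0)]

# Pass 1: per-group totals (the "by" step).
totals = defaultdict(float)
for group, value in rows:
    totals[group] += value

# Pass 2: attach each row's share of its group total to the row
# (roughly the ":=" step, though this creates a new list).
with_fraction = [(group, value, value / totals[group]) for group, value in rows]

for row in with_fraction:
    print(row)
```

Every original row is kept, and the fractions within each group sum to one.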
Rowwise operations are quite frequent in data analysis. The R language environment is particularly strong in column wise operations. This is due to technical reasons, as data frames are internally built as column-by-column structures, hence column wise operations are simple, rowwise more difficult.
This post looks at a rather general way to compute rowwise statistics. Of course, numerous ways exist and there are quite a few tutorials around, notably by Jenny Bryan, and by Emil Hvitfeldt to name a few.
The ideal solution is to have your data be properly columnar, but if you’re in a pinch, it’s good to know that you can do this.
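The asymmetry between column wise and rowwise work shows up in any column-oriented layout. Here is a small sketch in Python with made-up values, where a dict of lists stands in for a data frame.

```python
from statistics import mean

# Column-oriented storage, similar in spirit to a data frame: one list per column.
columns = {
    "q1": [1, 4, 2],
    "q2": [3, 5, 2],
    "q3": [5, 3, 2],
}

# Column wise statistic: one pass over each stored list.
column_means = {name: mean(values) for name, values in columns.items()}

# Rowwise statistic: rows have to be reassembled first; zip(*...) does that.
row_means = [mean(row) for row in zip(*columns.values())]

print(column_means)
print(row_means)
```

The column means read the data as stored, while the row means need the extra step of stitching one value from each column back into a row.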
A recent question on Stack Overflow [r] asked why a random forest model was not working as expected. The questioner was working with data from an experiment in which yeast was grown under conditions where (a) the growth rate could be controlled and (b) one of 6 nutrients was limited. Their dataset consisted of 6 rows – one per nutrient – and several thousand columns, with values representing the activity (expression) of yeast genes. Could the expression values be used to predict the limiting nutrient?
The random forest was not working as expected: not one of the nutrients was correctly classified. I pointed out that with only one case for each outcome, this was to be expected – as the random forest algorithm samples a proportion of the rows, no correct predictions are likely in this case. As sometimes happens, the question was promptly deleted, which was unfortunate, as we could have further explored the problem.
Neil decided to explore the problem further regardless and came to some interesting conclusions.
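The sampling argument can be checked with a short simulation. It assumes each tree trains on a bootstrap sample of the rows and is evaluated on the rows it did not see (out-of-bag evaluation); the nutrient labels are placeholders, not the names from the original data.

```python
import random

random.seed(1)

# One row per class, as in the question: six nutrients, six rows.
labels = [f"nutrient_{i}" for i in range(1, 7)]

trees = 500
oob_rows_seen = 0
correct_possible = 0
for _ in range(trees):
    # Each tree trains on a bootstrap sample of row indices (drawn with replacement).
    in_bag = {random.randrange(len(labels)) for _ in labels}
    in_bag_labels = {labels[i] for i in in_bag}
    # A tree can only predict labels it saw during training.
    for i in range(len(labels)):
        if i not in in_bag:
            oob_rows_seen += 1
            if labels[i] in in_bag_labels:
                correct_possible += 1

print(oob_rows_seen, correct_possible)
```

Because every class appears in exactly one row, a row that is left out of a tree's sample takes its class with it, so no tree can ever predict an out-of-bag row correctly.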