The Power Of The Stacked Ensemble

Funda Gunes describes the value of ensemble models in data science competitions:

A simple way to enhance diversity is to train models by using different machine learning algorithms. For example, adding a factorization model to a set of tree-based models (such as random forest and gradient boosting) provides a nice diversity because a factorization model is trained very differently than decision tree models are trained. For the same machine learning algorithm, you can enhance diversity by using different hyperparameter settings and subsets of variables. If you have many features, one efficient method is to choose subsets of the variables by simple random sampling. Choosing subsets of variables could be done in more principled fashion that is based on some computed measure of importance which introduces the large and difficult problem of feature selection.

In addition to using various machine learning training algorithms and hyperparameter settings, the KDD Cup solution shown above uses seven different feature sets (F1-F7) to further enhance the diversity.  Another simple way to create diversity is to generate various versions of the training data. This can be done by bagging and cross validation.

I think there’s a pretty strong contrast between competitions and general practice, where you’re doing everything you can to eek out a higher prediction score in the competition, but in practice, you’re aiming to balance a “good enough” prediction with hardware/time requirements and code complexity, and so the model selection process can be quite different.

Related Posts

Bayesian Average

Jelte Hoekstra has a fun post applying the Bayesian average to board game ratings: Maybe you want to explore the best boardgames but instead you find the top 100 filled with 10/10 scores. Experience many such false positives and you will lose faith in the rating system. Let’s be clear this isn’t exactly incidental either: […]

Read More

Calculating Relative Risk In T-SQL

Mala Mahadevan explains how to calculate relative risk using T-SQL: In this post we will explore a common statistical term – Relative Risk, otherwise called Risk Factor. Relative Risk is a term that is important to understand when you are doing comparative studies of two groups that are different in some specific way. The most […]

Read More