The Power Of The Stacked Ensemble

Funda Gunes describes the value of ensemble models in data science competitions:

A simple way to enhance diversity is to train models by using different machine learning algorithms. For example, adding a factorization model to a set of tree-based models (such as random forest and gradient boosting) provides a nice diversity because a factorization model is trained very differently than decision tree models are trained. For the same machine learning algorithm, you can enhance diversity by using different hyperparameter settings and subsets of variables. If you have many features, one efficient method is to choose subsets of the variables by simple random sampling. Choosing subsets of variables could be done in more principled fashion that is based on some computed measure of importance which introduces the large and difficult problem of feature selection.

In addition to using various machine learning training algorithms and hyperparameter settings, the KDD Cup solution shown above uses seven different feature sets (F1-F7) to further enhance the diversity.  Another simple way to create diversity is to generate various versions of the training data. This can be done by bagging and cross validation.

I think there’s a pretty strong contrast between competitions and general practice, where you’re doing everything you can to eek out a higher prediction score in the competition, but in practice, you’re aiming to balance a “good enough” prediction with hardware/time requirements and code complexity, and so the model selection process can be quite different.

Related Posts

Data Science And Data Engineering In HDP 3.0

Saumitra Buragohain, et al, show off some of the things added to the Hortonworks Data Platform for data scientists and data engineers: We leverage the power of HDP 3.0 from efficient storage (erasure coding), GPU pooling to containerized TensorFlow and Zeppelin to enable this use case. We will the save the details for a different […]

Read More

Multi-Threaded R With Microsoft R Client

David Parr shows us how to get started with Microsoft R Client and performs some quick benchmarking: This message will pop up, and it’s worth noting as it’s got some information in it that you might need to think about: It’s worth noting that right now Microsoft r Client is lagging behind the current R version, and […]

Read More