Unsupervised Decision Trees

William Vorhies describes what unsupervised decision trees are:

In anomaly detection we are attempting to identify items or events that don’t match the expected pattern in the data set and are by definition rare.  The traditional ‘signature based’ approach widely used in intrusion detection systems creates training data that can be used in normal supervised techniques.  When an attack is detected the associated traffic pattern is recorded and marked and classified as an intrusion by humans.  That data then combined with normal data creates the supervised training set.

In both supervised and unsupervised cases decision trees, now in the form of random forests are the weapon of choice.  Decision trees are nonparametric; they don’t make an assumption about the distribution of the data.  They’re great at combining numeric and categoricals, and handle missing data like a champ.  All types of anomaly data tend to be highly dimensional and decision trees can take it all in and offer a reasonably clear guide for pruning back to just what’s important.

To be complete, there is also category of Semi-Supervised anomaly detection in which the training data consists only of normal transactions without any anomalies.  This is also known as ‘One Class Classification’ and uses one class SVMs or autoencoders in a slightly different way not discussed here.

Interesting reading.  I’d had no idea that unsupervised decision trees were even a thing.

Related Posts

Housing Prices In Ames, Iowa: A Kaggle Competition

Kathryn Bryant and M. Aaron Owen share their Kaggle experiences.  First, Kathryn, et al: The lifecycle of our project was a typical one. We started with data cleaning and basic exploratory data analysis, then proceeded to feature engineering, individual model training, and ensembling/stacking. Of course, the process in practice was not quite so linear and […]

Read More

Picking A Python IDE

Kevin Jacobs reviews a few Python IDEs from the perspective of a data scientist: Ladies and gentlemens, this is one of the most perfect IDEs for editing your Python code! At least in my opinion. Jupyter notebook is a web based code editor and can quickly generate visualizations. You can mix up code and text […]

Read More


October 2017
« Sep Nov »