Press "Enter" to skip to content

Using Random Cut Forests for Anomaly Detection

Chris Swierczewski and Lai Jiang have an example of using Random Cut Forests to perform anomaly detection against a dataset stored in Amazon Elasticsearch Service:

Based on these constraints and performance results from internal and publicly available benchmarks across many data domains, we chose the RCF algorithm for computing anomaly scores in data streams.

But this begs the question: How large of an anomaly score is large enough to declare the corresponding data point as an anomaly? The anomaly detector uses a thresholding model to answer this question. This thresholding model combines information from the anomaly scores observed thus far and certain mathematical properties of RCFs. This hybrid information approach allows the model to make anomaly predictions with a low false positive rate when relatively little data has been observed, and effectively adapts to the data in the long run. The model constructs an efficient sketch of the anomaly score distribution using the KLL Quantile Sketch algorithm. For more information, seeĀ Optimal Quantile Approximation in Streams.

The linked post is more of an explanation of process than a tutorial, but it’s interesting in seeing how different approaches can find anomalies at different rates.