Using K-Means Clustering For Anomaly Detection

Anais Dotis-Georgiou gives us an interesting use case of using k-means clustering along with InfluxDB (a time-series database) to detect anomalies in EKG data:

If you read Part Two, then you know these are the steps I used for anomaly detection with K-means:

  1. Segmentation – the process of splitting your time series data into small segments with a horizontal translation.

  2. Windowing – the action of multiplying your segmented data by a windowing function to truncate the dataset before and after the window. The term windowing gets its name from its functionality: it allows you to only see the data in the window range since everything before and after (or outside the window) is multiplied by zero. Windowing allows you to seamlessly stitch your reconstructed data together.

  3. Clustering – the task of grouping similar windowed segments and finding the centroids in the clusters. A centroid is at the center of a cluster. Mathematically, it is defined by the arithmetic mean position of all the points in the cluster.

  4. Reconstruction – the process of rebuilding your time series data. Essentially, you are matching your normal time series data to the closest centroid (the predicted centroid) and stitching those centroids together to produce the reconstructed data.

  5. Normal Error – The purpose of the Reconstruction is to calculate the normal error associated with the output of your time series prediction.

  6. Anomaly Detection – Since you know what the normal error for reconstruction is, you can now use it as a threshold for anomaly detection. Any reconstruction error above that normal error can be considered an anomaly.

Read the whole thing.  This is a really cool use case of a set of technologies along with a venerable (if sometimes troublesome) algorithm.

Related Posts

Accessing Azure Event Hubs with Python

Neil Gelder shows us how you can write Python code to work with Azure Event Hubs: I’ve supplied these two python scripts in my github repo at the following link. First we need to open the install the relevant python libraries so you’ll need to issue the below pip command in whatever command tool you use, […]

Read More

The Costs of Specialization within Data Science

Eric Colson argues in favor of data science generalists rather than specialists: But the goal of data science is not to execute. Rather, the goal is to learn and develop profound new business capabilities. Algorithmic products and services like recommendations systems, client engagement bandits, style preference classification, size matching, fashion design systems, logistics optimizers, seasonal trend detection, and more can’t be […]

Read More


November 2018
« Oct Dec »