Press "Enter" to skip to content

The Basics of k-Means Clustering

Nathaniel Schmucker explains some of the principles of k-means clustering:

k-Means is easy to implement. In R, you can use the function kmeans() to quickly deploy an efficient k-Means algorithm. On datasets of reasonable size (thousands of rows), the kmeans function runs in fractions of a second.

k-Means is easy to interpret (in 2 dimensions). If you have two features of your k-Means analysis (e.g., you are grouping by length and width), the result of the k-Means algorithm can be plotted on an xy-coordinate system to show the extent of each cluster. It’s easy to visually inspect the assignment to see if the k-Means analysis returned a meaningful insight. In more dimensions (e.g., length, width, and height) you will need to either create a 3D plot, summarize your features in a table, or find another alternative to describing your analysis. This loses the intuitive power that a 2D k-Means analysis has in convincing you or your audience that your analysis should be trusted. It’s not to say that your analysis is wrong; it simply takes more mental focus to understand what your analysis says.

The k-Means analysis, however, is not always the best choice. k-Means does well on data that naturally falls into spherical clusters. If your data has a different shape (linear, spiral, etc.), k-Means will force clustering into circles, which can result in outputs that defy human expectations. The algorithm is not wrong; we have fed the algorithm data it was never intended to understand.

There’s a lot of depth in this article which makes it really interesting.