Press "Enter" to skip to content

Text Clustering with Python

Luke Menzies takes us through the gensim library:

An interesting branch of machine learning is Natural Language Processing (NLP). As the name suggests, it involves training machines to detect patterns in language using algorithms. It is quite often the case that NLP is referred to as text analytics. It is actually more impressive than that. It examines vectorised patterns which not only looks at the positioning of elements but what it means in context to neighbouring elements within the vector. In a nutshell, this technique can be extended beyond text to patterns of linguistics in general and even contextual patterns. Nevertheless, its primary use in the machine learning world is to analyse text.

This article will focus on an interesting application of NLP which involves the clustering of text. Clustering is a popular unsupervised machine learning technique used for segmentation or grouping of data. It is a very powerful tool that is used across a variety of industries. However, it is rare you hear of applying clustering to text. This can be achieved using NLP functions, combined with clustering algorithms that can handle non-Euclidian distances.

Read on for an overview of the process and an example of combining DBSCAN with word2vec to cluster phrases.