Clustering Text via Embeddings and HDBSCAN

Ivan Palomares Carrascosa groups things together:

In this article, you will learn how to build a text clustering pipeline by combining large language model embeddings with HDBSCAN, a density-based clustering algorithm, to automatically discover topics in unlabeled text data.

Topics we will cover include:

How to generate text embeddings for raw documents using a pre-trained sentence-transformers model.

How to reduce the dimensionality of those embeddings with UMAP to prepare them for clustering.

How to apply HDBSCAN to automatically discover topic clusters and visualize the results.

This is a pretty neat trick: it takes advantage of the embedding model’s ability to convert raw text into hundreds (or thousands) of floating-point numbers while preserving enough of the context to tell ideas apart. Much of it is the original word2vec concept, scaled up.
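To make the "differentiate ideas" part concrete, here is a toy sketch of how closeness between embedding vectors is typically measured with cosine similarity. The three 4-dimensional vectors are hand-written and purely hypothetical; real embeddings come from a model and have far more dimensions.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means same direction."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical "embeddings", invented for illustration only.
cat = [0.9, 0.8, 0.1, 0.0]
kitten = [0.85, 0.75, 0.2, 0.05]
invoice = [0.0, 0.1, 0.9, 0.8]

# Related ideas point in similar directions; unrelated ones do not.
related = cosine_similarity(cat, kitten)
unrelated = cosine_similarity(cat, invoice)
print(related > unrelated)
```

A density-based algorithm like HDBSCAN relies on exactly this property: texts about similar things land near each other, so topics show up as dense regions in the vector space.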
