John Kerski searches for similar sets:
I’ll admit upfront—I am not a data scientist by trade. Instead, I’ve picked up my data science skills over time, learning through a combination of osmosis from talented colleagues and tackling real-world data challenges. It’s been a journey of trial, error, and refinement, as I’ve worked to bridge gaps between complex data science techniques and tools available to me.
Recently, my skills were put to the test when I needed to compare hundreds of Active Directory and SharePoint Groups to find similarities in their memberships. With only Power Query available in the production environment, and no Python or R to ease the process, I faced the task of building a method for finding similarities from scratch in Power Query. In this guide, I’ll walk you through the solution I developed, highlighting the steps that made it possible.
John came up with a very clever solution. As an aside, here is how I like to explain cosine similarity (as a concept, not the algorithm itself).
Back in high school physics, you probably drew vectors and learned that vectors have a direction and a magnitude (length). We drew vectors in two-dimensional space because that’s easy: it’s a line on a sheet of paper and there’s an arrow at the end to denote the direction of that vector. Conceptually, vectors with more than two dimensions behave exactly the same; the difference is that we cannot simply draw them, especially once we get past three-dimensional space (a vector with three elements). But the concept is still there: every vector has a direction and a magnitude.
We use cosine similarity to compare two vectors and see how close they are in terms of angle (direction), the idea being that magnitude isn’t as important as angle for determining vector similarity. This is in contrast to a technique like Euclidean distance, which measures the straight-line distance between the ends of the two vectors and is therefore sensitive to magnitude as well as angle.
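To make that contrast concrete, here is a minimal Python sketch. It is not John's Power Query solution; the function names and sample vectors are made up for illustration. Cosine similarity is the dot product of two vectors divided by the product of their lengths, so scaling a vector does not change the result, whereas Euclidean distance grows as the vectors' lengths diverge.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The angle is undefined when either vector has no length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between the ends of two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical sample vectors, chosen only to illustrate the point.
base = [1.0, 2.0, 3.0]
scaled = [10.0, 20.0, 30.0]  # same direction as base, ten times the magnitude
other = [3.0, 2.0, 1.0]      # same magnitude as base, different direction

same_direction = cosine_similarity(base, scaled)
different_direction = cosine_similarity(base, other)
distance_to_scaled = euclidean_distance(base, scaled)
distance_to_other = euclidean_distance(base, other)

print(f"cosine(base, scaled)    = {same_direction:.3f}")
print(f"cosine(base, other)     = {different_direction:.3f}")
print(f"euclidean(base, scaled) = {distance_to_scaled:.3f}")
print(f"euclidean(base, other)  = {distance_to_other:.3f}")
```

In this sketch, cosine similarity treats `base` and `scaled` as essentially identical because they point the same way, while Euclidean distance considers `base` much closer to `other`, since those two vectors have the same length.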