Bhala Ranganathan talks about a powerful algorithm:
Cardinality is the number of distinct items in a dataset. Whether it’s counting the number of unique users on a website or estimating the number of distinct search queries, estimating cardinality becomes challenging when dealing with massive datasets. That’s where the HyperLogLog algorithm comes into the picture. In this article, we will explore the key concepts behind HyperLogLog and its applications.
HyperLogLog is the algorithm that SQL Server users in the APPROX_COUNT_DISTINCT()
function to make it so much faster than a regular COUNT(DISTINCT)
while still providing correctness guarantees within a fixed percentage error: they guarantee a 2% or lower error rate with a 97% probability.