Press "Enter" to skip to content

Day: January 27, 2025

An Overview of HyperLogLog

Bhala Ranganathan talks about a powerful algorithm:

Cardinality is the number of distinct items in a dataset. Whether it’s counting the number of unique users on a website or estimating the number of distinct search queries, estimating cardinality becomes challenging when dealing with massive datasets. That’s where the HyperLogLog algorithm comes into the picture. In this article, we will explore the key concepts behind HyperLogLog and its applications.

HyperLogLog is the algorithm that SQL Server users in the APPROX_COUNT_DISTINCT() function to make it so much faster than a regular COUNT(DISTINCT) while still providing correctness guarantees within a fixed percentage error: they guarantee a 2% or lower error rate with a 97% probability.

Leave a Comment

Partitioned Tables and Indexes in PostgreSQL

Hettie Dombrovskaya runs into an error:

Here is a story. When anyone gives a talk about partitions, they always bring up an example of archiving: there is a partitioned table, and you only keep “current” partitions, whatever “current” means in that context, and after two weeks or a month, or whatever interval works for you, you detach the oldest partition from the “current” table and attach it to the “archived” table, so that the data is still available when you need it, but it does not slow down your “current” queries.

So here is Hettie confidently suggesting that a customer implement this technique to avoid querying a terabyte-plus-size table. A customer happily agrees, and life is great until one day, an archiving job reports an error of a “name already exists” for an index name.

Read on to learn why.

Leave a Comment

MDX vs DAX for Bulk Data Extraction from Power BI

Chris Webb performs a test but gives us a warning first:

This is a post I’ve avoided writing for many years, and before I carry on let me make one thing clear:

Doing bulk extracts of data from a Power BI semantic model is a **really** bad idea

My colleague Matthew Roche wrote a great post on this topic a couple of years ago that is still relevant: using Power BI (or Analysis Services) as a data source for other systems, including other Power BI Import mode semantic models, is an anti-pattern. Power BI is optimised for small, analytical queries that return the amount of data that can be visualised on a single page. It is not optimised for queries that return millions of rows.

After the warning, Chris still gives a performance breakdown for extracting data from a semantic model in Excel, using automated MDX and DAX.

Leave a Comment

SQL Server and File Access

Vlad Drumea troubleshoots an issue:

This is one of those things that’s obvious for anyone with a bit of SQL Server and Windows experience, but stumps a lot of newcomers.
Especially when it comes to students going through tasks like creating and restoring databases.

It’s also one of those things that I have to explain on a regular basis, so might as well have detailed explanation with examples I can direct people to in the future.

The latest example being this reddit post in r/SQL, where the wrongest answer has the highest upvotes.

Read on for a story about Operating system error 5 (Access is denied). And if I hear anybody switching the virtual service account running SQL Server to Local System, I will personally tell you it’s an awful idea and to stop it.

Leave a Comment

Microsoft Fabric for the SMB

Eugene Meidinger looks out for the smaller fish in the pond:

If you are a small (or even medium) business, you may be wondering “What is Fabric and do we even need it?” If you are primarily on Power BI Pro licenses today, you may not find a compelling reason to switch to Fabric today, but the value add should improve over time as new features are added on the Fabric side and some features get deprecated on the Power BI side.

Read on for plenty of advice, metaphors, and even a few warnings.

Leave a Comment