Curated SQL Posts

An Overview of HyperLogLog

Bhala Ranganathan talks about a powerful algorithm:

Cardinality is the number of distinct items in a dataset. Whether it’s counting the number of unique users on a website or estimating the number of distinct search queries, estimating cardinality becomes challenging when dealing with massive datasets. That’s where the HyperLogLog algorithm comes into the picture. In this article, we will explore the key concepts behind HyperLogLog and its applications.

HyperLogLog is the algorithm that SQL Server uses in the APPROX_COUNT_DISTINCT() function to make it so much faster than a regular COUNT(DISTINCT) while still providing accuracy guarantees within a fixed percentage error: the documentation guarantees an error rate of 2% or lower with 97% probability.
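
To make the idea concrete, here is a minimal Python sketch of a HyperLogLog-style estimator. It is not SQL Server's internal implementation: the class name, the choice of SHA-1 as the hash, and the default of 2^12 registers are all assumptions made for illustration. The sketch hashes each item, uses the top bits to pick a register, and stores the longest run of leading zeros seen in the remaining bits; the estimate comes from the harmonic mean of the registers, with a linear-counting fallback for small sets.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog estimator (illustrative only; names and defaults are assumptions)."""

    def __init__(self, p: int = 12) -> None:
        self.p = p                      # number of hash bits used to pick a register
        self.m = 1 << p                 # number of registers (4096 when p = 12)
        self.registers = [0] * self.m
        # Bias-correction constant from the HyperLogLog paper, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        # Build a 64-bit hash from the first 8 bytes of SHA-1 (hash choice is an assumption).
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                    # top p bits select the register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit within the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self) -> float:
        # Raw estimate: alpha * m^2 divided by the sum of 2^-register.
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        # For small cardinalities, fall back to linear counting on empty registers.
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(100_000):
        hll.add(f"user-{i}")
    # Prints an approximation of 100000 using only 4096 small registers.
    print(round(hll.estimate()))
```

With 4,096 registers the expected standard error is roughly 1.04 / sqrt(4096), or about 1.6%, which is the kind of trade-off that lets an approximate count run in a small, fixed amount of memory.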

Comments closed

Partitioned Tables and Indexes in PostgreSQL

Hettie Dombrovskaya runs into an error:

Here is a story. When anyone gives a talk about partitions, they always bring up an example of archiving: there is a partitioned table, and you only keep “current” partitions, whatever “current” means in that context, and after two weeks or a month, or whatever interval works for you, you detach the oldest partition from the “current” table and attach it to the “archived” table, so that the data is still available when you need it, but it does not slow down your “current” queries.

So here is Hettie confidently suggesting that a customer implement this technique to avoid querying a terabyte-plus-size table. A customer happily agrees, and life is great until one day, an archiving job reports an error of a “name already exists” for an index name.

Read on to learn why.

Comments closed

MDX vs DAX for Bulk Data Extraction from Power BI

Chris Webb performs a test but gives us a warning first:

This is a post I’ve avoided writing for many years, and before I carry on let me make one thing clear:

Doing bulk extracts of data from a Power BI semantic model is a **really** bad idea

My colleague Matthew Roche wrote a great post on this topic a couple of years ago that is still relevant: using Power BI (or Analysis Services) as a data source for other systems, including other Power BI Import mode semantic models, is an anti-pattern. Power BI is optimised for small, analytical queries that return the amount of data that can be visualised on a single page. It is not optimised for queries that return millions of rows.

After the warning, Chris still gives a performance breakdown for extracting data from a semantic model in Excel, using automated MDX and DAX.

Comments closed

SQL Server and File Access

Vlad Drumea troubleshoots an issue:

This is one of those things that’s obvious for anyone with a bit of SQL Server and Windows experience, but stumps a lot of newcomers.
Especially when it comes to students going through tasks like creating and restoring databases.

It’s also one of those things that I have to explain on a regular basis, so might as well have detailed explanation with examples I can direct people to in the future.

The latest example being this reddit post in r/SQL, where the wrongest answer has the highest upvotes.

Read on for a story about Operating system error 5 (Access is denied). And if I hear of anybody switching the virtual service account running SQL Server to Local System, I will personally tell them it’s an awful idea and to stop it.

Comments closed

Microsoft Fabric for the SMB

Eugene Meidinger looks out for the smaller fish in the pond:

If you are a small (or even medium) business, you may be wondering “What is Fabric and do we even need it?” If you are primarily on Power BI Pro licenses today, you may not find a compelling reason to switch to Fabric today, but the value add should improve over time as new features are added on the Fabric side and some features get deprecated on the Power BI side.

Read on for plenty of advice, metaphors, and even a few warnings.

Comments closed

Building the Scaffolding of a PowerShell Module

Adam Bertram has a series on creating PowerShell modules:

In this guide, we’ll create a PowerShell module for gathering computer hardware information. This module will help system administrators collect and report on memory, storage, and processor details across multiple systems.

Our module will feature:

  • Functions to gather specific hardware information
  • Remote system support using PowerShell sessions
  • Standardized output format for consistent reporting

Click through for the article.

Comments closed

The JSON Data Type in Azure SQL DB

Koen Verbeeck takes a peek at what’s coming in SQL Server 2025:

We have data coming into our database from a REST API endpoint. The data is formatted as JSON documents. Is there an efficient way to handle JSON data within the SQL Server ecosystem? In this article, we look at the new JSON data types for SQL Server.

The end result is not particularly clear-cut here: reads are slightly faster, but writes are considerably slower. There aren’t any indexes on the data type at this time, so there is no opportunity for improvement there.

Comments closed

Using PolyBase for Archiving Data

Andy Yun is speaking my language:

One of SQL Server 2022’s new features is something called Data Virtualization. It enables T-SQL to directly query files that reside in Azure object storage or S3-compatible object storage. In my opinion, since SQL Server 2022’s release, it’s one of those underrated capabilities that I think many have glossed over. But I strongly believe that it is insanely useful and you should take a few minutes to learn more!

Read on to learn more. Also, Andy mentions using PolyBase with S3-compatible local storage. As a spoiler, I have a video coming out on January 28th that covers exactly that same topic, though without the benefit of snappy all-flash storage arrays.

1 Comment