Press "Enter" to skip to content

Curated SQL Posts

Dataflows Gen2 Tips and Tricks

Jon Vöge provides advice on the least beloved ELT process:

Dataflows Gen2 are frequently (and often rightfully so) bashed for their performance inefficiencies. Especially in comparison with other ingestion and transformation tools in Fabric (Notebooks, Pipelines, Copy Jobs, SPROCs).

The fact remains, however, that in the hands of a self-service developer, they are an incredibly powerful tool – if you can spare the compute on your capacity.

In this article, I will highlight tips and tricks to make the most of working with Dataflow Gen2 in Fabric. The list is by no means exhaustive, but simply consists of a bunch of tips which I found useful in the past year, including new and overlooked features, as well as old best practices:

Read on for some things that are new to Dataflows Gen2, working with SharePoint, and making data loads not quite as slow.


Statistics on Partitioned Tables in PostgreSQL

Laurenz Albe gathers stats:

I recently helped a customer with a slow query. Eventually, an ANALYZE on a partitioned table was enough to fix the problem. This came as a surprise for the customer, since autovacuum was enabled. So I decided to write an article on how PostgreSQL collects partitioned table statistics and how they affect PostgreSQL’s estimates.

Read on to see how it works and how you can generate statistics at the table level and not just the partition level.
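
The core of the issue is that autovacuum analyzes the individual partitions but never the partitioned parent table itself, so parent-level statistics only exist if you run ANALYZE on the parent explicitly. A minimal sketch of that (table names here are made up for illustration):

```sql
-- Autovacuum will analyze the leaf partitions, but not the parent.
CREATE TABLE measurements (
    reading_time timestamptz NOT NULL,
    reading      numeric
) PARTITION BY RANGE (reading_time);

CREATE TABLE measurements_2024 PARTITION OF measurements
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
CREATE TABLE measurements_2025 PARTITION OF measurements
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');

-- Collect table-level statistics on the parent (this also analyzes its partitions):
ANALYZE measurements;

-- Check when the parent and its partitions were last analyzed:
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_all_tables
WHERE relname LIKE 'measurements%';
```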


The Internals of a Hash Table

Hugo Kornelis digs deep:

In part 1 of this series, I laid the foundation to explore the structure of the hash table, as used by the Hash Match operator, by alleging and then proving that a Hash Match (Left Outer Join) returns unmatched rows from the build input in the order in which they are stored in the hash table. This means that we can create queries on carefully curated data to gain insight into the structure of that hash table.

It is now time to use that trick to actually start to explore the hash table. But not without also looking at available documentation and common sense.

Click through for a waltz down memory lane, a graphical interpretation of a hash table, and some tests to see if Hugo is correct.
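
If you want to try that sort of experiment yourself, here is a minimal sketch along the same lines (the table names, data, and hints are mine, not Hugo's): force a hash join for a left outer join, keep the preserved table as the build input, and look at the order the unmatched rows come back in, since no ORDER BY imposes one.

```sql
CREATE TABLE dbo.Preserved   (KeyCol int NOT NULL, Payload char(20) NOT NULL);
CREATE TABLE dbo.Unpreserved (KeyCol int NOT NULL, Payload char(20) NOT NULL);

INSERT INTO dbo.Preserved (KeyCol, Payload)
VALUES (1, 'a'), (17, 'b'), (33, 'c'), (2, 'd'), (18, 'e');
-- Leave dbo.Unpreserved empty so every build-side row is unmatched.

SELECT p.KeyCol, p.Payload
FROM dbo.Preserved AS p
LEFT OUTER JOIN dbo.Unpreserved AS u
    ON u.KeyCol = p.KeyCol
OPTION (HASH JOIN, FORCE ORDER);  -- FORCE ORDER should keep dbo.Preserved as the build
                                  -- input; verify in the actual plan that you really
                                  -- get a Hash Match (Left Outer Join)
```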


More Fun with Page Latches

Jared Poche continues a series on page latches:

In my previous blog, I set up a database with two tables, one with a large CHAR(8000) field and one with a smaller VARCHAR(100) field. Both tables use an INT IDENTITY column for their primary key. Since we’ll be inserting rows sequentially, we will see page latch contention when multiple threads attempt to insert.

We ran some initial tests with SQLQueryStress to create some page latch contention and resolved an odd problem causing connection delays.

We’ll use these two tables and test several different approaches to reduce page latch contention.

Jared shows the results for a variety of different tests and even has an embedded Excel spreadsheet, which is how you know he’s done his homework.
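
For reference, a rough sketch of the kind of setup described in the quote (the names here are my own, not necessarily Jared's): two tables with sequential INT IDENTITY keys, one wide and one narrow, plus a quick way to see the resulting last-page contention.

```sql
CREATE TABLE dbo.WideRows
(
    Id      int IDENTITY(1, 1) NOT NULL PRIMARY KEY CLUSTERED,
    Payload char(8000) NOT NULL DEFAULT ('x')    -- roughly one row per page
);

CREATE TABLE dbo.NarrowRows
(
    Id      int IDENTITY(1, 1) NOT NULL PRIMARY KEY CLUSTERED,
    Payload varchar(100) NOT NULL DEFAULT ('x')  -- many rows per page
);

-- After hammering these with concurrent INSERTs (e.g. from SQLQueryStress),
-- check for last-page insert contention:
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE N'PAGELATCH%'
ORDER BY wait_time_ms DESC;
```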


An Introduction to Bayesian Regression

Ivan Palomares Carrascosa covers the concept of Bayesian regression:

In this article, you will learn:

  • The fundamental difference between traditional regression, which uses single fixed values for its parameters, and Bayesian regression, which models them as probability distributions.
  • How this probabilistic approach allows the model to produce a full distribution of possible outcomes, thereby quantifying the uncertainty in its predictions.
  • How to implement a simple Bayesian regression model in Python with scikit-learn.

My understanding is that both Bayesian and traditional regression techniques get you to (roughly) the same place, but the Bayesian approach makes it harder to forget that the regression line you draw doesn’t actually exist and everything has uncertainty.
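
To put that in notation rather than code (a rough sketch of the idea, not the article's exact formulation): ordinary least squares hands back a single coefficient vector, while Bayesian regression keeps a distribution over the coefficients and, consequently, over the predictions.

```latex
% One fixed coefficient vector (ordinary least squares):
\hat{\beta} = \arg\min_{\beta} \lVert y - X\beta \rVert^2

% A distribution over coefficients (Bayesian regression):
p(\beta \mid X, y) \propto p(y \mid X, \beta)\, p(\beta)

% ...and therefore a distribution over predictions rather than a point estimate:
p(y_\ast \mid x_\ast, X, y) = \int p(y_\ast \mid x_\ast, \beta)\, p(\beta \mid X, y)\, d\beta
```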


The Downside of Sticking to the Legacy Cardinality Estimator

Stephen Planck recommends taking the plunge:

Cardinality estimation (CE) is how the optimizer predicts the number of rows that will flow through each operator in a plan. Those estimates drive cost, join choices, memory grants, and ultimately latency and resource usage. SQL Server has shipped multiple CE models over time. The pre-2014 model—commonly called the legacy CE—dates back to SQL Server 7.0. Starting in SQL Server 2014, Microsoft introduced a new CE and has continued refining it in later releases, including SQL Server 2022. Keeping the legacy CE turned on in SQL Server 2022 is usually the wrong long-term choice.

One thing to note is that the “new” cardinality estimator has been out for a decade. It’s not really that new anymore, and it’s not going anywhere. Yes, there are still trade-offs where some queries perform better on the legacy estimator, but if that is your reason for staying on it, what have you done over the past decade to address those queries and tune them to work with the new estimator? If the answer is “nothing,” the cardinality estimator isn’t the problem.
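
For what it’s worth, the usual middle ground is to keep the database on the current estimator and scope the legacy model to the handful of queries that genuinely regress. A sketch of what that looks like (the query itself is illustrative):

```sql
-- Database stays on the modern CE (this is the default; shown for clarity):
ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = OFF;

-- An individual problem query can still ask for the legacy model:
SELECT o.OrderID, o.CustomerID
FROM dbo.Orders AS o
WHERE o.OrderDate >= '20240101'
OPTION (USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));
```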


Fun with the Data API Builder

Jess Pomfret tries out the Data API Builder:

I’ve been hearing about the Data API Builder (dab) for a while now, but I hadn’t found a reason to play with it myself.

Well I recently found I had a SQL Server database that could use an API so I could interact with it from an Azure Function. I immediately thought about DAB and was excited to have a reason to test it out.

Let me tell you – this thing is pretty neat!

Jess has started a new series and the first post involves installing and trying out the service.


Error 845 Timeout in DBCC CHECKTABLE

Eitan Blumin troubleshoots an odd issue:

A customer reported that running DBCC CHECKTABLE on several different tables kept failing with the exact same error:

Msg 845, Sev 17: Time-out occurred while waiting for buffer latch type 4 for page (1:27527325), database ID 10.
Msg 1823, Sev 17: A database snapshot cannot be created because it failed to start.
Msg 7928, Sev 17: The database snapshot for online checks could not be created.

Read on to learn more about Eitan’s troubleshooting process, what the cause of the issue was, and the fixes (both the immediate and complete ones) needed to resolve the issue.
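
Errors 1823 and 7928 indicate that the hidden database snapshot which online checks depend on could not be created. Without spoiling Eitan’s actual root cause, one general way to sidestep the internal snapshot is to run the check with table locks, or against a snapshot you create yourself (names and paths below are illustrative):

```sql
-- Option 1: skip the internal snapshot and take table locks instead:
DBCC CHECKTABLE ('dbo.SomeTable') WITH TABLOCK, NO_INFOMSGS;

-- Option 2: create your own snapshot and run the check inside it:
CREATE DATABASE MyDb_Check ON
    (NAME = MyDb_Data, FILENAME = N'D:\Snapshots\MyDb_Check.ss')
AS SNAPSHOT OF MyDb;
GO
USE MyDb_Check;
GO
DBCC CHECKTABLE ('dbo.SomeTable') WITH NO_INFOMSGS;
```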


Worst-Case Testing for Direct Lake Semantic Models

Chris Webb updates a prior post:

Two years ago I wrote a detailed post on how to do performance testing for Direct Lake semantic models. In that post I talked about how important it is to run worst-case scenario tests to see how your model performs when there is no model data present in memory, and how it was possible to clear all the data held in memory by doing a full refresh of the semantic model. Recently, however, a long-awaited performance improvement for Direct Lake has been released which means a full semantic model refresh may no longer page all data out of memory – which is great, but which also makes running performance tests a bit more complicated.

Read on to learn more about the improvement, as well as how you can still run these worst-case performance tests.
