Press "Enter" to skip to content

Curated SQL Posts

Client-Local Parallelism in Kafka

Jack Vanlightly continues a series on Apache Kafka:

Broker-visible parallelism is simple to reason about: if each consumer processes records serially, we add more consumers to increase parallelism. But each consumer adds overhead to the brokers: broker-side protocol state, TCP connections, group membership, fetch state, and participation in the consumer or share group protocol. With long processing times and/or high throughput, the required number of parallel workers can easily exceed what is practical to model as broker-visible consumers.

That is where client-local parallelism becomes important. Instead of scaling by adding more consumers, each consumer application can poll records and process them concurrently inside the client. This allows a smaller number of Kafka consumers to drive a much larger amount of parallel work.

Read on to learn how client-local parallelism fits and how it performs compared to consumer groups.

Leave a Comment

Comparative Query Analysis and Planning

Christophe Pettus has a two-parter. The first post covers how a half-dozen data platform technologies handle cost-based analysis:

PostgreSQL has ANALYZE. You run it (or autovacuum runs it for you), it draws a sample of 300 × default_statistics_target rows, and it writes a row per column into pg_statistic: a null fraction, an n-distinct estimate, a most-common-values list, an equi-depth histogram, and a physical-vs-logical correlation. The planner reads those numbers, multiplies selectivities together, costs a handful of join strategies, and picks one. Three join algorithms are on the menu: nested loop, merge join, hash join.

That is the entire shape of the problem, and every cost-based optimizer ever shipped solves the same one. They differ in three places, and only three: where the numbers come from, how stale the numbers are allowed to get, and which plan shapes are even legal to choose between. The algorithms are the boring part. Everybody hash-joins. The interesting part is the bookkeeping.

Then there’s how each of the systems generates a query plan:

Statistics are the input. Planning is what the database does with them: it takes a declarative query, which describes what you want and says nothing about how, and turns it into an executable plan, which is nothing but how. There are two jobs inside that. First, rewrite the query into a logically equivalent but more tractable shape, which is where subquery flattening, predicate pushdown, and view merging live. Second, search the space of physical plans (join orders, join algorithms, access paths) for the cheapest one the cost model can find. The second job is the hard one, because the number of possible join orders for a query grows faster than anyone wants to contemplate, and every database in this article is, underneath, a strategy for not enumerating all of them.

Two questions separate the six systems here. How does each one tame that search space? And once it has an answer, how much will it let you argue with the result? Those sound like the same question. They are not, and the most useful thing this comparison does is pull them apart. A database can search brilliantly and refuse you any override at all (Snowflake), search crudely and hand you a fistful of hints anyway (MySQL until recently), or search hard and expose every lever ever machined (Oracle). Sophistication of the search and generosity of the control surface are independent axes. Knowing where a system sits on each tells you most of what its planner feels like to live with.

Slightly odd is that there’s a section of DB2 but not on SQL Server. But it is a good cross-comparison of several of the top relational database options.

Leave a Comment

Building a Type-2 Slowly Changing Dimension

Kristyna Ferris builds a dimension:

This is a blog that I am writing for future me and hopefully it’ll help a few of you save some time too! It’s not often that I get to build out a data warehouse from scratch, but when I do, I want to make sure I do it well with best practices in place. Because this is not something I do a lot of, I frequently forget lessons I’ve learned and have to go back and drop tables to recreate them in the best way before it’s too late. One table type that is vital to do right the first time is a Slowly Changing Dimension Type 2 (SCD2 for short).

Click through for an explanation, as well as example scripts for both SQL Server-adjacent products and the Microsoft Fabric warehouse.

Leave a Comment

Multiple Domain Controllers and Kerberos Double-Hops

Randy Knight sorts out a problem:

We recently had a customer where Windows logins to a SQL Server Availability Group started throwing “Cannot generate SSPI context” — but only sometimes, and only for some people. The same user, on the same workstation, would fail one minute and connect the next. Applications were fine. Nothing had changed on SQL Server.

It took the better part of two days and a lot of second-guessing to run down, and the culprit turned out to be something most of us never think about: which domain controller handed out the Kerberos ticket.

Click through for the answer. Surprisingly, it has nothing to do with SPNs for once.

Leave a Comment

Deprecated SQL Server Features since 2016

Jeff Iannucci has a list:

If you’re planning on upgrading from SQL Server 2016 to the current version of SQL Server 2025, it would be helpful to note some of the features from versions since 2016 that are no longer supported. By that I’m talking about those things classified as “deprecated” (no longer supported) or “discontinued” (removed from the product). Let’s take a look at these features and say our goodbyes, organized by versions.

There’s not a lot on here that would affect most companies. Though to one of Jeff’s points in the article, I actually have seen a company use lightweight pooling, mostly because they didn’t know what it did and saw the bit about how it could improve performance. I came in and flipped that switch off for them. That was the kind of feature that likely benefited one or two large customers 30 years ago or so and it just hung around in the product due to inertia.

Leave a Comment

Power BI’s Row-Level Security Bitmap Changes

Phil Seamark lays out the consequences of a change:

If you have ever watched a model with row-level security run beautifully for months and then quietly fall off a cliff, this one is for you. Nothing in the model changed. You did not touch the security rules. One table just grew past a line you could not see, and every query under that role started doing a great deal more work than it used to.

That line just moved, and it moved in your favour. Here is what it is and why it sits where it does.

Click through to learn more about how Power BI handles row-level security predicate matching.

Leave a Comment

Building a Type-6 Slowly Changing Dimension

Dinesh Asanka creates a dimenson:

In a data warehouse, one important concept is to retain historical data. This data is typically not available in operational systems. One approach in data warehouses is the use of Slowly Changing Dimensions (SCDs). What are the SCD options and are there any new approaches?

Click through for a quick depiction of Types 0 through 3, and then where 6 fits into the mix. I’m not 100% sure I’ve ever actually used a Type-6 slowly changing dimension in a production environment, though there are specific circumstances in which one could be quite useful.

Leave a Comment

Query Execution vs Query Processing

Louis Davidson disambiguates a pair of terms:

There have been a lot of posts on LinkedIn of late about the “logical execution order” of a query that all really miss some really big points. I was corrected myself in some terminology because I mistook the term “processing” to mean the same as “order” in these discussions when I was explaining why logical execution order is not what people expected.

Click through for a good explanation, as well as a plan to have your company pay for your Disney World trip.

I also appreciate how Louis still calls it SQL Sentry Plan Explorer because that’s what it always will be in my heart.

Leave a Comment

On-Prem Is Still On-Prem

Andy Brownsword reminds us that it’s still the same animal:

Cloud solutions have sliders which magically reduce downtime, performance issues, and account credit. When you build it, infrastructure and platforms are your challenge, and you get one shot to right-size it. Capacity, resilience, capabilities. They’re your challenges to solve. There is no cake slider.

It’s the capabilities I want to consider here. The capabilities we have available to us day-to-day. Potentially two opposing sides:

To summarize things, everything is awful, just as it always has been. Granted, that may not be how Andy would summarize things, but that’s why I get those big Curated SQL bucks.

Leave a Comment

Re-Migration and Data Engineering

Andy Leonard shares some thoughts:

Right-sizing didn’t always work out the way some clients were led to believe it would.

In nearly every instance, the right-sizing argument was presented (sold) as the solution to over-provisioning, or purchasing hardware to serve peak loads. The classic example was a US income tax service that needed more and faster compute available to meet increasing demand starting in late January and peaking in mid-April each calendar year. After mid-April, hardware that was beefy enough to handle that peak load sat mostly idle for the next 9 months.

I don’t think I’ve ever worked for a company where this scenario really made sense. Even in the e-commerce company where a sizable fraction of our total annual revenues happened over a 5-day period, the load was still significant enough the rest of the year that we made good use of our on-premises SQL Server hardware.

Leave a Comment