Press "Enter" to skip to content

Curated SQL Posts

Solving Kusto Detective Academy Case 4

Tom Zika wears a trench coat and works out of a run-down office in a mess of a town, all in black and white:

Someone hacked into Digitown’s municipality, stole classified documents, and vanished. All we have is 30 days of router logs and a lookup table. Time to find a needle in 45 million rows of hay.

I really like the work the Kusto team has put into Kusto Detective Academy. Tom’s blog post is a spoiler if you want to answer it yourself, of course.

Leave a Comment

Microsoft.Data.SqlClient Connection error in SSIS 2025

Andy Leonard troubleshoots (with assistance from Claude) a tricky error:

Recently, I was developing an SSIS package that read from an Azure SQL DB table and wrote to an on-premises SQL Server 2025 database table. I encountered the following error:

[ADO NET Source [41]] Error: ADO NET Source has failed to acquire the connection {<ConnectionManagerGUID>} with the following error message: “Could not load file or assembly ‘Microsoft.Data.SqlClient, Version=6.0.0.0, Culture=neutral, PublicKeyToken=<PublicKeyToken>’ or one of its dependencies. The located assembly’s manifest definition does not match the assembly reference. (Exception from HRESULT: 0x80131040)”.

The issue has nothing to do with a missing file or assembly.

Leave a Comment

Postgres NULLs and NOT IN

Radim Marek lays out a common issue people experience in PostgreSQL:

NOT IN query can return the wrong answer without telling you. It is valid SQL, it runs without an error, and it hands back a perfectly well-formed result set that happens to be empty when it should not be. No warning, no hint, nothing in the logs: just zero rows where you expected hundreds, and a database that considers it correct.

Almost always the cause is a single NULL sitting somewhere you forgot to look, combined with two keywords you have typed a thousand times: NOT IN. None of it is a Postgres bug. This is exactly what the SQL standard mandates, implemented faithfully. That is precisely what makes it so easy to walk into, and why the planner could not safely optimize around it for the better part of Postgres’s history. It comes down to one if statement in the parser.

This is a Postgres-specific problem, as the same code runs successfully in SQL Server. But if you are working with Postgres, it’s good to keep track of this behavior, and Radim has solid advice for a proper workaround.

2 Comments

Noise in CRAN Package Additions

Joseph Rickert shows a consequence of lowering the bar for application development:

If you are reading this post on R-bloggers, you will probably know that I have been publishing my selection of the “Top 40” new R packages on CRAN for quite some time. I did this first as part of my work at Revolution Analytics, then on R Views for RStudio and Posit, and now here on R Works. It used to take about a day’s worth of pleasurable work spread out over a month to select forty interesting packages. For a hundred or so packages, I could look at all of the package webpages, download and play with a small number of them. Now, the “Top 40” has become a real hamster-on-the-wheel project. The following plot shows my count of the number of new packages to make it to CRAN since I began publishing on R Works.

Click through to see what Joseph has laid out. The part that surprises me is, historically, CRAN was pretty difficult to get a package into and you typically needed to jump through a certain number of quality gates. I suppose that has to have changed given what Joseph notes around the lack of documentation in many of these new packages.. But it could be that my understanding of it was wrong H/T R-Bloggers.

Leave a Comment

Jobs and Security Objects in Contained Availability Groups

John Morehouse moves some assets between availability group replicas:

In the first post, I introduced contained availability groups and how they bring contained versions of master and msdb along with the Availability Group. That matters because many applications depend on more than just user databases.

Two of the biggest wins are SQL Agent jobs and security objects.

Click through to see how they work.

Leave a Comment

Vibe Coding and Maintenance

Buck Woody has an essay:

Artificial Intelligence constructs, from Large Language Models answering questions to Agentic AI that runs various workflows are fantastic, amazing, helpful tools in getting a job done. They aren’t quite completely automating entire tasks (The best ones as of this writing are correctly implementing around one out of three tasks accurately: https://llm-stats.com/benchmarks/apex-agents) but they are still a very helpful tool. “Vibe Coding” which means explaining to a model that can write code (or a Codex) what you want the code to do, trying it out, then correcting it until it does the thing, is prevalent everywhere now. And it’s easy to do.

But the code a Codex creates meets a single need: to ship.

This matches pretty well with what I’ve seen. You can definitely build something, which may be good enough for single-person use. But maintenance is a separate story altogether and raises the old adage that you can only maintain code less sophisticated than your knowledge level. Between that and cognitive overload, you can easily end up with a code base that you can’t understand.

Leave a Comment

The Benfit of Disabling V-Order in Fabric Dataflows Gen2

Chris Webb covers a specific use case:

Quite a few new Dataflows Gen2 features were released recently without much fanfare, but that doesn’t mean they aren’t important. I will take a look at them all in my next few posts; in this first post I’ll look at the ability to disable V-Order on staged data.

As the (very detailed) documentation for this new feature describes, V-Order is a write-time optimisation for the parquet files that underpin the Delta tables that OneLake uses to store data. It slows down writing data to the tables but means that reading data from them, for example in Power BI Direct Lake mode, is much faster. 

Click through to see how disabling V-Order can make certain staging loads faster.

Leave a Comment

Treating Query Store Abort Hints like a SQL Firewall

Emad Al-Mousa plays around a bit:

It’s a security mechanism designed to filter, and block unauthorized or malicious SQL query being executed against the database system before it reaches the database kernel itself. It acts as a specialized gatekeeper that ensures only “known good” queries are allowed to run. In a sense, you can compare it with WAF (web application firewall) in terms of protection mechanism.

SQL Firewall will provide protection against the following threats and attacks:

SQL Injection

Privilege Escalation

Data Exfiltration

Emad makes it very clear that this is not a viable technique, but it is also a fun enough idea that it’s worth checking out.

Leave a Comment

Client-Local Parallelism in Kafka

Jack Vanlightly continues a series on Apache Kafka:

Broker-visible parallelism is simple to reason about: if each consumer processes records serially, we add more consumers to increase parallelism. But each consumer adds overhead to the brokers: broker-side protocol state, TCP connections, group membership, fetch state, and participation in the consumer or share group protocol. With long processing times and/or high throughput, the required number of parallel workers can easily exceed what is practical to model as broker-visible consumers.

That is where client-local parallelism becomes important. Instead of scaling by adding more consumers, each consumer application can poll records and process them concurrently inside the client. This allows a smaller number of Kafka consumers to drive a much larger amount of parallel work.

Read on to learn how client-local parallelism fits and how it performs compared to consumer groups.

Leave a Comment

Comparative Query Analysis and Planning

Christophe Pettus has a two-parter. The first post covers how a half-dozen data platform technologies handle cost-based analysis:

PostgreSQL has ANALYZE. You run it (or autovacuum runs it for you), it draws a sample of 300 × default_statistics_target rows, and it writes a row per column into pg_statistic: a null fraction, an n-distinct estimate, a most-common-values list, an equi-depth histogram, and a physical-vs-logical correlation. The planner reads those numbers, multiplies selectivities together, costs a handful of join strategies, and picks one. Three join algorithms are on the menu: nested loop, merge join, hash join.

That is the entire shape of the problem, and every cost-based optimizer ever shipped solves the same one. They differ in three places, and only three: where the numbers come from, how stale the numbers are allowed to get, and which plan shapes are even legal to choose between. The algorithms are the boring part. Everybody hash-joins. The interesting part is the bookkeeping.

Then there’s how each of the systems generates a query plan:

Statistics are the input. Planning is what the database does with them: it takes a declarative query, which describes what you want and says nothing about how, and turns it into an executable plan, which is nothing but how. There are two jobs inside that. First, rewrite the query into a logically equivalent but more tractable shape, which is where subquery flattening, predicate pushdown, and view merging live. Second, search the space of physical plans (join orders, join algorithms, access paths) for the cheapest one the cost model can find. The second job is the hard one, because the number of possible join orders for a query grows faster than anyone wants to contemplate, and every database in this article is, underneath, a strategy for not enumerating all of them.

Two questions separate the six systems here. How does each one tame that search space? And once it has an answer, how much will it let you argue with the result? Those sound like the same question. They are not, and the most useful thing this comparison does is pull them apart. A database can search brilliantly and refuse you any override at all (Snowflake), search crudely and hand you a fistful of hints anyway (MySQL until recently), or search hard and expose every lever ever machined (Oracle). Sophistication of the search and generosity of the control surface are independent axes. Knowing where a system sits on each tells you most of what its planner feels like to live with.

Slightly odd is that there’s a section of DB2 but not on SQL Server. But it is a good cross-comparison of several of the top relational database options.

Leave a Comment