Press "Enter" to skip to content

Day: June 23, 2026

Spark DataFrameWriters

Miles Cole compares two generations of DataFrameWriter:

Most Spark developers learn to write data with df.write long before they ever encounter df.writeTo. It is simple, familiar, and everywhere: choose a format, pick a mode, add a few options, and save the result to a table or path. For years, that mental model worked well enough. Spark was often writing files first and tables second.

But modern lakehouse systems have changed the contract.

Read on to learn how, and what common problem the DataFrameWriterV2 is there to solve.

Leave a Comment

The Importance of Testing Received Wisdom

Mark Wilkinson lays out an argument:

Life is full of “absolutes”. For example, the Star Trek: The Next Generation episode “The Measure of a Man” is often cited as the best episode of the series, and many folks will tell you that you should never adjust max worker threads. But once you take the time to dig in, you realize that “Darmok” is in FACT the best episode of ST:TNG, and you’ll also find a small cohort of folks adjusting max worker threads on all of their SQL Server instances. Are these people just abnoxious contrarians? No. They just did their own testing to validate the common wisdom.

Click through for an example from Mark around 64K allocation unit sizes for NTFS volumes. And I’ll give one on max worker threads. I had a consulting client at one point which had per-customer databases. Each customer was, in general, quite small, so they had thousands of databases on the instance. They also wanted high availability on the system, so they wanted each database mirrored to a different server.

If they didn’t spike max worker threads to extreme levels, the server would have fallen over simply from the weight of all of the open database mirroring connections. The actual server workload was fine and it could handle all of the open worker threads because the large majority were doing nothing. But if a zealous problem-solver popped in, ran a diagnostic, saw that they were violating “best practices,” and “fixed” the problem, that would have been a bad day.

Unrelated but similar story: the one time they did need to fail over due to an emergency, it was also a bad day. Because even if the instances can handle 2500+ databases, it turns out that having them all fail over at the same time on low-powered Azure hardware was not a pleasant experience.

Leave a Comment

Optimizing Polymorphic Associations in Postgres

Andrei Lepikhov continues a thread:

Recently, I looked into how common polymorphic associations actually are in relational databases — a performance-hostile pattern built around a discriminated foreign key that ORMs (Rails, Django, Hibernate), CRM platforms (Salesforce), and 1C generate automatically. The front page of a typical online store, or the activity feed of a CRM, is built by exactly this kind of query: a base table is LEFT JOIN-ed to every possible subtype through a (type, id) pair of columns.

That earlier article answered the question ‘how widespread is this pattern?’ After all, if you’re going to improve something, it helps to know how useful the improvement will be, right? Here, I want to give a sense of how this pattern leads to performance regressions and point out directions in the PostgreSQL optimiser that could make the situation easier.

Much of this is speculative in nature but the three proposed solution ideas are all interesting.

Leave a Comment

Patched SQL Injection Vulnerability in sys.sp_dbmmonitorupdate

Fabiano Amorim digs into a fixed issue:

What makes this case particularly interesting is not just that the vulnerability exists in a trusted system object, but how it works: the injection bypasses a REPLACE-based sanitization attempt through a subtle Unicode character conversion that happens silently during a variable assignment.

The vulnerability was reported to Microsoft and they have since fixed it, but it’s still worth exposing and explaining given how intricate it is. So, that’s what I’ll do in this article.

Click through to see how it works. And of course this database mirroring stored procedure is still hanging around long after database mirroring itself was deprecated. But that’s the downside to deprecation without subsequent removal.

Leave a Comment