
Day: January 19, 2026

Writing Sparse Pandas DataFrames to S3

Pooja Chhabra tries a few things:

If you’ve worked with large-scale machine learning pipelines, you know that one of the most frustrating bottlenecks isn’t always the complexity of the model or the elegance of the architecture — it’s writing the output efficiently.

Recently, I found myself navigating a complex data engineering hurdle where I needed to write a massive Pandas sparse DataFrame — the high-dimensional output of a CountVectorizer — directly to Amazon S3. By massive, I mean tens of gigabytes of feature data stored in a memory-efficient sparse format that needed to be materialized as a raw CSV file. This legacy requirement existed because our downstream machine learning model was specifically built to ingest only that format, leaving us with a significant I/O challenge that threatened to derail our entire processing timeline.
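One way to sidestep materializing the whole DataFrame at once (my own sketch, not necessarily what Pooja landed on; the function name and chunk size are made up) is to stream the sparse matrix to CSV one row-chunk at a time, so only a single chunk is ever densified in memory. The S3 upload itself would typically go through s3fs or boto3 and is left out here.

```python
import io

import numpy as np
import pandas as pd
from scipy import sparse


def sparse_to_csv(matrix: sparse.csr_matrix, buf, columns, chunk_rows: int = 10_000) -> None:
    """Write a CSR matrix as CSV, densifying only one row-chunk at a time."""
    for start in range(0, matrix.shape[0], chunk_rows):
        chunk = matrix[start:start + chunk_rows].toarray()  # densify just this slice
        pd.DataFrame(chunk, columns=columns).to_csv(
            buf, index=False, header=(start == 0)  # header only on the first chunk
        )


# Tiny demonstration with a 5x3 sparse matrix (CountVectorizer also emits CSR)
m = sparse.csr_matrix(np.eye(5, 3))
out = io.StringIO()
sparse_to_csv(m, out, columns=["a", "b", "c"], chunk_rows=2)
print(len(out.getvalue().splitlines()))  # 6: one header line plus 5 data rows
```

The peak memory cost is then bounded by `chunk_rows` times the column count, rather than by the full dense size of the matrix.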

Read on for two major constraints, a variety of false starts, and what eventually worked.


When Wide Queries Become Slow in SQL Server

Kendra Little talks baggage:

I see this pattern repeatedly: a “wide” query that returns many columns and less than 100k rows runs slowly. SQL Server gets slow when it drags large amounts of baggage through the entire query plan, like a solo traveler struggling with massive suitcases in an airport instead of picking them up close to their destination.

SQL Server often minimizes data access by grabbing all the columns it needs early in query execution, then doing joins and filters. This means presentation columns get picked up early.
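One common mitigation is to resolve the qualifying keys with a narrow query first and join the wide presentation columns back only at the end. This is my sketch rather than Kendra's example; the table and column names are invented, and SQLite (via Python's stdlib) is used only to show that the two query shapes return the same rows, not to reproduce SQL Server's costing behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        order_date TEXT,
        notes TEXT  -- stand-in for many wide "presentation" columns
    )
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, 10, "2026-01-01", "a" * 100),
     (2, 10, "2026-01-02", "b" * 100),
     (3, 20, "2026-01-03", "c" * 100)],
)

# Wide form: the presentation columns travel through the whole plan.
wide = "SELECT order_id, order_date, notes FROM orders WHERE customer_id = ?"

# Narrow-then-join form: find the qualifying keys first, then pick up the
# wide columns only for the rows that actually make it to the result.
narrow = """
    SELECT o.order_id, o.order_date, o.notes
    FROM (SELECT order_id FROM orders WHERE customer_id = ?) AS k
    JOIN orders AS o ON o.order_id = k.order_id
"""

# Both shapes are logically equivalent; on SQL Server the second one can keep
# the "baggage" out of the expensive middle of the plan.
assert sorted(conn.execute(wide, (10,)).fetchall()) == \
       sorted(conn.execute(narrow, (10,)).fetchall())
```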

Read on to see the effects of this, as well as what you can do to mitigate the issue.


Recent Security Updates for SQL Server

John Deardurff puts together a list:

Here is a roundup of recent security updates for SQL Server from the SQL Server Blog announcements.

Read on for links to recent security updates, as well as end-of-support dates for SQL Server 2016 and 2019. John forgot to include 2017 in there, but we’ve still got another year of extended support for that one.

John also clarifies the difference between the CU and GDR paths for SQL Server and when you might choose one versus the other.


A PostgreSQL Query Plan that Changes without Data or Stats Changes

Frederic Yhuel troubleshoots an issue:

We recently encountered a strange optimizer behaviour, reported by one of our customers:

Customer: “Hi Dalibo, we have a query that is very slow on the first execution after a batch process, and then very fast. We initially suspected a caching effect, but then we noticed that the execution plan was different.”

Dalibo: “That’s a common issue. Autoanalyze didn’t have the opportunity to process the table after the batch job had finished, and before the first execution of the query. You should run the VACUUM ANALYZE command (or at least ANALYZE) immediately after your batch job.”

Customer: “Yes, it actually solves the problem, but… your hypothesis is wrong. We looked at pg_stat_user_tables, and are certain that the tables were not vacuumed or analyzed between the slow and fast executions. We don’t have a production problem, but we would like to understand.”

Dalibo: “That’s very surprising! We would also like to understand…”

So let’s dive in!

Read on for a description of the issue and what Frederic and team found.


Workload Simulation in PostgreSQL

Dave Page announces a new load testing tool:

Most database benchmarking tools focus on raw throughput: how many queries per second can the database handle at maximum load? Whilst this is valuable information, it tells us little about how a system will cope with real-world usage patterns.

Consider a typical e-commerce platform. Traffic peaks during lunch breaks and evenings, drops off overnight, and behaves differently at weekends compared to weekdays. A stock trading application has intense activity during market hours and virtually none outside them. These temporal patterns matter enormously for capacity planning, replication testing, and failover validation.
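As a toy illustration of such a temporal pattern (my own sketch; nothing here comes from the pgEdge Load Generator itself, and the rate numbers are invented), a target transaction rate can be shaped with a simple diurnal curve:

```python
import math


def target_tps(hour: float, base: float = 50.0, peak: float = 400.0) -> float:
    """Diurnal load curve: trough around 04:00, peak around 16:00.

    Purely illustrative numbers; a driver would sample this each second
    and pace its workers to match the returned transactions-per-second.
    """
    # Cosine shifted so that hour=16 is the maximum and hour=4 the minimum
    phase = math.cos((hour - 16.0) / 24.0 * 2.0 * math.pi)
    return base + (peak - base) * (phase + 1.0) / 2.0


print(round(target_tps(16)))  # 400 -- afternoon peak
print(round(target_tps(4)))   # 50  -- overnight trough
```

A real tool layers more on top (weekday/weekend differences, ramp-up, jitter), but the core idea is the same: the offered load is a function of time, not a constant.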

Click through for more information, as well as a link to the pgEdge Load Generator GitHub repo.
