Press "Enter" to skip to content

Author: Kevin Feasel

Writing Sparse Pandas DataFrames to S3

Pooja Chhabra tries a few things:

If you’ve worked with large-scale machine learning pipelines, you know that one of the most frustrating bottlenecks isn’t always the complexity of the model or the elegance of the architecture; it’s writing the output efficiently.

Recently, I found myself navigating a complex data engineering hurdle where I needed to write a massive Pandas sparse DataFrame — the high-dimensional output of a CountVectorizer — directly to Amazon S3. By massive, I mean tens of gigabytes of feature data stored in a memory-efficient sparse format that needed to be materialized as a raw CSV file. This legacy requirement existed because our downstream machine learning model was specifically built to ingest only that format, leaving us with a significant I/O challenge that threatened to derail our entire processing timeline.

Read on for two major constraints, a variety of false starts, and what eventually worked.
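
To make the shape of the problem concrete, here is a minimal sketch of the starting point: a CountVectorizer output held as a sparse Pandas DataFrame, densified one row chunk at a time and written to S3 as CSV with boto3. The corpus, bucket, key, and chunk size are all placeholders, and this is not Pooja's eventual solution; the post walks through what actually worked at scale.

    import io

    import boto3
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    # Hypothetical corpus; in the post this is tens of gigabytes of features.
    docs = ["the quick brown fox", "jumps over the lazy dog"]
    vectorizer = CountVectorizer()
    matrix = vectorizer.fit_transform(docs)  # SciPy CSR sparse matrix

    # Keep the feature matrix memory-efficient as a sparse DataFrame.
    df = pd.DataFrame.sparse.from_spmatrix(
        matrix, columns=vectorizer.get_feature_names_out()
    )

    # Densify and serialize one row chunk at a time so the full dense
    # frame never materializes in memory.
    buffer = io.StringIO()
    chunk_size = 10_000
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start : start + chunk_size].sparse.to_dense()
        chunk.to_csv(buffer, header=(start == 0), index=False)

    # Placeholder bucket and key; truly massive files want multipart upload.
    boto3.client("s3").put_object(
        Bucket="my-bucket", Key="features.csv", Body=buffer.getvalue()
    )

Of course, buffering the whole CSV in a StringIO defeats the purpose at tens of gigabytes, which is exactly why the false starts in the post are worth reading.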


When Wide Queries Become Slow in SQL Server

Kendra Little talks baggage:

I see this pattern repeatedly: a “wide” query that returns many columns and less than 100k rows runs slowly. SQL Server gets slow when it drags large amounts of baggage through the entire query plan, like a solo traveler struggling with massive suitcases in an airport instead of picking them up close to their destination.

SQL Server often minimizes the number of times it accesses data by grabbing all the columns it needs early in query execution, then doing joins and filters. This means presentation columns get picked up early and lugged through the rest of the plan.

Read on to see the effects of this, as well as what you can do to mitigate the issue.
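
One common mitigation, sketched below with invented table and column names, is a deferred join: do the filtering and ordering over a narrow set of keys first, then join back at the very end to pick up the wide presentation columns. Whether the optimizer cooperates depends on your indexes; Kendra's post shows the actual behavior.

    -- Wide columns dragged through the sort (the heavy-suitcase version)
    SELECT TOP (1000) o.OrderID, o.Notes, o.LargeDescription, c.CustomerName
    FROM dbo.Orders AS o
        INNER JOIN dbo.Customers AS c ON c.CustomerID = o.CustomerID
    ORDER BY o.OrderDate DESC;

    -- Deferred join: sort narrow keys first, fetch wide columns last
    SELECT o.OrderID, o.Notes, o.LargeDescription, c.CustomerName
    FROM (
        SELECT TOP (1000) OrderID, OrderDate
        FROM dbo.Orders
        ORDER BY OrderDate DESC
    ) AS k
        INNER JOIN dbo.Orders AS o ON o.OrderID = k.OrderID
        INNER JOIN dbo.Customers AS c ON c.CustomerID = o.CustomerID
    ORDER BY k.OrderDate DESC;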


Recent Security Updates for SQL Server

John Deardurff puts together a list:

Here is a roundup of recent security updates for SQL Server from the SQL Server Blog announcements.

Read on for links to recent security updates, as well as end-of-support dates for SQL Server 2016 and 2019. John forgot to include 2017 in there, but we’ve still got another year of extended support for that one.

John also clarifies the difference between the CU and GDR paths for SQL Server and when you might choose one versus the other.
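
If you are not sure which servicing path a given instance is on, SERVERPROPERTY reports it directly; ProductUpdateLevel comes back NULL when no cumulative update has been applied, which is what you would expect to see on the GDR path.

    SELECT SERVERPROPERTY('ProductVersion')         AS ProductVersion,
           SERVERPROPERTY('ProductLevel')           AS ProductLevel,      -- RTM or SPn
           SERVERPROPERTY('ProductUpdateLevel')     AS UpdateLevel,       -- e.g., CU14; NULL if no CU
           SERVERPROPERTY('ProductUpdateReference') AS KBReference;       -- KB article of the latest update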


A PostgreSQL Query Plan that Changes without Data or Stats Changing

Frederic Yhuel troubleshoots an issue:

We recently encountered a strange optimizer behaviour, reported by one of our customers:

Customer: “Hi Dalibo, we have a query that is very slow on the first execution after a batch process, and then very fast. We initially suspected a caching effect, but then we noticed that the execution plan was different.”

Dalibo: “That’s a common issue. Autoanalyze didn’t have the opportunity to process the table after the batch job had finished, and before the first execution of the query. You should run the VACUUM ANALYZE command (or at least ANALYZE) immediately after your batch job.”

Customer: “Yes, it actually solves the problem, but… your hypothesis is wrong. We looked at pg_stat_user_tables, and are certain that the tables were not vacuumed or analyzed between the slow and fast executions. We don’t have a production problem, but we would like to understand.”

Dalibo: “That’s very surprising! We would also like to understand…”

So let’s dive in!

Read on for a description of the issue and what Frederic and team found.
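
For reference, the customer's check and Dalibo's initial recommendation look roughly like this, with my_table standing in for the real table:

    -- Were the tables vacuumed or analyzed between the slow and fast runs?
    SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
    FROM pg_stat_user_tables
    WHERE relname = 'my_table';

    -- The suggested fix: refresh statistics immediately after the batch job
    VACUUM (ANALYZE) my_table;
    -- or, at minimum
    ANALYZE my_table;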


Workload Simulation in PostgreSQL

Dave Page announces a new load testing tool:

Most database benchmarking tools focus on raw throughput: how many queries per second can the database handle at maximum load? Whilst this is valuable information, it tells us little about how a system will cope with real-world usage patterns.

Consider a typical e-commerce platform. Traffic peaks during lunch breaks and evenings, drops off overnight, and behaves differently at weekends compared to weekdays. A stock trading application has intense activity during market hours and virtually none outside them. These temporal patterns matter enormously for capacity planning, replication testing, and failover validation.

Click through for more information, as well as a link to the pgEdge Load Generator GitHub repo.
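
As a toy illustration of what a temporal pattern means in practice (this is not the pgEdge Load Generator; see the repo for the real thing), a driver can vary its target transaction rate with the time of day:

    import math
    import time

    def issue_query() -> None:
        """Placeholder: fire one transaction against the database here."""
        pass

    def target_tps(hour: float, base: float = 50.0, peak: float = 500.0) -> float:
        # Sine wave shifted so traffic peaks around 13:00 and idles overnight.
        phase = (hour - 7.0) / 24.0 * 2.0 * math.pi
        return base + (peak - base) * max(0.0, math.sin(phase))

    def run(duration_s: float = 10.0) -> None:
        start = time.monotonic()
        while time.monotonic() - start < duration_s:
            now = time.localtime()
            tps = target_tps(now.tm_hour + now.tm_min / 60.0)
            issue_query()
            time.sleep(1.0 / tps)  # pace requests to hit the target rate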


The Analytics Engineer Role in Microsoft Fabric

Paul Turley describes a job role:

Analytics Engineer – The DP-600 exam contains elements of both data analytics and data engineering with a business focus. It is the fence-sitter between the disciplined worlds of data engineering and the abstract creativity of analytics. An Analytics Engineer bridges data engineering and analytics by building enterprise-ready solutions in Microsoft Fabric. They combine skills in SQL, DAX, and KQL with expertise in data modeling and performance optimization. They design and manage dataflows, create semantic models, optimize datasets, and ensure governance to deliver curated, analytics-ready data assets for reporting and analysis.

Read on to see how this role fits and some of the things a person performing this role should know and be able to do.


Dealing with a Full Transaction Log

Rebecca Lewis performs some troubleshooting:

It’s 2am. Your phone wakes you. Rub your eyes, check your email, and there it is:

Error: 9002, Severity: 17, State: 4
The transaction log for database 'trading' is full due to 'LOG_BACKUP'.

The database is still online. Looks ok. You can read from it. But every INSERT, UPDATE, and DELETE fails. Production night-trading is effectively down.

The good news: it’s fixable, but the fix depends entirely on what’s preventing log truncation.

Click through for a choose-your-own-adventure story.
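
As a preview of the first branch of that adventure: the 9002 message already names the truncation blocker, and sys.databases will confirm it. The backup path below is a placeholder.

    -- What is preventing log truncation?
    SELECT name, log_reuse_wait_desc
    FROM sys.databases
    WHERE name = N'trading';

    -- If it reports LOG_BACKUP (as in the error above), take a log backup
    -- so the log can truncate, then investigate why log backups stopped.
    BACKUP LOG trading TO DISK = N'X:\Backups\trading_log.trn';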


Thoughts on On-Disk Rowstore in SQL Server

Hugo Kornelis starts a series on storage structures:

When a query is slow, it is often caused by inefficient access to the data. So our tuning work very frequently comes down to figuring out how data was read, and then massaging our queries or database structures to get SQL Server to access the data in a more efficient way.

So we look at scans, seeks, and lookups. We know that scans are good when we access most of the data. Or, in the case of an ordered scan, to prevent having to sort the data. We know that seeks are preferred when there is a filter in the query. And we know that lookups represent a good tradeoff between better performance and too many indexes, but only if the filter is highly selective.

All of the above is true. And all of it is highly generalized. And hence, often, not true enough to be actually useful.

Read on for an overview of the most common option.
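
To put concrete queries behind those generalizations (table and index names invented):

    -- A narrow nonclustered index
    CREATE INDEX IX_Orders_CustomerID ON dbo.Orders (CustomerID);

    -- Highly selective filter: expect an index seek, plus key lookups
    -- to fetch the columns the index does not cover
    SELECT OrderID, OrderDate, TotalDue
    FROM dbo.Orders
    WHERE CustomerID = 42;

    -- No filter: expect a scan of the clustered index (the whole table)
    SELECT OrderID, OrderDate, TotalDue
    FROM dbo.Orders;

Hugo's point is that these rules of thumb hide a lot of nuance, which is where the series goes next.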


Security and High Availability Checklist for Database Application Vendors

Andreas Wolter has a new version of a checklist:

As a database application vendor, the security and reliability of your software are key competitive differentiators. As a Database Administrator, your priority is ensuring that hosted databases do not expose data or the environment to risk. To support this, we have developed a checklist designed as a blueprint for building secure-by-default, resilient data applications. Following this guidance not only helps you align with ISO 27001 controls but also ensures that your customers can trust your product against the backdrop of a demanding enterprise environment and an evolving threat landscape.

Click through for the list, as well as a PDF version of it. I had a couple of quibbles in my immediate reaction (especially avoiding CLR and triggers), but in fairness, as I thought about whether I’d trust rando vendor XYZ to get it right, I decided that this is good advice.
