Press "Enter" to skip to content

Day: June 9, 2025

Advanced Imputation Techniques via scikit-learn

Ivan Palomares Carrascosa isn’t just using the median:

Missing values appear more often than not in many real-world datasets. There can be instances with missing values in one or several of their attributes for various reasons, such as human error, corrupted data, or incomplete data collection processes, e.g. from surveys with optional fields. While there exist basic strategies to deal with instances or attributes containing missing values, — like removing rows or columns entirely, or imputing missing values with a default value (typically the mean or median of the attribute) — these strategies are sometimes not sufficient.

This article presents some advanced strategies to handle missing data, namely, imputation techniques made possible through a combined use of Pandas and Scikit-learn libraries in Python.

Click through for three such techniques, including an example of how to use the technique and under which circumstances to avoid that technique.

Leave a Comment

Sockets vs Cores and SQL Server

Vlad Drumea checks server settings:

It’s not uncommon that I run into a VM that’s configured with something like 6 or more cores with each core on one socket.

Here’s an example how this would show up in Task Manager for a VM with 16 CPU cores configured with 1 core per socket.

Read on to learn why this particular configuration can turn out so poorly with SQL Server, particularly when you use Standard Edition.

Leave a Comment

Optimizing SQL Server via Indirect Checkpoints

Jon Russell covers a quiet feature:

A checkpoint is a background process that writes dirty pages to disk. A checkpoint performs a full scan of the pages in the buffer pool, lists all the dirty pages that are yet to be written to disk, and finally writes those pages to disk. In SQL instances that do not have many dirty pages in the buffer pool, this is a trivial operation. However, with SQL instances that have OLTP databases, use more memory and/or involve sequential scanning of all pages, the performance of the system could be impacted.

With SQL Server 2012, indirect checkpoints were introduced. In this case, the dirty page manager manages the dirty page list and keeps track of all the dirty page modifiers of the database. By default, it runs every 60 seconds and tracks the dirty pages that need to be flushed.

Read on to learn more about why indirect checkpointing exists, the kinds of capabilities it offers, and the extent to which you might want to tweak its settings.

Leave a Comment

When All AG Nodes are Secondaries

Randy Knight demands quorum:

If you’ve encountered a situation where none of your SQL Server Always On Availability Group (AG) replicas become PRIMARY after a cluster failure — you’re not alone.  We recently had a customer with this exact scenario (AG won’t become primary after force quorum), and it is both uncommon and difficult to troubleshoot so I thought it would be worth posting about.

Click through for the scenario, what’s happening, and how to resolve this.

Leave a Comment

Trying out the PostgreSQL Extension for VS Code

Joey D’Antoni is pleased:

I’ve been using it for about a month now, and I have been favorably impressed. It’s a lot better user experience than PGAdmin, has some additional features—like being able to easily export query results. Much like some of the other Azure Data solutions—you can also connect with Entra ID and navigate to your Azure resources. Beyond the basics—you can also use Copilot to chat with your database.

Click through for a demonstration of the extension.

Leave a Comment

High Water Mark and PostgreSQL Vacuum Operations

Shane Borden troubleshoots an issue:

The first thing we came to understand is that the pattern of work on the primary is a somewhat frequent large DELETE statement followed by a data refresh accomplished by a COPY from STDIN command against a partitioned table with 16 hash partitions.

The problem being observed was that periodically the SELECTs occurring on the read replica would time out and not meet the SLA. Upon investigation, we found that the “startup” process on the read replica would periodically request an “exclusive lock” on some random partition. This exclusive lock would block the SELECT (which is partition unaware) and then cause the timeout. But what is causing the timeout?

Read on for the answer and tips on how to determine if you have problems with High Water Mark growth in PostgreSQL.

Leave a Comment

Tuning SSIS Data Flow Buffers

Andy Brownsword speeds things up:

When using data flows in SSIS packages we want the flow to be fast and fluid. Improving performance of the flows will vary in different packages, but one element can consistently help – tuning memory utilisation.

In this post we’ll look at tuning the memory utilisation by altering the number of rows passing through the flow concurrently. Specifically we’re looking at the following properties:

  • DefaultBufferSize
  • DefaultBufferMaxRows

This is a pretty big deal, finding the right size that allows you to keep data flowing smoothly without having to wait for buffers to fill. The defaults are aggressively low in most cases. And good on Andy for providing a series of tests to give practical numbers.

Leave a Comment