Press "Enter" to skip to content

Author: Kevin Feasel

Advanced Imputation Techniques via scikit-learn

Ivan Palomares Carrascosa isn’t just using the median:

Missing values appear more often than not in many real-world datasets. There can be instances with missing values in one or several of their attributes for various reasons, such as human error, corrupted data, or incomplete data collection processes, e.g. from surveys with optional fields. While there exist basic strategies to deal with instances or attributes containing missing values, — like removing rows or columns entirely, or imputing missing values with a default value (typically the mean or median of the attribute) — these strategies are sometimes not sufficient.

This article presents some advanced strategies to handle missing data, namely, imputation techniques made possible through a combined use of Pandas and Scikit-learn libraries in Python.

Click through for three such techniques, including an example of how to use the technique and under which circumstances to avoid that technique.

Leave a Comment

Sockets vs Cores and SQL Server

Vlad Drumea checks server settings:

It’s not uncommon that I run into a VM that’s configured with something like 6 or more cores with each core on one socket.

Here’s an example how this would show up in Task Manager for a VM with 16 CPU cores configured with 1 core per socket.

Read on to learn why this particular configuration can turn out so poorly with SQL Server, particularly when you use Standard Edition.

Leave a Comment

Optimizing SQL Server via Indirect Checkpoints

Jon Russell covers a quiet feature:

A checkpoint is a background process that writes dirty pages to disk. A checkpoint performs a full scan of the pages in the buffer pool, lists all the dirty pages that are yet to be written to disk, and finally writes those pages to disk. In SQL instances that do not have many dirty pages in the buffer pool, this is a trivial operation. However, with SQL instances that have OLTP databases, use more memory and/or involve sequential scanning of all pages, the performance of the system could be impacted.

With SQL Server 2012, indirect checkpoints were introduced. In this case, the dirty page manager manages the dirty page list and keeps track of all the dirty page modifiers of the database. By default, it runs every 60 seconds and tracks the dirty pages that need to be flushed.

Read on to learn more about why indirect checkpointing exists, the kinds of capabilities it offers, and the extent to which you might want to tweak its settings.

Leave a Comment

When All AG Nodes are Secondaries

Randy Knight demands quorum:

If you’ve encountered a situation where none of your SQL Server Always On Availability Group (AG) replicas become PRIMARY after a cluster failure — you’re not alone.  We recently had a customer with this exact scenario (AG won’t become primary after force quorum), and it is both uncommon and difficult to troubleshoot so I thought it would be worth posting about.

Click through for the scenario, what’s happening, and how to resolve this.

Leave a Comment

Trying out the PostgreSQL Extension for VS Code

Joey D’Antoni is pleased:

I’ve been using it for about a month now, and I have been favorably impressed. It’s a lot better user experience than PGAdmin, has some additional features—like being able to easily export query results. Much like some of the other Azure Data solutions—you can also connect with Entra ID and navigate to your Azure resources. Beyond the basics—you can also use Copilot to chat with your database.

Click through for a demonstration of the extension.

Leave a Comment

High Water Mark and PostgreSQL Vacuum Operations

Shane Borden troubleshoots an issue:

The first thing we came to understand is that the pattern of work on the primary is a somewhat frequent large DELETE statement followed by a data refresh accomplished by a COPY from STDIN command against a partitioned table with 16 hash partitions.

The problem being observed was that periodically the SELECTs occurring on the read replica would time out and not meet the SLA. Upon investigation, we found that the “startup” process on the read replica would periodically request an “exclusive lock” on some random partition. This exclusive lock would block the SELECT (which is partition unaware) and then cause the timeout. But what is causing the timeout?

Read on for the answer and tips on how to determine if you have problems with High Water Mark growth in PostgreSQL.

Leave a Comment

Tuning SSIS Data Flow Buffers

Andy Brownsword speeds things up:

When using data flows in SSIS packages we want the flow to be fast and fluid. Improving performance of the flows will vary in different packages, but one element can consistently help – tuning memory utilisation.

In this post we’ll look at tuning the memory utilisation by altering the number of rows passing through the flow concurrently. Specifically we’re looking at the following properties:

  • DefaultBufferSize
  • DefaultBufferMaxRows

This is a pretty big deal, finding the right size that allows you to keep data flowing smoothly without having to wait for buffers to fill. The defaults are aggressively low in most cases. And good on Andy for providing a series of tests to give practical numbers.

Leave a Comment

A Primer on Loss Functions

Kanwal Mehreen compares loss functions:

I must say, with the ongoing hype around machine learning, a lot of people jump straight to the application side without really understanding how things work behind the scenes. What’s our objective with any machine learning model, anyway? You might say, “To make accurate predictions.” Fair enough.

But how do you actually tell your model, “You’re close” or “You’re way off”? How does it know it made a mistake — and by how much?

That’s where loss functions come in.

Read on to learn what loss functions are, how they work, and when you might want to choose each.

Leave a Comment

Listing Items in a Fabric Workspace using Powershell

Rob Sewell continues a series on working with Microsoft Fabric assets via Powershell:

Having created a workspace, a lakehouse, a warehouse, a Fabric SQL database previously, it’s time to move onto something a little more interesting. I was getting bored writing the same post over and over again, so I thought I would show you how to list the items in a Fabric Workspace using PowerShell. This command was super useful for me today to answer the question “What do they have in this workspace?” and I hope it will be useful for you too.

Click through to see how you can accomplish this.

Leave a Comment

The Challenges of Benchmarking

Tomas Vondra digs into some metrics:

I do a fair number of benchmarks, not only to validate patches, but also to find interesting (suspicious) stuff to improve. It’s an important part of my development workflow. And it’s fun 😉 But we’re dealing with complex systems (hardware, OS, DB, application), and that brings challenges. Every now and then I run into something that I don’t quite understand.

Consider a read-only pgbench, the simplest workload there is, with a single SELECT doing lookup by PK. If you do this with a small data set on any machine, the expectation is near linear scaling up to the number of cores. It’s not perfect, CPUs have frequency scaling and power management, but it should be close.

Click through for a quick look at some odd behavior, followed by a lot of interesting digging into the weeds trying to find answers.

Leave a Comment