Press "Enter" to skip to content

Curated SQL Posts

Stolen Memory in SQL Server

Chad Callihan is overdrawn at the memory bank:

When you specify the max amount of memory that SQL Server can use, you may think that’s all the memory to be used for the buffer pool. That’s not the case. SQL Server will use memory not only for the buffer pool but also in tasks such as sorts, memory grants, and other internal tasks. These tasks can swoop in and “steal” memory from the buffer pool, causing more work for SQL Server.

Click through for some notes on the topic, as well as your reminder that “128 GB for the buffer pool” in Standard Edition doesn’t mean “128 GB total memory usage for SQL Server.”

Leave a Comment

Percentage Splits with Window Functions

Andy Brownsword breaks things up:

Sometimes you want to segment records. It may be splitting a customer base for marketing purposes, or segmenting a user base for a new feature. Good segmentation makes clean divisions in the data.

In this post we’ll see a way to achieve that with a great deal of help from Window Functions.

Click through for Andy’s motivation, which is a way that absolutely will not work the way you want it to.

Leave a Comment

Making k-Means Clustering Better

Matthew Mayo shares a few tips:

The k-means algorithm is a cornerstone of unsupervised machine learning, known for its simplicity and trusted for its efficiency in partitioning data into a predetermined number of clusters. Its straightforward approach — assigning data points to the nearest centroid and then updating the centroid based on the mean of the assigned points — makes it one of the first algorithms most data scientists learn. It is a workhorse, capable of providing quick and valuable insights into the underlying structure of a dataset.

This simplicity comes with a set of limitations, however. Standard k-means often struggles when faced with the complexities of real-world data. Its performance can be sensitive to the initial placement of centroids, it requires the number of clusters to be specified in advance, and it fundamentally assumes that clusters are spherical and evenly sized. These assumptions rarely hold true in the wild, leading to suboptimal or even misleading results.

Read on for a few ways to relax some of the constraints in k-means clustering.

Comments closed

Incremental Data Load into Parquet Files from Python

Lee Asher loads some data:

Parquet is a column-oriented open-source storage format increasingly used for “big data” analytics. Yet despite its growing popularity as a native format for data lakes and data warehouses, tools for maintaining these environments remain scarce. Getting data from a SQL environment into Parquet isn’t difficult – but how do we maintain that data over time, keeping it current? In other words, if we already have an existing Parquet file, how can we efficiently append new data to it?

In this article, we’ll introduce the Parquet format, explain some strategies for incrementally updating a Parquet repository, and, with a simple Python script, implement a nightly-feed update process.

Not listed in here is one word that I expected: Delta. Because that’s how we normally do incremental data modification in Parquet data. Either that or Apache Iceberg. Lee shows us a different route that can work.

Comments closed

PostgreSQL for the Oracle DBA: Tuples and MVCC

Kellyn Gorman continues a series on PostgreSQL for Oracle DBAs:

After our first two weeks of ensuring Grant and I didn’t burn down SQLServerCentral figuring out how Steve Jones has kept the pace he has for so long, (quite an impressive feat, I think we’d both agree!) I’m back to working with my comparisons and building more knowledge in PostgreSQL.  What caught my attention this week was the simple concept of a row (or tuple) which might seem universal in relational systems, after all, data is data no matter the platform, right? But under the hood, the way databases store, manage, and control visibility of that data can differ drastically. This is especially true when comparing PostgreSQL with how Oracle manages rows.

Read on to learn more about how these systems work.

Comments closed

Fast-Path Search in OrioleDB

Alexander Korotkov describes a new feature coming to OrioleDB:

When you optimize the CPU time of a transactional database management system, it comes down to one question: how fast can you read a page without breaking consistency? In this post, we explore how OrioleDB avoids locks, trims memory copies, and — starting with beta12 — even bypasses both copying and tuple deforming altogether for fixed-length types during intra-page search. This means that not only are memory copies skipped, but the overhead of reconstructing tuples is also eliminated. The result: an even faster read path, with no manual tuning required.

Read on to see what’s new and how it works.

Comments closed

Installing SQL Server CUs via dbatools

David Seis updates an instance or thirty:

Keeping your SQL Server instance up to date with the latest service packs and cumulative updates is one of the key components of SQL Server security. In this blog post, we will audit the dbatools command Update-DbaInstance, which can save you hours every month and help keep your environment more secure.

I will test, review, and evaluate the script based on a series of identical steps. Our goal is to provide insights, warnings, and recommendations to help you use this script effectively and safely. Update-DbaInstance is powerful tool to automate the update of one or many SQL server instances without needing to log in.

Click through for David’s thoughts on the cmdlet.

Comments closed

Query Store Plan Forcing and RECOMPILE

Vlad Drumea tests Betteridge’s Law of Headlines:

It’s 3AM and I’m curious if Query Store plan forcing works on queries that use the RECOMPILE hint (aka OPTION(RECOMPILE) ).
So might as well make a blog post out of it.

Hmm, on the one hand, Vlad blatantly violates Betteridge’s Law of Headlines, which loses points. On the other hand, Vlad dunks hard on LLMs, gaining points. I’m so conflicted right now…

But not about reading Vlad’s post. Go read it.

1 Comment

Choosing a Good Split for a Decision Tree

Ivan Palomares Carrascosa continues a series on decision trees:

But what are the underlying mechanisms that make decision trees so well-suited for various predictive tasks? And what criteria are internally used to construct them? Specifically, how are nodes recursively split as the tree-shaped structure is formed? This article takes a closer look at the inner workings of decision trees, focusing on how branches are created through deliberate, data-driven splitting (spoiler: it certainly doesn’t happen at random).

One of the main principles of CART is around finding efficient splits for trees, and this digs into some of those details.

Comments closed