
Author: Kevin Feasel

Making k-Means Clustering Better

Matthew Mayo shares a few tips:

The k-means algorithm is a cornerstone of unsupervised machine learning, known for its simplicity and trusted for its efficiency in partitioning data into a predetermined number of clusters. Its straightforward approach — assigning data points to the nearest centroid and then updating the centroid based on the mean of the assigned points — makes it one of the first algorithms most data scientists learn. It is a workhorse, capable of providing quick and valuable insights into the underlying structure of a dataset.

This simplicity comes with a set of limitations, however. Standard k-means often struggles when faced with the complexities of real-world data. Its performance can be sensitive to the initial placement of centroids, it requires the number of clusters to be specified in advance, and it fundamentally assumes that clusters are spherical and evenly sized. These assumptions rarely hold true in the wild, leading to suboptimal or even misleading results.

Read on for a few ways to relax some of the constraints in k-means clustering.
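As a rough illustration of the assign-then-update loop described in the excerpt, here is a minimal sketch of standard k-means in plain Python. The function name `kmeans`, its parameters, and the sample points are all made up for this example and are not taken from Matthew's article; a real project would more likely reach for a library implementation.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples (illustrative only)."""
    rng = random.Random(seed)
    # Initial centroids: k distinct points chosen at random. The outcome can
    # depend on this choice, which is the sensitivity the excerpt mentions.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so stop iterating
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels


# Hypothetical sample data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(sorted(centroids), labels)
```

Note that `k` has to be passed in up front and distance is plain Euclidean, which is where the "number of clusters in advance" and "spherical clusters" limitations from the excerpt come from.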


Incremental Data Load into Parquet Files from Python

Lee Asher loads some data:

Parquet is a column-oriented open-source storage format increasingly used for “big data” analytics. Yet despite its growing popularity as a native format for data lakes and data warehouses, tools for maintaining these environments remain scarce. Getting data from a SQL environment into Parquet isn’t difficult – but how do we maintain that data over time, keeping it current? In other words, if we already have an existing Parquet file, how can we efficiently append new data to it?

In this article, we’ll introduce the Parquet format, explain some strategies for incrementally updating a Parquet repository, and, with a simple Python script, implement a nightly-feed update process.

Not listed here is one word that I expected: Delta. That’s how we normally handle incremental data modification over Parquet data, either that or Apache Iceberg. Lee shows us a different route that can work.


PostgreSQL for the Oracle DBA: Tuples and MVCC

Kellyn Gorman continues a series on PostgreSQL for Oracle DBAs:

After our first two weeks of ensuring Grant and I didn’t burn down SQLServerCentral figuring out how Steve Jones has kept the pace he has for so long (quite an impressive feat, I think we’d both agree!), I’m back to working with my comparisons and building more knowledge in PostgreSQL. What caught my attention this week was the simple concept of a row (or tuple), which might seem universal in relational systems. After all, data is data no matter the platform, right? But under the hood, the way databases store, manage, and control visibility of that data can differ drastically. This is especially true when comparing PostgreSQL with how Oracle manages rows.

Read on to learn more about how these systems work.


Fast-Path Search in OrioleDB

Alexander Korotkov describes a new feature coming to OrioleDB:

When you optimize the CPU time of a transactional database management system, it comes down to one question: how fast can you read a page without breaking consistency? In this post, we explore how OrioleDB avoids locks, trims memory copies, and — starting with beta12 — even bypasses both copying and tuple deforming altogether for fixed-length types during intra-page search. This means that not only are memory copies skipped, but the overhead of reconstructing tuples is also eliminated. The result: an even faster read path, with no manual tuning required.

Read on to see what’s new and how it works.


Installing SQL Server CUs via dbatools

David Seis updates an instance or thirty:

Keeping your SQL Server instance up to date with the latest service packs and cumulative updates is one of the key components of SQL Server security. In this blog post, we will audit the dbatools command Update-DbaInstance, which can save you hours every month and help keep your environment more secure.

I will test, review, and evaluate the script based on a series of identical steps. Our goal is to provide insights, warnings, and recommendations to help you use this script effectively and safely. Update-DbaInstance is a powerful tool to automate the update of one or many SQL Server instances without needing to log in.

Click through for David’s thoughts on the cmdlet.


Query Store Plan Forcing and RECOMPILE

Vlad Drumea tests Betteridge’s Law of Headlines:

It’s 3AM and I’m curious if Query Store plan forcing works on queries that use the RECOMPILE hint (aka OPTION(RECOMPILE) ).
So might as well make a blog post out of it.

Hmm, on the one hand, Vlad blatantly violates Betteridge’s Law of Headlines, which loses points. On the other hand, Vlad dunks hard on LLMs, gaining points. I’m so conflicted right now…

But not about reading Vlad’s post. Go read it.


Choosing a Good Split for a Decision Tree

Ivan Palomares Carrascosa continues a series on decision trees:

But what are the underlying mechanisms that make decision trees so well-suited for various predictive tasks? And what criteria are internally used to construct them? Specifically, how are nodes recursively split as the tree-shaped structure is formed? This article takes a closer look at the inner workings of decision trees, focusing on how branches are created through deliberate, data-driven splitting (spoiler: it certainly doesn’t happen at random).

One of the main principles of CART is finding efficient splits for trees, and this article digs into some of those details.
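To make "data-driven splitting" a little more concrete, here is a minimal sketch of one common criterion, Gini impurity, applied to a single numeric feature. It is an assumption that this matches what Ivan's article covers in detail; the function names `gini` and `best_split` and the sample data are hypothetical and exist only for this illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' split.

    The threshold is None when no candidate split improves on the unsplit node.
    """
    n = len(values)
    distinct = sorted(set(values))
    best = (None, gini(labels))  # baseline: impurity of the unsplit node
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Weight each side's impurity by the share of rows it receives.
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical sample data: one numeric feature and a two-class label.
values = [2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["no", "no", "yes", "yes", "yes"]
threshold, score = best_split(values, labels)
print(threshold, score)
```

A full tree builder would repeat this search over every feature and then recurse into the left and right subsets, stopping when a node is pure or too small to split further.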


Result Set Chaining in Snowflake

Kevin Wilkie tries out a new operator:

In a recent Snowflake release, a slick new operator quietly entered the scene: ->>. This little guy can make certain query workflows both more readable and more efficient, especially when you’re dealing with multi-step commands like SHOW, LIST, or DESCRIBE.

Click through to see how it works. Seems that this operator has some pretty strict limitations, but for certain use cases, it’s quite nice.
