Press "Enter" to skip to content

Curated SQL Posts

Managing Power BI Assets with semantic-link-labs

Kurt Buhler takes us through a Python library:

Thus far, the part of Microsoft Fabric that I’ve personally found the most interesting is not Copilot, Direct Lake, or its data warehousing capabilities, but a combination of notebooks and simple file/table storage via Lakehouses. Specifically, the library semantic link and its “expansion pack” semantic-link-labs, spearheaded by Michael Kovalsky. These tools help you build, manage, use, and audit the various items in Fabric from a Python notebook, including Power BI semantic models and reports.

Semantic-link-labs provide a lot of convenient functions that you can use to automate and streamline certain tasks during Power BI development; both of models and reports. For me, I’m particularly interested in the reporting functionalities, because this is where I typically find that I lose the most time, and because there is a drought of tools to address this area.

Read the whole thing.

Comments closed

The Internals of Data Updates in PostgreSQL

Cary Huang explains how update operations work:

In previous blogs, we talked about an overview of PostgreSQL’s table access method API here , how sequential scan is handled within this API here, and how data insertion is handled here. Today in this blog, we will look closely into how PostgreSQL handles update. A successful update in PostgreSQL can be viewed as “insert a new record” while “marking the old record as invisible” due to the MVCC technique that PostgreSQL employs. It sounds simple enough, but there are quite a lot of considerations in place to make a successful update. Let’s dive in.

There’s a lot going on behind the scenes, and Cary does a good job of explaining it all.

Comments closed

Techniques for Unpivoting Data in SQL Server

Jared Westover performs a technique showdown:

A few weeks ago, I helped someone combine multiple integer columns into a single column. The common term for this process is unpivoting or transposing. The table they worked with had millions of rows, so they needed the most efficient method to accomplish this task. If you search online, you’ll find several suggestions for doing this. But which performs best? Is one easier to maintain than the others?

Click through for a review of three separate techniques: using the UNPIVOT operator, using UNION ALL, and using CROSS APPLY. The dataset was relatively small, but even at that size, CROSS APPLY did a good job. But I won’t spoil too much here.

Comments closed

Query Store Not Capturing Runtime Stats for Unfinished Queries

Matt Changchien does a bit of testing:

It’s beneficial to understand how Query Store captures query runtime stats when they are used for performance troubleshooting. According to public documentation: upon the query execution, runtime statistics are sent to Query Store.

I am curious whether Query Store will still capture execution statistics under certain scenarios. Hence, in this blog post, I will be testing the two following scenarios with my Azure SQL Database:

  1. Whether Query Store captures runtime stats for query that has completed within an uncommitted transaction, then the transaction rolls back due to a scaling process.
  2. Whether Query Store captures runtime stats for query that is blocked and does not finish before a disconnect caused by a disruptive event (e.g., a service tier scale-up).

Read on to learn more about these two scenarios. The specific context is Azure SQL Database, though the same rules would apply on-premises as well.

Comments closed

Not all Window Function Calls are Deterministic

Chen Hirsh reminds us of an important fact:

One of the scariest things that can happen when you develop SQL code is getting indeterminate results. When you run your query multiple times and each time gets a different value.

How can that happen?

SQL functions or queries can be categorized into two groups – deterministic vs. non-deterministic.

I even have a demo where I show off the non-deterministic nature of ROW_NUMBER() versus, say, RANK() or DENSE_RANK(). But there’s nothing too scary about it: just ensure that your PARTITION BY criteria are guaranteed to be unique and you’re good to go. H/T Madeira Data Solutions blog.

Comments closed

Test those Backups

Grant Fritchey provides an important public service announcement:

Please, let me reiterate: The only valid test of a backup is a restore.

THE

ONLY

VALID

TEST

OF

A

BACKUP

IS

A

RESTORE.

Grant is absolutely correct here. And do you know how hard it is to test a SQL Server backup? With dbatools, it’s this hard: Test-DbaLastBackup. That one cmdlet restores a backup (under a different name so it doesn’t clobber your existing database), runs DBCC CHECKDB against the restored data, drops the test database after the fact, and even allows you to do all of that against a different SQL Server instance.

Incidentally, my next YouTube series just happens to be around backup and restoration in SQL Server, so stay tuned if you’re interested.

Comments closed

Tracking Airport Traffic with Flink, Kafka, and NiFi

Tim Spann builds an app:

The above link utilizes the standard REST link and enhances it by setting the beginning date using NiFi’s Expression language to get the current time in UNIX format in seconds. In this example, I am looking at the last week of data for the airport departures and arrivals in the second URL.

We iterate through a list of the largest airports in the United States doing both departures and arrivals since they use the same format.

Read the article to learn more about how you can tie it all together. You can also check out Tim’s GitHub repo to grab the code.

Comments closed

Handling Missing Data with XGBoost

Vinod Chugani is missing a few data points:

XGBoost has gained widespread recognition for its impressive performance in numerous Kaggle competitions, making it a favored choice for tackling complex machine learning challenges. Known for its efficiency in handling large datasets, this powerful algorithm stands out for its practicality and effectiveness.

In this post, we will apply XGBoost to the Ames Housing dataset to demonstrate its unique capabilities. Building on our prior discussion of the Gradient Boosting Regressor (GBR), we will explore key features that differentiate XGBoost from GBR, including its advanced approach to managing missing values and categorical data.

Read on to see how it fares.

Comments closed

Window Functions and Running Totals

Steve Jones makes a comparison:

Often I see running totals that are written in SQL using a variety of techniques. Many pieces of code were written in pre-2012 techniques, prior to window functions being introduced.

After SQL Server 2012, we had better ways to write a total. In this case, let’s see how much better. This is based on an article showing how you might convert code from the first query to the second. This is a performance analysis of the two techniques are different scales..

Steve shows a very fancy version of the self-join technique, which is actually even slower than using a cursor for this work. The fastest variant on the technique was something nicknamed the ‘quirky update’ technique, but it relied on an accidental property of how clustered indexes worked on temp tables in SQL Server and was not something Microsoft ever officially supported, meaning that any service pack, cumulative update, or hotfix might have broken your code and the best you’d get is an indifferent shrug.

Comments closed