Press "Enter" to skip to content

Author: Kevin Feasel

Not all Window Function Calls are Deterministic

Chen Hirsh reminds us of an important fact:

One of the scariest things that can happen when you develop SQL code is getting indeterminate results: you run your query multiple times and get a different value each time.

How can that happen?

SQL functions or queries can be categorized into two groups – deterministic vs. non-deterministic.

I even have a demo where I show off the non-deterministic nature of ROW_NUMBER() versus, say, RANK() or DENSE_RANK(). But there's nothing too scary about it: just ensure that the combination of your PARTITION BY and ORDER BY criteria is guaranteed to be unique and you're good to go. H/T Madeira Data Solutions blog.
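
To make the failure mode concrete, here's a sketch against a hypothetical dbo.Orders table (the table and column names are mine, purely for illustration). When two rows tie on the ORDER BY column, ROW_NUMBER() may number them in either order from run to run; RANK() gives ties the same value; and a unique tiebreaker makes ROW_NUMBER() deterministic:

```sql
-- Hypothetical dbo.Orders table; assume some customers placed
-- two orders on the same OrderDate.
SELECT
    CustomerId,
    OrderId,
    OrderDate,
    -- Non-deterministic on ties: either tied row may get 1 or 2.
    ROW_NUMBER() OVER (PARTITION BY CustomerId ORDER BY OrderDate) AS rn,
    -- Deterministic: tied rows always share the same rank.
    RANK() OVER (PARTITION BY CustomerId ORDER BY OrderDate) AS rnk,
    -- Deterministic: OrderId breaks the tie, so the ordering is unique.
    ROW_NUMBER() OVER (PARTITION BY CustomerId
                       ORDER BY OrderDate, OrderId) AS rn_stable
FROM dbo.Orders;
```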


Query Store Not Capturing Runtime Stats for Unfinished Queries

Matt Changchien does a bit of testing:

It's beneficial to understand how Query Store captures query runtime stats when you use them for performance troubleshooting. According to the public documentation, runtime statistics are sent to Query Store upon query execution.

I am curious whether Query Store will still capture execution statistics under certain scenarios. Hence, in this blog post, I will test the following two scenarios with my Azure SQL Database:

  1. Whether Query Store captures runtime stats for a query that has completed within an uncommitted transaction, where the transaction then rolls back due to a scaling process.
  2. Whether Query Store captures runtime stats for a query that is blocked and does not finish before a disconnect caused by a disruptive event (e.g., a service tier scale-up).

Read on to learn more about these two scenarios. The specific context is Azure SQL Database, though the same rules would apply on-premises as well.
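
If you want to check the outcome for yourself, one approach (a sketch using the standard Query Store catalog views; the LIKE filter is a placeholder for a fragment of your test query) is to look for runtime stats tied to the query's text after the event:

```sql
-- Look for captured runtime stats for a specific query's text.
-- Substitute a recognizable fragment of your test query for the marker.
SELECT
    qt.query_sql_text,
    rs.execution_type_desc,   -- Regular, Aborted, or Exception
    rs.count_executions,
    rs.avg_duration
FROM sys.query_store_query_text AS qt
JOIN sys.query_store_query AS q
    ON q.query_text_id = qt.query_text_id
JOIN sys.query_store_plan AS p
    ON p.query_id = q.query_id
JOIN sys.query_store_runtime_stats AS rs
    ON rs.plan_id = p.plan_id
WHERE qt.query_sql_text LIKE N'%YourTestQueryMarker%';
```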


Test those Backups

Grant Fritchey provides an important public service announcement:

Please, let me reiterate: The only valid test of a backup is a restore.

THE

ONLY

VALID

TEST

OF

A

BACKUP

IS

A

RESTORE.

Grant is absolutely correct here. And do you know how hard it is to test a SQL Server backup? With dbatools, it’s this hard: Test-DbaLastBackup. That one cmdlet restores a backup (under a different name so it doesn’t clobber your existing database), runs DBCC CHECKDB against the restored data, drops the test database after the fact, and even allows you to do all of that against a different SQL Server instance.
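
Under the covers, that test boils down to T-SQL you could run by hand. A minimal sketch of the same sequence (all database and file names here are hypothetical; use RESTORE FILELISTONLY to find the real logical file names in your backup):

```sql
-- 1. Restore the backup under a throwaway name so nothing gets clobbered.
RESTORE DATABASE [Sales_BackupTest]
FROM DISK = N'D:\Backups\Sales.bak'
WITH MOVE N'Sales'     TO N'D:\Data\Sales_BackupTest.mdf',
     MOVE N'Sales_log' TO N'D:\Logs\Sales_BackupTest.ldf',
     STATS = 10;

-- 2. Check the restored copy for corruption.
DBCC CHECKDB (N'Sales_BackupTest') WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- 3. Clean up the test database.
DROP DATABASE [Sales_BackupTest];
```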

Incidentally, my next YouTube series just happens to be around backup and restoration in SQL Server, so stay tuned if you’re interested.


Handling Missing Data with XGBoost

Vinod Chugani is missing a few data points:

XGBoost has gained widespread recognition for its impressive performance in numerous Kaggle competitions, making it a favored choice for tackling complex machine learning challenges. Known for its efficiency in handling large datasets, this powerful algorithm stands out for its practicality and effectiveness.

In this post, we will apply XGBoost to the Ames Housing dataset to demonstrate its unique capabilities. Building on our prior discussion of the Gradient Boosting Regressor (GBR), we will explore key features that differentiate XGBoost from GBR, including its advanced approach to managing missing values and categorical data.

Read on to see how it fares.


Tracking Airport Traffic with Flink, Kafka, and NiFi

Tim Spann builds an app:

The above link utilizes the standard REST link and enhances it by setting the beginning date using NiFi’s Expression language to get the current time in UNIX format in seconds. In this example, I am looking at the last week of data for the airport departures and arrivals in the second URL.

We iterate through a list of the largest airports in the United States doing both departures and arrivals since they use the same format.

Read the article to learn more about how you can tie it all together. You can also check out Tim’s GitHub repo to grab the code.


Window Functions and Running Totals

Steve Jones makes a comparison:

Often I see running totals written in SQL using a variety of techniques. Many pieces of code were written using pre-2012 techniques, prior to window functions being introduced.

After SQL Server 2012, we had better ways to write a running total. In this case, let's see how much better. This is based on an article showing how you might convert code from the first query to the second. This is a performance analysis of the two techniques at different scales.

Steve shows a very fancy version of the self-join technique, which is actually even slower than using a cursor for this work. The fastest variant on the technique was something nicknamed the ‘quirky update’ technique, but it relied on an accidental property of how clustered indexes worked on temp tables in SQL Server and was not something Microsoft ever officially supported, meaning that any service pack, cumulative update, or hotfix might have broken your code and the best you’d get is an indifferent shrug.
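
For reference, here's the shape of both approaches against a hypothetical dbo.Sales table (names are mine, and I'm assuming SaleDate is unique per row). The self-join re-reads every prior row for each output row, while the windowed SUM makes a single ordered pass:

```sql
-- Pre-2012 style: self-join running total.
-- Work grows roughly quadratically with row count.
SELECT s.SaleDate,
       SUM(s2.Amount) AS RunningTotal
FROM dbo.Sales AS s
JOIN dbo.Sales AS s2
    ON s2.SaleDate <= s.SaleDate
GROUP BY s.SaleDate;

-- SQL Server 2012+ window function: one ordered pass over the data.
SELECT SaleDate,
       SUM(Amount) OVER (ORDER BY SaleDate
                         ROWS BETWEEN UNBOUNDED PRECEDING
                                  AND CURRENT ROW) AS RunningTotal
FROM dbo.Sales;
```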


Time-Weighted Averages and Values in Azure Data Explorer

Adi Eldar shows off a few new functions:

Azure Data Explorer (ADX) supports time series aggregation at scale, either by the summarize operator, which keeps the aggregated data in tabular format, or by the make-series operator, which transforms it into a set of dynamic arrays. There are multiple aggregation functions, and avg() is one of the most popular. ADX calculates it by grouping the samples into fixed time bins and applying a simple average of all samples inside each time bin, regardless of their specific location inside the bin. This is the standard time bin aggregation as done by SQL and other databases.

However, there are scenarios where a simple average doesn't accurately represent the time bin value. For example, IoT devices commonly emit metric values asynchronously, only upon change, to conserve bandwidth. In that case, we need to calculate a Time-Weighted Average (TWA), taking into consideration the exact timestamp and duration of each value inside the time bin. ADX doesn't have native aggregation functions to calculate time-weighted averages, but we have just added a few user-defined functions to the Functions Library that support it:

Digging into how the weighted averages work, they’re pretty interesting.
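
Adi's functions are KQL, but the underlying idea translates anywhere: weight each sample by how long it stayed current within the bin, rather than counting each sample once. Here's a simplified sketch of that idea in T-SQL terms (a hypothetical dbo.Readings table; it clamps each sample's duration to the end of its hour bin and ignores the value carried in from the previous bin):

```sql
WITH Spans AS (
    SELECT DeviceId,
           DATETRUNC(hour, ReadingTime) AS HourBin,   -- SQL Server 2022+
           ReadingTime,
           Value,
           LEAD(ReadingTime) OVER (PARTITION BY DeviceId
                                   ORDER BY ReadingTime) AS NextTime
    FROM dbo.Readings
),
Weighted AS (
    SELECT DeviceId, HourBin, Value,
           -- How long the value stayed current, clamped to the end of
           -- the bin (the ELSE also covers the last sample, where
           -- NextTime is NULL).
           DATEDIFF(second, ReadingTime,
                    CASE WHEN NextTime < DATEADD(hour, 1, HourBin)
                         THEN NextTime
                         ELSE DATEADD(hour, 1, HourBin)
                    END) AS DurationSec
    FROM Spans
)
SELECT DeviceId, HourBin,
       SUM(Value * DurationSec) * 1.0 / SUM(DurationSec) AS TimeWeightedAvg
FROM Weighted
GROUP BY DeviceId, HourBin;
```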


Implementing a Star Schema in a Microsoft Fabric Lakehouse

Nikola Ilic builds a lakehouse:

But what is a star schema in the first place? I have good news and bad news for you :) …The bad news is that I'm not covering it in this article, because this one focuses on explaining how to implement a star schema in a Fabric Lakehouse (assuming that you already know what a star schema is). The good news is: I've already written about it, so go and read this article first if you're not sure what a star schema represents in the world of data modeling…

In one of my previous articles, I also showed how to implement a star schema in Power BI by leveraging Power Query Editor.

Now, let’s get our hands dirty and build a star schema by using PySpark in the Fabric notebook!

Click through to see how.
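
Nikola builds his with PySpark in a Fabric notebook; purely as an illustration of the target shape, here's a hypothetical star schema sketched in T-SQL terms, with a central fact table keyed to the surrounding dimensions (a Lakehouse won't enforce these constraints, but the structure is the same):

```sql
-- Dimension tables: descriptive attributes, one row per member.
CREATE TABLE dbo.DimDate     (DateKey INT PRIMARY KEY, CalendarDate DATE, MonthName VARCHAR(20));
CREATE TABLE dbo.DimProduct  (ProductKey INT PRIMARY KEY, ProductName VARCHAR(100), Category VARCHAR(50));
CREATE TABLE dbo.DimCustomer (CustomerKey INT PRIMARY KEY, CustomerName VARCHAR(100), Country VARCHAR(50));

-- Fact table: measures plus a foreign key to each dimension.
CREATE TABLE dbo.FactSales (
    DateKey     INT NOT NULL REFERENCES dbo.DimDate (DateKey),
    ProductKey  INT NOT NULL REFERENCES dbo.DimProduct (ProductKey),
    CustomerKey INT NOT NULL REFERENCES dbo.DimCustomer (CustomerKey),
    Quantity    INT NOT NULL,
    SalesAmount DECIMAL(18, 2) NOT NULL
);
```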


Managing Orphaned Users in SQL Server

Jordan Boich asks for more gruel:

Addressing orphaned users is an important piece of SQL Server security management. When left unchecked, you can accumulate an overwhelming number of users that exist in a database but do not have a corresponding login in the master database, preventing access to the server or the database at all. There are dbatools PowerShell modules out there that can help you accomplish what sp_FindOrphanedUser does. However, there are some situations where PowerShell may not be available to you, for a multitude of reasons, and having an extra tool in the toolbelt never hurt anyone, am I right?

Read on to learn about orphaned users and how the sp_FindOrphanedUser procedure works.
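
If you'd rather see the moving parts than call the procedure, the core check is a comparison of database users' SIDs against server logins' SIDs. A hedged sketch of that check (not sp_FindOrphanedUser itself):

```sql
-- Database users whose SID has no matching login on the instance.
SELECT dp.name AS OrphanedUser,
       dp.type_desc,
       dp.create_date
FROM sys.database_principals AS dp
LEFT JOIN sys.server_principals AS sp
    ON dp.sid = sp.sid
WHERE sp.sid IS NULL
  AND dp.type IN ('S', 'U')   -- SQL and Windows users
  -- Only users that are supposed to map to a server login; skips
  -- contained users and users without login, which lack one by design.
  AND dp.authentication_type_desc IN ('INSTANCE', 'WINDOWS');
```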
