
Author: Kevin Feasel

Time-Weighted Averages and Values in Azure Data Explorer

Adi Eldar shows off a few new functions:

Azure Data Explorer (ADX) supports time series aggregation at scale, either by the summarize operator, which keeps the aggregated data in tabular format, or by the make-series operator, which transforms it to a set of dynamic arrays. There are multiple aggregation functions, and among them avg() is one of the most popular. ADX calculates it by grouping the samples into fixed time bins and applying a simple average of all samples inside each time bin, regardless of their specific location inside the bin. This is the standard time bin aggregation as done by SQL and other databases. However, there are scenarios where a simple average doesn’t accurately represent the time bin value. For example, IoT devices commonly emit metric values asynchronously, only upon change, to conserve bandwidth. In that case we need to calculate a Time Weighted Average (TWA), taking into consideration the exact timestamp and duration of each value inside the time bin. ADX doesn’t have native aggregation functions to calculate time weighted averages, but we have just added a few User Defined Functions, part of the Functions Library, that support it:

Digging into how the weighted averages work, they’re pretty interesting.
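
As a concrete illustration of the underlying math, here is a minimal pandas sketch (the library’s actual functions are KQL UDFs; the device samples, values, and 30-minute bins below are made up for the example). Each change-only sample is carried forward and weighted by how long it remained in effect within its bin:

import pandas as pd

# Hypothetical change-only samples from one device; each value holds until the next emit.
events = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 00:05",
                          "2024-01-01 00:10", "2024-01-01 00:55"]),
    "value": [10.0, 20.0, 30.0, 40.0],
})
bin_size = "30min"
end = pd.Timestamp("2024-01-01 01:00")

# Add synthetic rows at the bin boundaries, carrying the last value forward,
# so every held value can be weighted by its duration inside each bin.
edges = pd.date_range(events["ts"].min().floor(bin_size), end, freq=bin_size)
series = (events.set_index("ts")["value"]
          .reindex(events.set_index("ts").index.union(edges))
          .ffill())
seconds = series.index.to_series().diff().shift(-1).dt.total_seconds()

bins = series.index.floor(bin_size)
twa = ((series * seconds).groupby(bins).sum() / seconds.groupby(bins).sum()).dropna()
print(twa)

For the first bin this yields 25.0, versus 20.0 for a simple average of its three samples, because the value 30 was in effect for two-thirds of the bin.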


Implementing a Star Schema in a Microsoft Fabric Lakehouse

Nikola Ilic builds a lakehouse:

But, what is a star schema in the first place? I have good and bad news for you :) … The bad news is that I’m not covering it in this article, because this one focuses on explaining how to implement a star schema in Fabric Lakehouse (assuming that you already know what a star schema is). The good news is: I’ve already written about it, so go and read this article first if you’re not sure what a star schema represents in the world of data modeling…

In one of my previous articles, I also showed how to implement a star schema in Power BI by leveraging the Power Query Editor.

Now, let’s get our hands dirty and build a star schema by using PySpark in the Fabric notebook!

Click through to see how.
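
If you want a feel for the pattern before clicking through, here is a minimal PySpark sketch of the core moves, using made-up staging table and column names rather than anything from Nikola’s post: de-duplicate the descriptive attributes into a dimension with a surrogate key, then swap that key into the fact table.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical flat sales extract already landed in the lakehouse.
sales = spark.read.table("staging_sales")

# Dimension: one row per customer, with a surrogate key.
# (A global window like this is fine for small dimensions; it collapses to one partition.)
dim_customer = (
    sales.select("CustomerID", "CustomerName", "CustomerCity")
         .dropDuplicates(["CustomerID"])
         .withColumn("CustomerKey", F.row_number().over(Window.orderBy("CustomerID")))
)

# Fact: replace the natural key with the surrogate key and keep the measures.
fact_sales = (
    sales.join(dim_customer.select("CustomerID", "CustomerKey"), "CustomerID")
         .select("CustomerKey", "OrderDate", "SalesAmount", "Quantity")
)

dim_customer.write.mode("overwrite").saveAsTable("dim_customer")
fact_sales.write.mode("overwrite").saveAsTable("fact_sales")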


Managing Orphaned Users in SQL Server

Jordan Boich asks for more gruel:

Addressing orphaned users is an important piece of SQL Server security management. Left unchecked, you can accumulate an overwhelming number of users that exist in a database but do not have a corresponding login in the master database, which prevents access to the server or the database entirely. There are dbatools PowerShell modules out there that can help you accomplish what sp_FindOrphanedUser does. However, there are some situations where PowerShell may not be available to you for a multitude of reasons, and having an extra tool in the toolbelt never hurt anyone, am I right?

Read on to learn about orphaned users and how the sp_FindOrphanedUser procedure works.
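
At its core, the check is a SID comparison between database users and server logins. Here is a minimal sketch of that idea in Python with pyodbc (not sp_FindOrphanedUser itself, and the connection string is a placeholder):

import pyodbc

# Hypothetical connection details; adjust server, database, and auth for your environment.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=myserver;"
    "DATABASE=MyDatabase;Trusted_Connection=yes;Encrypt=yes;"
)

# A database user is orphaned when its SID has no match in sys.server_principals.
query = """
SELECT dp.name AS orphaned_user
FROM sys.database_principals AS dp
    LEFT JOIN sys.server_principals AS sp
        ON dp.sid = sp.sid
WHERE sp.sid IS NULL
    AND dp.type IN ('S', 'U')                        -- SQL and Windows users
    AND dp.authentication_type_desc = 'INSTANCE';    -- excludes contained users
"""

for row in conn.cursor().execute(query):
    print(row.orphaned_user)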


Role-Playing Dimensions in Direct Lake

Chris Webb puts on a mustache and changes his shirt really quickly:

Note that the Sales fact table has two date columns, OrderDate and ShipDate.

If you create a Direct Lake semantic model using the Web Editor and add these two tables, you could rename the Date table to Order Date and build a relationship between it and the OrderDate column on the Sales table.

What about analysing by Ship Date though? You could create a physical copy of the Date table in your Lakehouse and add that to the model, but there’s another option.

Read on for that answer. Interesting that, as of right now, the primary way to do this is with third-party software.


Boosting versus Bagging in Tree Models

Vinod Chugani compares two techniques for working with trees:

Ensemble learning techniques primarily fall into two categories: bagging and boosting. Bagging improves stability and accuracy by aggregating independent predictions, whereas boosting sequentially corrects the errors of prior models, improving their performance with each iteration. This post begins our deep dive into boosting, starting with the Gradient Boosting Regressor. Through its application on the Ames Housing Dataset, we will demonstrate how boosting uniquely enhances models, setting the stage for exploring various boosting techniques in upcoming posts.

Read on for more information. The neat part about the “boosting versus bagging” debate is that both techniques are quite useful. Although boosting (via algorithms like XGBoost or LightGBM) is the more popular technique, bagging (random forest) is extremely powerful in its own right.
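
If you’d like to try the comparison yourself, here is a minimal scikit-learn sketch on synthetic data (the post itself uses the Ames Housing Dataset); the hyperparameters are illustrative rather than tuned:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a housing dataset.
X, y = make_regression(n_samples=2000, n_features=20, noise=25.0, random_state=42)

models = {
    # Bagging: many deep trees trained independently on bootstrap samples.
    "Random forest (bagging)": RandomForestRegressor(n_estimators=300, random_state=42),
    # Boosting: shallow trees fit sequentially to the previous trees' errors.
    "Gradient boosting (boosting)": GradientBoostingRegressor(
        n_estimators=300, learning_rate=0.05, max_depth=3, random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")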


Rebuilding a Transaction Log

David Fowler fixes a large-scale oopsie:

“Could you help me, we deleted the database’s transaction log file and now that database is stuck in ‘Recovery Pending’?”

This was a panicked call that I received a few weeks ago.

“Sure, no problem,” said I, “we’ll have to restore back to your last backup.”

And then things went silent for a while before the inevitable, “it’s only a development database, we don’t take backups”.

I can feel the face-palm from here. Read on to learn what you can do if you’re in that situation, as well as David’s important note about taking backups so that you don’t end up there in the first place.


Connecting to Azure Storage from SSIS

Andy Brownsword makes a connection:

Migrating to the cloud can be disruptive to existing processes. Moving storage to Azure isn’t a simple configuration change for SSIS packages.

SSIS doesn’t have native connections for Azure. That doesn’t mean we need to completely re-engineer the process or change technology, though.

How can we take the simple package below and move to using Azure storage?

Read on for the answer. Also, I am 100% on Team SAS Token. They are easy to create and give you a lot of control over who gets access to what.
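
As an illustration of that control, here is a minimal sketch of generating a read-and-list-only container SAS with the azure-storage-blob Python SDK; the account and container names are placeholders:

from datetime import datetime, timedelta, timezone
from azure.storage.blob import ContainerSasPermissions, generate_container_sas

# Hypothetical account details; the key comes from the storage account in the Azure portal.
account_name = "mystorageaccount"
container_name = "ssis-input"
account_key = "<storage-account-key>"

# Read/list only, expiring in 7 days: enough for an SSIS package to
# enumerate and download files, and nothing more.
sas_token = generate_container_sas(
    account_name=account_name,
    container_name=container_name,
    account_key=account_key,
    permission=ContainerSasPermissions(read=True, list=True),
    expiry=datetime.now(timezone.utc) + timedelta(days=7),
)

print(f"https://{account_name}.blob.core.windows.net/{container_name}?{sas_token}")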
