August 2022 – Page 11

Anomaly detection poses several challenges. The first is the data science question of what an ‘anomaly’ looks like. Fortunately, machine learning has powerful tools to learn how to distinguish usual from anomalous patterns from data. In the case of anomaly detection, it is impossible to know what all anomalies look like, so it’s impossible to label a data set for training a machine learning model, even if resources for doing so are available. Thus, unsupervised learning has to be used to detect anomalies, where patterns are learned from unlabelled data.
Even with the perfect unsupervised machine learning model for anomaly detection figured out, in many ways, the real problems have only begun. What is the best way to put this model into production such that each observation is ingested, transformed and finally scored with the model, as soon as the data arrives from the source system? That too, in a near real-time manner or at short intervals, e.g. every 5-10 minutes? This involves building a sophisticated extract, load, and transform (ELT) pipeline and integrating it with an unsupervised machine learning model that can correctly identify anomalous records. Also, this end-to-end pipeline has to be production-grade, always running while ensuring data quality from ingestion to model inference, and the underlying infrastructure has to be maintained.

Click through to see their solution using Databricks and delta lake.

Comments closed

Forced Plans Lacking Force

Published 2022-08-10 by Kevin Feasel

Chad Callihan does not fall astray of Betteridge’s Law of Headlines:

Query Store is useful for forcing a beneficial plan. What if something changes that makes the forced plan impossible to use? Let’s look at an example of that today.

Read on for an example. The answer is, I believe, the best possible outcome given the circumstances—certainly better than a hard failure.

Comments closed

Power BI Desktop August 2022 Updates

Published 2022-08-10 by Kevin Feasel

Matt Allington looks at some recent updates to Power BI:

I’ve been pretty busy over the last few months. The demand for Power BI skills has never been stronger, and my company is super busy. I haven’t written a blog article for a while, but I wanted to take a bit of time out this morning to talk about the August 2022 update to Power BI Desktop. As Power BI matures, there is less and less to get excited about with a new release of Desktop, but there were a couple of things that caught my eye in this release, worthy of calling out.

Read on for a couple of quality of life improvements.

Comments closed

Power BI Culture Name errors

Published 2022-08-10 by Kevin Feasel

Gilbert Quevauvilliers is a cultured individual:

I was recently working with an overseas customer and when I tried to open the PBIX file I got the following error
“Culture Name ‘en-IL” is not valid or is not supported”

Read on to understand what the problem was as well as how Gilbert was able to fix it.

Comments closed

Finding Guidance on Power BI

Published 2022-08-10 by Kevin Feasel

James Serra puts together a compendium:

Recently there has been a number of great articles published on Power BI that I wanted to make you aware of that go beyond the features descriptions found in the Power BI documentation. These new articles fall under the Power BI guidance documentation and are designed to address common strategic patterns. Below is my summary of the articles, and check out Power BI guidance from the CAT by Matthew Roach for a more detailed summary.

If you’re interested in Power BI administration and strategic deployment, there’s a lot of good information here.

Comments closed

Calculating Working Days using DAX

Published 2022-08-10 by Kevin Feasel

Chen Hirsh counts down the days:

Recently Microsoft announced a new DAX function to calculate working days, meaning days between 2 dates, ignoring weekends and (optionally) holidays.

Click through for more information on this function and how to define your working days.

Comments closed

Formatting Binary LSN Values

Published 2022-08-10 by Kevin Feasel

Michael J. Swart does a bit of shuffling:

Typically as developers, we don’t care about these values. But when we do want to dig into the transaction log, we can do so with sys.fn_dblog which takes two optional parameters. These parameters are LSN values which limit the results of sys.fn_dblog. But the weird thing is that sys.fn_dblogis a function whose LSN parameters are NVARCHAR(25).
The function sys.fn_dblog doesn’t expect binary(10) values for its LSN parameters, it wants the LSN values as a formatted string, something like: 0x00000029:00001a3c:0002.

Never fear, though: Michael’s got us covered. Click through for a conversion function.

Comments closed

Creating Identity Columns in Databricks

Published 2022-08-09 by Kevin Feasel

Franco Patano generates some identity integers:

Identity columns solve the issues mentioned above and provide a simple, performant solution for generating surrogate keys. Delta Lake is the first data lake protocol to enable identity columns for surrogate key generation.
Delta Lake now supports creating IDENTITY columns that can automatically generate unique, auto-incrementing ID numbers when new rows are loaded. While these ID numbers may not be consecutive, Delta makes the best effort to keep the gap as small as possible. You can use this feature to create surrogate keys for your data warehousing workloads easily.

This is a bit light on explanation, unfortunately. With distributed systems, generating identities is historically tricky (especially with several independent nodes generating values) so I’d be curious to see how it works: do they allocate blocks of IDs to worker nodes or do something else? And are the IDs guaranteed to be monotonically increasing? Or is there some other service which “labels” the data upon insert and provides those IDs?

Comments closed

Understanding Decision Trees

Published 2022-08-09 by Kevin Feasel

Durgesh Gupta provides a primer on the humble decision tree:

A decision tree is a graphical representation of all possible solutions to a decision.
The objective of using a Decision Tree is to create a training model that can use to predict the class or value of the target variable by learning simple decision rules inferred from training data.
It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome.

The way I like to describe decision trees, especially to developers, is that a tree is a set of if-else statements which leads to a conclusion. The nice part about decision trees is that once you understand how they work, you’re halfway there to gradient boosting (e.g., XGBoost) and random forests.

Comments closed

Date Arithmetic in KQL

Published 2022-08-09 by Kevin Feasel

Robert Cain continues a series on KQL:

Performing DateTime arithmetic in Kusto is very easy. You simply take one DateTime data type object and apply standard math to it, such as addition, subtraction, and more. In this post we’ll see some examples of the most common DateTime arithmetic done when authoring KQL.

Read on for several examples of how it all works.

Comments closed

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31

Month: August 2022

Anomaly Detection over Delta Live Tables

Forced Plans Lacking Force

Power BI Desktop August 2022 Updates

Power BI Culture Name errors

Finding Guidance on Power BI

Calculating Working Days using DAX

Formatting Binary LSN Values

Creating Identity Columns in Databricks

Understanding Decision Trees

Date Arithmetic in KQL