Press "Enter" to skip to content

Author: Kevin Feasel

Inlining KQL in Power Query

Chris Webb shows you how you can include KQL query fragments in Power Query:

If the title wasn’t enough to warn you, this post is only going to be of interest to M ultra-geeks and people using Power BI with Azure Data Explorer – and I know there aren’t many people in either group. However I thought the feature I’m going to show you in this post is so cool I couldn’t resist blogging about it.

Limited in its utility, but still quite interesting.


Using PowerShell to Automate Azure Databricks Processes

Tomaz Kastrun continues a series on Databricks:

Yesterday we looked into bringing the capabilities of Databricks closer to your client machine, making that coding, data wrangling, and data science a little bit more convenient.

Today we will look into deploying a Databricks workspace using PowerShell.

By the way, if PowerShell automation of Databricks tasks is of interest to you, also check out Gerhard Brueckl’s extension module for much more along those lines.

Also, I give Tomaz a lot of credit: most Advent calendars stop at 24 days but Tomaz laughs off such limitations.
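
The linked posts drive everything through PowerShell, which ultimately wraps REST calls against Azure and the workspace. Purely to give a flavor of what that sort of automation does under the hood, here is a minimal Python sketch that lists the clusters in an existing workspace via the Databricks REST API; the workspace URL and token are placeholders, and this is a generic illustration rather than anything from the posts themselves:

```python
import requests

# Placeholders: substitute your own workspace URL and a personal access token.
workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"
token = "dapi-REPLACE-ME"

# The Databricks REST API authenticates with a bearer token.
response = requests.get(
    f"{workspace_url}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
response.raise_for_status()

# The response is JSON; the "clusters" key may be absent if the workspace has none.
for cluster in response.json().get("clusters", []):
    print(cluster["cluster_id"], cluster["cluster_name"], cluster["state"])
```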


Another Batch of ETL Antipatterns

Tim Mitchell wraps up a series on ETL antipatterns with three posts. The first one is about not testing the ETL process:

Building ETL processes is quite easy. Building ETL processes that deliver accurate results as quickly as possible is substantially more difficult. Modern ETL tools (including my personal favorite, SQL Server Integration Services) make it deceptively easy to create a simple load process. That’s a good thing, because an easy-to-understand front end shortens the timeline of going from zero to first results.

The challenge with such a low bar to entry is that some folks will stop refining the process when the load process is successful.

The second post looks at processes which don’t scale:

With very few exceptions, data volume will increase over time. Even when using an incremental load pattern, the most common trend is for the net data (new + changed) to increase with time. Even with steady, linear changes, it’s possible to outgrow the ETL design or system resources. With significant data explosion – commonly occurring in corporate acquisitions, data conversions, or rapid company growth – the ETL needs can quickly outrun the capacity.

Refactoring ETL for significant data growth isn’t always as simple as throwing more resources at the problem. Building ETL for proper scaling requires not just hefty hardware or service tiers; it requires good underlying data movement and transformation patterns that allow for larger volumes of data.
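
For anyone unfamiliar with the term, an incremental load pattern pulls only the rows that are new or changed since the previous run, typically by tracking a high-water mark between runs. Here is a minimal sketch of the idea in Python; the table names, columns, and SQLite connection are made up purely for illustration:

```python
import sqlite3

# Hypothetical example: load only rows modified since the last successful run.
conn = sqlite3.connect("warehouse.db")

# Read the stored high-water mark (the latest modified date already loaded).
last_loaded = conn.execute(
    "SELECT high_water_mark FROM etl_watermarks WHERE table_name = 'orders'"
).fetchone()[0]

# Extract only the net-new and changed rows since that point.
new_rows = conn.execute(
    "SELECT order_id, amount, modified_date FROM source_orders WHERE modified_date > ?",
    (last_loaded,),
).fetchall()

# ... transform and load new_rows into the target table here ...

# Advance the watermark so the next run starts where this one stopped.
if new_rows:
    conn.execute(
        "UPDATE etl_watermarks SET high_water_mark = ? WHERE table_name = 'orders'",
        (max(row[2] for row in new_rows),),
    )
    conn.commit()
```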

The final post implores us to think of the documentation:

Documentation is an asset that is both loathed and loved. Creating technical and business documentation is often looked upon as a tedious chore, something that really ought to be done for every project but is often an easy candidate to push until later (or skip entirely).

On the other hand, good documentation – particularly around data movement and ETL processes – is as valuable as the processes it describes. A clear and up-to-date document describing the what, when, where, and why of an ETL workflow adds transparency and makes the process much easier to understand for those who support it.

This has been an enjoyable series from Tim, so if you haven’t already, do check it out.


Collecting Database Name with Extended Events

Aaron Bertrand has a philosophical dilemma:

Now, in order to capture this data to DDLEventLog, we have a background process running on a schedule that extracts all the new data from the latest .xel files and loads it into the table. (There’s also a process that moves files we know we’ve consumed and deletes them 90 days later to keep the loading job linear.)

This revealed an edge case where it is possible to lose one aspect of the data: database name. When originally building the session, I thought database_id would be enough because the loading process could always look up the name from there. But this is brittle. Imagine the case where a user drops a table, drops the database, then the loading process pulls data from the session. That data about dropping the table contains just a database_id that no longer exists.

Read on for Aaron’s thoughts and ultimate decision.


Retrieving Azure Log Analytics Data using Azure Data Factory

Meagan Longoria needs to move some log data around:

For this project, we have several Azure SQL Databases configured to send logs and metrics to a Log Analytics workspace. You can execute KQL queries against the workspace in the Log Analytics user interface in the Azure Portal, a notebook in Azure Data Studio, or directly through the API. The resulting format of the data downloaded from the API leaves something to be desired (it’s like someone shoved a CSV inside a JSON document), but it’s usable after a bit of parsing based upon column position. Just be sure your KQL query actually states the columns and their order (this can be done using the Project operator).

Click through for an example of moving this resultant data into Azure Storage.
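
To make the “CSV shoved inside a JSON document” description concrete: the Log Analytics query API returns each result table as a list of column definitions plus positional row arrays. Here is a rough Python sketch of flattening that into dictionaries; the sample payload below is trimmed and hypothetical:

```python
import json

# Trimmed, hypothetical example of the tables/columns/rows shape
# returned by the Log Analytics query API.
payload = json.loads("""
{
  "tables": [
    {
      "name": "PrimaryResult",
      "columns": [
        {"name": "TimeGenerated", "type": "datetime"},
        {"name": "DatabaseName", "type": "string"},
        {"name": "avg_cpu_percent", "type": "real"}
      ],
      "rows": [
        ["2020-12-28T10:00:00Z", "SalesDb", 12.5],
        ["2020-12-28T10:05:00Z", "SalesDb", 14.1]
      ]
    }
  ]
}
""")

# Pair each positional row with the column names, so downstream code depends
# only on the column order your KQL 'project' clause defines.
table = payload["tables"][0]
column_names = [col["name"] for col in table["columns"]]
records = [dict(zip(column_names, row)) for row in table["rows"]]

for record in records:
    print(record)
```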


Data Modeling Essentials in Power BI

Paul Turley continues a series on doing Power BI the right way:

One of the most important lessons I have learned about data modeling over the past 20+ years is that there isn’t one model that fits all business needs. However, there are simple patterns we can follow to model data according to different business needs. Each pattern or schema has a different set of rules. At a very high level, we can think of these schemas in three different categories.

This is the 101 level course, but it’s good to get a refresher on the fundamentals before jumping into the complicated part.


Spark Streaming in a Databricks Notebook

Tomaz Kastrun shows off Spark Streaming in a Databricks notebook:

Spark Streaming is the process that can analyse not only batches of data but also streams of data in near real time. It enables powerful interactive and analytical applications across both hot and cold data (streaming data and historical data). Spark Streaming is fault tolerant: thanks to the lineage of operations, Spark will always remember where you stopped, and in case of a worker failure, another worker can always recreate all the data transformations from the partitioned RDD (assuming that all the RDD transformations are deterministic).

Click through for the demo.
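
For a flavor of what a minimal streaming job looks like in a notebook, here is a rough PySpark Structured Streaming sketch using the built-in rate source; this is a generic illustration, not Tomaz’s demo:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

# In a Databricks notebook the `spark` session already exists; building it here
# just keeps the sketch self-contained.
spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# The built-in "rate" source generates (timestamp, value) rows, which is handy for demos.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Aggregate events into 30-second windows, a typical near-real-time transformation.
counts = stream.groupBy(window(stream.timestamp, "30 seconds")).count()

# Write the running aggregates to an in-memory table that can be queried with SQL.
query = (
    counts.writeStream.outputMode("complete")
    .format("memory")
    .queryName("rate_counts")
    .start()
)

query.awaitTermination(60)  # let it run briefly; in a notebook you would leave it running
spark.sql("SELECT * FROM rate_counts ORDER BY window").show(truncate=False)
```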


Visualization and the Value of Expectations

Alex Velez thinks about violating expectations in visuals:

This isn’t to say we should never deviate from normal graphing conventions, but we should have a good reason for doing so—a reason that makes up for any unintended consequences. 

What other design decisions might also take our audience by surprise—going against normal graphing expectations? I’ll outline a few. 

Click through for examples. One thing not explicitly brought up is that we follow conventions to reduce the amount of thought needed to understand something. For circumstances in which there’s a major benefit, you might want to run that risk. Also, there’s an argument in here that, at some point, it’s better to have something radically different than marginally different.


PASS: the End of an Era

Mala Mahadevan reflects on 22 years of association with PASS:

I finally decided I would write about the lessons I’ve learned in my 22-year association with them. This is necessary for me to move on and may be worth reading for those who think similarly.

There is the common line that PASS is not the #sqlfamily, and that line is currently true. But back in those days, it was. At least it was our introduction to the community commonly known as #sqlfamily. So many lessons here are in fact lessons in dealing with and living with community issues.

Read on to learn from Mala.
