Press "Enter" to skip to content

Day: December 28, 2020

Using PowerShell to Automate Azure Databricks Processes

Tomaz Kastrun continues a series on Databricks:

Yesterday we looked into bringing the capabilities of Databricks closer to your client machine, making coding, data wrangling, and data science a little bit more convenient.

Today we will look into deploying a Databricks workspace using PowerShell.

By the way, if PowerShell automation of Databricks tasks is of interest to you, also check out Gerhard Brueckl’s extension module for much more along those lines.

Also, I give Tomaz a lot of credit: most Advent calendars stop at 24 days but Tomaz laughs off such limitations.

Another Batch of ETL Antipatterns

Tim Mitchell wraps up a series on ETL antipatterns with three posts. The first one is about not testing the ETL process:

Building ETL processes is quite easy. Building ETL processes that deliver accurate results as quickly as possible is substantially more difficult. Modern ETL tools (including my personal favorite, SQL Server Integration Services) make it deceptively easy to create a simple load process. That’s a good thing, because an easy-to-understand front end shortens the timeline of going from zero to first results.

The challenge with such a low bar to entry is that some folks will stop refining the process when the load process is successful.
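
To make “testing the ETL process” a bit more concrete, even a basic source-to-destination reconciliation check catches a lot. Here is a minimal sketch in Python; the connection strings, driver, and table names are placeholders for illustration, not anything from Tim’s post:

```python
# Minimal ETL reconciliation sketch: compare source and destination row counts
# after a load. Connection strings and table names are hypothetical.
import pyodbc

SOURCE = "Driver={ODBC Driver 17 for SQL Server};Server=src-sql;Database=Sales;Trusted_Connection=yes;"
TARGET = "Driver={ODBC Driver 17 for SQL Server};Server=dw-sql;Database=SalesDW;Trusted_Connection=yes;"

def row_count(conn_str: str, query: str) -> int:
    """Run a COUNT(*)-style query and return the single integer result."""
    with pyodbc.connect(conn_str) as conn:
        return conn.cursor().execute(query).fetchone()[0]

source_rows = row_count(SOURCE, "SELECT COUNT(*) FROM dbo.Orders WHERE OrderDate >= '2020-12-01'")
target_rows = row_count(TARGET, "SELECT COUNT(*) FROM dbo.FactOrders WHERE OrderDate >= '2020-12-01'")

# A successful package run is not the same thing as an accurate one.
if source_rows != target_rows:
    raise ValueError(f"Reconciliation failed: {source_rows} source rows vs {target_rows} loaded rows")
print(f"Reconciliation passed: {source_rows} rows")
```

Row counts alone won’t prove accuracy, but they’re a cheap first gate before the load process gets declared done.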

The second post looks at processes which don’t scale:

With very few exceptions, data volume will increase over time. Even when using an incremental load pattern, the most common trend is for the net data (new + changed) to increase with time. Even with steady, linear changes, it’s possible to outgrow the ETL design or system resources. With significant data explosion – commonly occurring in corporate acquisitions, data conversions, or rapid company growth – the ETL needs can quickly outrun the capacity.

Refactoring ETL for significant data growth isn’t always as simple as throwing more resources at the problem. Building ETL for proper scaling requires not just hefty hardware or service tiers; it requires good underlying data movement and transformation patterns that allow for larger volumes of data.
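
As a rough illustration of the kind of incremental load pattern Tim mentions, here is a watermark-driven extract sketched in Python. The control table, object names, and connection string are hypothetical, and a real implementation would add batching, upserts, and error handling:

```python
# Sketch of a watermark-based incremental load: pull only rows changed since
# the last successful run instead of reloading the full table every time.
# All object names (ETL.Watermark, dbo.Orders, stg.Orders) are hypothetical,
# and source and staging share one connection just to keep the sketch short.
import pyodbc

conn = pyodbc.connect("Driver={ODBC Driver 17 for SQL Server};Server=dw-sql;Database=SalesDW;Trusted_Connection=yes;")
cur = conn.cursor()

# 1. Read the high-water mark recorded by the previous load.
cur.execute("SELECT LastModified FROM ETL.Watermark WHERE TableName = 'dbo.Orders'")
last_watermark = cur.fetchone()[0]

# 2. Extract only the net-new and changed rows from the source.
cur.execute(
    "SELECT OrderID, CustomerID, Amount, ModifiedDate "
    "FROM dbo.Orders WHERE ModifiedDate > ?",
    last_watermark,
)
changed_rows = cur.fetchall()

# 3. Land the delta in a staging table; a merge/upsert into the target follows.
cur.executemany(
    "INSERT INTO stg.Orders (OrderID, CustomerID, Amount, ModifiedDate) VALUES (?, ?, ?, ?)",
    [tuple(r) for r in changed_rows],
)

# 4. Advance the watermark only after the load succeeds, so a failed run
#    simply reprocesses the same window on the next attempt.
if changed_rows:
    new_watermark = max(r.ModifiedDate for r in changed_rows)
    cur.execute("UPDATE ETL.Watermark SET LastModified = ? WHERE TableName = 'dbo.Orders'", new_watermark)

conn.commit()
```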

The final post implores us to think of the documentation:

Documentation is an asset that is both loathed and loved. Creating technical and business documentation is often looked upon as a tedious chore, something that really ought to be done for every project but is often an easy candidate to push until later (or skip entirely).

On the other hand, good documentation – particularly around data movement and ETL processes – is as valuable as the processes it describes. A clear and up-to-date document describing the what, when, where, and why of an ETL workflow adds transparency and makes the process much easier to understand for those who support it.

This has been an enjoyable series from Tim, so if you haven’t already, do check it out.

Collecting Database Name with Extended Events

Aaron Bertrand has a philosophical dilemma:

Now, in order to capture this data to DDLEventLog, we have a background process running on a schedule that extracts all the new data from the latest .xel files and loads it into the table. (There’s also a process that moves files we know we’ve consumed and deletes them 90 days later to keep the loading job linear.)

This revealed an edge case where it is possible to lose one aspect of the data: database name. When originally building the session, I thought database_id would be enough because the loading process could always look up the name from there. But this is brittle. Imagine the case where a user drops a table, drops the database, then the loading process pulls data from the session. That data about dropping the table contains just a database_id that no longer exists.
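
For context, collecting the name directly usually means adding it as an action on the session’s events, so it gets resolved at capture time rather than at load time. The sketch below is a generic illustration (the session name, events, and file target are invented here), not necessarily the route Aaron lands on:

```python
# Illustrative only: a DDL-auditing Extended Events session that captures
# database_name as an action, so the loader doesn't have to resolve a
# database_id that may no longer exist. Session name, events, and target
# are hypothetical.
import pyodbc

DDL_SESSION = """
CREATE EVENT SESSION [DDLEventCapture] ON SERVER
ADD EVENT sqlserver.object_altered
    (ACTION (sqlserver.database_name, sqlserver.sql_text, sqlserver.username)),
ADD EVENT sqlserver.object_deleted
    (ACTION (sqlserver.database_name, sqlserver.sql_text, sqlserver.username))
ADD TARGET package0.event_file (SET filename = N'DDLEventCapture.xel')
WITH (STARTUP_STATE = ON);
"""

with pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};Server=.;Database=master;Trusted_Connection=yes;",
    autocommit=True,
) as conn:
    conn.execute(DDL_SESSION)
    conn.execute("ALTER EVENT SESSION [DDLEventCapture] ON SERVER STATE = START;")
```

Resolving the name at capture time costs a little per event, but it removes the dependency on the database still existing when the loading job runs.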

Read on for Aaron’s thoughts and ultimate decision.

Data Modeling Essentials in Power BI

Paul Turley continues a series on doing Power BI the right way:

One of the most important lessons I have learned about data modeling over the past 20+ years is that there isn’t one model that fits all business needs. However, there are simple patterns we can follow to model data according to different business needs. Each pattern or schema has a different set of rules. At a very high level, we can think of these schemas in three different categories.

This is the 101 level course, but it’s good to get a refresher on the fundamentals before jumping into the complicated part.

Retrieving Azure Log Analytics Data using Azure Data Factory

Meagan Longoria needs to move some log data around:

For this project, we have several Azure SQL Databases configured to send logs and metrics to a Log Analytics workspace. You can execute KQL queries against the workspace in the Log Analytics user interface in the Azure Portal, a notebook in Azure Data Studio, or directly through the API. The resulting format of the data downloaded from the API leaves something to be desired (it’s like someone shoved a CSV inside a JSON document), but it’s usable after a bit of parsing based upon column position. Just be sure your KQL query actually states the columns and their order (this can be done using the Project operator).
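
If you want to see that shape outside of Data Factory, here is a rough Python sketch of calling the query API and pivoting the positional rows back into records. The workspace ID and KQL query are placeholders, and the token handling is simplified:

```python
# Sketch of querying the Log Analytics API and reassembling the positional
# rows into dictionaries keyed by column name. Workspace ID, query, and token
# acquisition are placeholders for illustration.
import requests
from azure.identity import DefaultAzureCredential

WORKSPACE_ID = "00000000-0000-0000-0000-000000000000"  # hypothetical workspace
KQL_QUERY = """
AzureDiagnostics
| where TimeGenerated > ago(1d)
| project TimeGenerated, Resource, Category, OperationName
"""

# The API wants a bearer token scoped to the Log Analytics endpoint.
token = DefaultAzureCredential().get_token("https://api.loganalytics.io/.default").token

response = requests.post(
    f"https://api.loganalytics.io/v1/workspaces/{WORKSPACE_ID}/query",
    headers={"Authorization": f"Bearer {token}"},
    json={"query": KQL_QUERY},
    timeout=60,
)
response.raise_for_status()

# The payload is the "CSV shoved inside a JSON document" Meagan describes:
# each table carries a list of column metadata and a list of positional rows.
table = response.json()["tables"][0]
column_names = [col["name"] for col in table["columns"]]
records = [dict(zip(column_names, row)) for row in table["rows"]]

for record in records[:5]:
    print(record)
```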

Click through for an example of moving this resultant data into Azure Storage.
