
Category: ETL / ELT

Incremental Loading in ETL

Tim Mitchell gives us the wherefore around incremental loads:

When moving data in an extraction, transformation, and loading (ETL) process, the most efficient design pattern is to touch only the data you must, copying just the data that was newly added or modified since the last load was run. This pattern of incremental loads usually presents the least amount of risk, takes less time to run, and preserves the historical accuracy of the data.

In this post, I’ll share what an incremental load is and why it is the ideal design for most ETL processes.

“Move less data rather than more data” is how I’d put it, but Tim explains it much better than I do.
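If you want to see the shape of the pattern, here is a minimal sketch of a watermark-driven incremental load. It assumes a ModifiedDate audit column and a load-control table; every table, column, and connection name below is illustrative rather than anything from Tim's post.

```python
import pyodbc

# All names here (LoadControl, Orders, ModifiedDate) are illustrative.
conn = pyodbc.connect("DSN=SourceDb")  # placeholder connection
cur = conn.cursor()

# 1. Read the high-water mark recorded by the previous successful load.
cur.execute(
    "SELECT LastLoadedModifiedDate FROM dbo.LoadControl WHERE TableName = ?",
    "Orders",
)
last_watermark = cur.fetchone()[0]

# 2. Extract only the rows added or modified since that watermark.
cur.execute(
    "SELECT OrderID, CustomerID, Amount, ModifiedDate "
    "FROM dbo.Orders WHERE ModifiedDate > ?",
    last_watermark,
)
changed_rows = cur.fetchall()

# ... land changed_rows in the target (staging table + merge, bulk insert, etc.) ...

# 3. Advance the watermark only after the load succeeds, so a failed run
#    simply reprocesses the same slice next time.
if changed_rows:
    new_watermark = max(row.ModifiedDate for row in changed_rows)
    cur.execute(
        "UPDATE dbo.LoadControl SET LastLoadedModifiedDate = ? WHERE TableName = ?",
        new_watermark,
        "Orders",
    )
conn.commit()
```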


Using Dynamic Datasets in Power BI

Jose Mendes implements the screening pattern in Power BI:

I recently came across the need to build a screening pattern based on the Kimball “Screening” concept. One of the desired outputs was a Power BI report that allowed a data steward to easily identify the failed screen, drill down to the detail and show the row(s) rejected by the data quality rule. To achieve this goal, I had to mimic in Power BI an SSRS functionality called dynamic dataset, which allowed me to, using a single matrix, show different source columns based on a selected screen. Feeling curious already? Then, let’s dive into the details.

A screen is designed to operate on a single input file or database table and contains the data quality condition to check. For example, there could be a Missing Customer Postcode screen which would test for any customers who are missing a postcode.

Read on for an implementation.
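To make the screen concept concrete, here is a toy rendering of the “Missing Customer Postcode” screen in Python. It is only an illustration of the idea, not anything from Jose's Power BI implementation, and the sample data is made up.

```python
import pandas as pd

# Toy customer extract; in the real pattern the screen runs against the
# source table it was designed for.
customers = pd.DataFrame(
    {
        "CustomerID": [101, 102, 103],
        "Postcode": ["2000", None, "3141"],
    }
)

def missing_postcode_screen(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows rejected by the 'Missing Customer Postcode' screen."""
    failed = df[df["Postcode"].isna()].copy()
    failed["ScreenName"] = "Missing Customer Postcode"
    return failed

rejected = missing_postcode_screen(customers)
print(rejected)  # the detail rows a data steward would drill down to
```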


Azure Data Factory and Key Vault References

Gerhard Brueckl shows how we can get around a limitation in the Azure Data Factory user interface:

As you can see, the setting “AccessToken” can use a Key Vault reference, whereas settings like “Databricks Workspace URL” and “Cluster” do not support them. This is usually fine because the guys at Microsoft also thought about this and support Key Vault references for the settings that are actually security-relevant or sensitive. Also, providing the option to use Key Vault references everywhere would flood the GUI. So this is just fine.

But there can be good reasons why you want to get values from the Key Vault for non-sensitive settings too, especially when it comes to CI/CD and multiple environments. From my experience, when you implement a bigger ADF project, you will probably have a Key Vault for your sensitive settings, and all other values are provided during the deployment via ARM parameters.

So you will end up with a mix of Key Vault references and ARM template parameters, which very likely will be derived from the Key Vault at some point anyway. To solve this, you can modify the JSON of an ADF linked service directly and inject Key Vault references into almost every property!

Click through to see how that works, as well as the ramifications.
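As a rough illustration of the kind of edit Gerhard describes, the sketch below patches a Key Vault reference into the workspace URL property of a Databricks linked service JSON file. The file path, vault name, and secret name are all assumptions for the example.

```python
import json

# Illustrative path and names; the real file lives in your ADF Git repo.
path = "linkedService/AzureDatabricks1.json"

with open(path) as f:
    linked_service = json.load(f)

# Swap the plain-text workspace URL for a Key Vault reference, so the value
# is resolved from the vault at runtime instead of an ARM template parameter.
linked_service["properties"]["typeProperties"]["domain"] = {
    "type": "AzureKeyVaultSecret",
    "store": {"referenceName": "MyKeyVault", "type": "LinkedServiceReference"},
    "secretName": "Databricks-Workspace-Url",
}

with open(path, "w") as f:
    json.dump(linked_service, f, indent=4)
```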


Comparing Integration Services and Azure Data Factory

Tim Mitchell compares SQL Server Integration Services to Azure Data Factory:

For the better part of 15 years, SQL Server Integration Services (SSIS) has been the go-to enterprise extract-transform-load (ETL) tool for shops running on Microsoft SQL Server. More recently, Microsoft added Azure Data Factory (ADF) to its stable of enterprise ETL tools. In this post, I’ll be comparing SSIS and Azure Data Factory to share how they are alike and how they differ. I’ll also review the strengths and shortcomings of each, including the architectures in which each of these is likely to do well.

Read on for Tim’s thoughts on the subject. Tim lays out his biases up-front but also gives you a good feel for where both products are in their lifecycles.


Incremental Pipeline Development with Azure Data Factory

Andy Leonard shows how you can incrementally develop Azure Data Factory pipelines:

A friend pinged me recently to ask about rolling back Azure Data Factory (ADF) pipeline versions. My response was a question: Are you using source control with ADF? That did not help the current situation.

I thought of the way I often build ADF pipelines and shared my methodology, which is relatively simple (it has to be simple for me to understand it!):

Click through for Andy’s approach.


Trust and Warehouse Data

Rob Farley explains one way that people might lose trust in your warehouse data:

The scenario is that there’s a source system, and there’s a table in a warehouse that is being used to report on it. Maybe it’s being populated by Integration Services or Data Factory. Maybe it’s being populated by T-SQL. I don’t really care. What I care about is whether the data in the warehouse is a true representation of what’s in the source system.

If it’s not a true representation, then we have all kinds of problems.

Mostly, that our warehouse is rubbish.

Read on for an example of how this might occur and what you can do to prevent it.
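One simple guard against this kind of drift is a reconciliation check that compares row counts and a basic aggregate between the source system and the warehouse. Here is a minimal sketch; the connection strings, table names, and Amount column are all assumptions for the example, not anything from Rob's post.

```python
import pyodbc

# Placeholder connections, tables, and aggregate column.
source = pyodbc.connect("DSN=SourceSystem")
warehouse = pyodbc.connect("DSN=Warehouse")

def counts_and_total(conn, table):
    """Row count plus a simple aggregate, used to compare the two sides."""
    row = conn.cursor().execute(
        f"SELECT COUNT(*) AS Cnt, COALESCE(SUM(Amount), 0) AS Total FROM {table}"
    ).fetchone()
    return row.Cnt, row.Total

src = counts_and_total(source, "dbo.Sales")
dwh = counts_and_total(warehouse, "dw.FactSales")

if src != dwh:
    # Better to fail loudly than to report on data nobody should trust.
    raise ValueError(f"Warehouse does not match source: source={src}, warehouse={dwh}")
```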


Refreshing Power BI Datasets in Azure Data Factory

Meagan Longoria shows us how to refresh a Power BI dataset using Azure Data Factory:

I recently needed to ensure that a Power BI imported dataset would be refreshed after populating data in my data mart. I was already using Azure Data Factory to populate the data mart, so the most efficient thing to do was to call a pipeline at the end of my data load process to refresh the Power BI dataset.

Power BI offers REST APIs to programmatically refresh your data. For Data Factory to use them, you need to register an app (service principal) in AAD and give it the appropriate permissions in Power BI and to an Azure key vault.

Click through for the solution.
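Meagan drives the refresh from a Data Factory pipeline, but if you are curious what the underlying REST call looks like, here is a hedged Python sketch using a service principal against the documented refresh endpoint. The IDs and secret are placeholders, and in practice you would pull them from Key Vault.

```python
import msal
import requests

# Placeholders; in practice these come from the AAD app registration and,
# ideally, a key vault.
TENANT_ID = "<tenant-id>"
CLIENT_ID = "<service-principal-client-id>"
CLIENT_SECRET = "<client-secret>"
GROUP_ID = "<workspace-id>"
DATASET_ID = "<dataset-id>"

# Get an AAD token for the Power BI REST API as the service principal.
app = msal.ConfidentialClientApplication(
    CLIENT_ID,
    authority=f"https://login.microsoftonline.com/{TENANT_ID}",
    client_credential=CLIENT_SECRET,
)
token = app.acquire_token_for_client(
    scopes=["https://analysis.windows.net/powerbi/api/.default"]
)["access_token"]

# Queue a refresh of the dataset; the API answers 202 Accepted when queued.
url = (
    f"https://api.powerbi.com/v1.0/myorg/groups/{GROUP_ID}"
    f"/datasets/{DATASET_ID}/refreshes"
)
response = requests.post(url, headers={"Authorization": f"Bearer {token}"})
response.raise_for_status()
```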


Migrating From Cosmos DB to SQL Server

Eitan Blumin builds an app:

The general idea is this:

The app executes a Cosmos DB query and collects a number of records into its “buffer”.

Once that “buffer” reaches a certain number of records (configurable), it’s time to “flush” it into the SQL Server. That could be either a database table receiving a Bulk Copy stream or a stored procedure receiving a table-valued parameter (again, configurable).

After the buffer is flushed, we have the option to execute a “merge” procedure. This is a stored procedure that would implement an “upsert” logic from the “staging” table and into the actual destination table.

Read on for more explanation and check out Eitan’s GitHub repo.
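Eitan's tool is the real thing; the buffer-and-flush idea itself is easy to sketch, though. Here is a hypothetical Python rendering (not Eitan's app), with the Cosmos DB container, staging table, merge procedure, and batch size all made up for the illustration.

```python
import pyodbc
from azure.cosmos import CosmosClient

# Illustrative endpoints, credentials, and object names.
cosmos = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
container = cosmos.get_database_client("SourceDb").get_container_client("Orders")

sql = pyodbc.connect("DSN=TargetSqlServer", autocommit=False)
cursor = sql.cursor()
cursor.fast_executemany = True  # speeds up the batched INSERT below

BUFFER_SIZE = 1000  # "configurable" in Eitan's description
buffer = []

def flush(rows):
    """Flush the buffer into a staging table, then run the merge/upsert proc."""
    cursor.executemany(
        "INSERT INTO staging.Orders (OrderId, CustomerId, Amount) VALUES (?, ?, ?)",
        rows,
    )
    cursor.execute("EXEC dbo.MergeOrdersFromStaging")  # the "upsert" step
    sql.commit()

for doc in container.query_items(
    "SELECT c.id, c.customerId, c.amount FROM c",
    enable_cross_partition_query=True,
):
    buffer.append((doc["id"], doc["customerId"], doc["amount"]))
    if len(buffer) >= BUFFER_SIZE:
        flush(buffer)
        buffer.clear()

if buffer:  # don't forget the last partial batch
    flush(buffer)
```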


Deploying ADF via Azure DevOps

Kamil Nowinski has part two on a series about releasing Azure Data Factory code:

Struggling with #ADF deployment? adf_publish branch doesn’t suit your purposes? Don’t have skills with PowerShell? I have good news for you. There is a new tool on the market. It’s a task for the Azure DevOps Release Pipeline that deploys a whole ADF from code (JSON files) to an ADF instance in Azure. Behind the scenes, it runs a PowerShell module which does all the work for you.
Sounds unbelievable? But it’s real! Check it out for yourself.

Click through for the video.


ADF.Procfwk Version 1.8

Paul Andrew has been busy:

Following more great feedback from the Data Platform community the primary goal of this release was to further improve the resilience of the framework processing. These improvements included its restart clean up capabilities and introducing better dependency chain handling between Worker pipelines when failures occur. The latter builds on the existing restart functionality first introduced in release v1.2 and supplements the logic using a new set of pipeline dependency metadata. I’ve created the below visual to conceptually show the new dependency chain behaviour, should you wish to populate and make use of the new metadata handling.

Read on for the full changelog.
