Press "Enter" to skip to content

Category: ETL / ELT

Three Layers of Azure Data Factory Framework Components

Martin Schoombee continues a series on orchestration in Azure Data Factory:

Before we dive into the details of the Data Factory pipelines, it is worth explaining the conceptual structure of my framework and its components. How it all fits together is important, and after reading the post on the metadata as well, the pieces of the puzzle will hopefully start falling into place.

When I started thinking about what I’d like the framework to do, three conceptual layers started to emerge, and we’ll review them from the bottom up:

Click through for the description of each layer.
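
Since the excerpt cuts off before the layers themselves, here is a minimal sketch of what a bottom-up, three-layer orchestration framework can look like in general. The layer names and functions are my own illustrative assumptions, not Martin's definitions:

```python
# A hypothetical three-layer orchestration framework, reviewed bottom-up.
# Layer names (task, process, batch) are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class Task:
    """Bottom layer: a single unit of work, e.g. one copy activity."""
    name: str
    source: str
    target: str


def run_task(task: Task) -> None:
    print(f"{task.name}: copying {task.source} -> {task.target}")


def run_process(name: str, tasks: list[Task]) -> None:
    """Middle layer: runs the tasks that make up one logical process, in order."""
    print(f"starting process: {name}")
    for task in tasks:
        run_task(task)


def run_batch(processes: dict[str, list[Task]]) -> None:
    """Top layer: the single entry point that orchestrates every process."""
    for name, tasks in processes.items():
        run_process(name, tasks)


if __name__ == "__main__":
    run_batch({"sales": [Task("stg_orders", "crm.orders", "stg.orders")]})
```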


Metadata Tables and Azure Data Factory

Martin Schoombee brings back metadata tables:

The metadata that drives the execution within a framework is probably the most critical part. Going back to our analogy of building a house, the metadata would be the foundation. It is here where you are going to make some architectural decisions outside of which the framework cannot operate.

One such decision is how configurable or flexible you’d like the framework to be. In other words, how many attributes would you like to be dynamic and/or have the option to change during execution? It seems like an easy choice, and most engineers would lean towards “everything” or “as much as possible” as an answer. In reality, however, the trade-off is complexity: the more dynamic you make the framework, the more complicated it becomes. And you pay for that complexity later when you need to maintain it or add new functionality.

Read on to see how it all fits together.
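
To make the trade-off concrete, here is a hedged sketch of what a row of driver metadata might look like. The column names and the particular dynamic attributes (load type, watermark, enabled flag) are assumptions for illustration; Martin's posts define the actual schema:

```python
# Each row drives one extraction. Every attribute made dynamic is one more
# thing the pipelines must branch on at runtime -- the complexity trade-off.
pipeline_metadata = [
    {
        "source_system": "crm",
        "source_object": "dbo.Customers",
        "target_table": "stg.Customers",
        "load_type": "incremental",  # dynamic: full vs incremental
        "watermark": "2024-01-01",   # dynamic: last value loaded
        "enabled": True,             # dynamic: toggle without redeploying
    },
    {
        "source_system": "erp",
        "source_object": "dbo.Invoices",
        "target_table": "stg.Invoices",
        "load_type": "full",
        "watermark": None,
        "enabled": False,
    },
]

# The framework builds its work list from the metadata,
# not from hard-coded pipelines.
work_list = [m for m in pipeline_metadata if m["enabled"]]
print(work_list)
```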


A Scaffolding Design Pattern for Microsoft Fabric Pipelines

Andy Leonard shares some thoughts on design:

When assigned a project, it’s tempting – and dangerous – to Just Start Coding. If you suffer from the urge to develop first and design later, you are not alone (there’s at least one other developer like you and he’s typing this post). Do yourself a favor and…

Read on for more information on Andy’s design-first mentality and a sample of how you might lay out that initial design.


The Importance of Orchestration in E(L)TL Processes

Martin Schoombee begins a new series:

In the context of what we’re talking about throughout this series – facilitating the execution of an ETL process in a platform like Azure Data Factory – orchestration means that we’re using the ETL tool primarily for the “E” (Extract) part of the process. In addition to that, most people I know would also use the ETL tool to facilitate the workflow: in other words, the order of execution and any constraints that go along with it.

In what I’d like to call the “traditional” approach, for lack of a better term, all parts of the ETL process are performed natively by the tool (image below), using whatever built-in tasks are available and, of course, accounting for any nuances. With this approach, transformations are typically performed in transit and in memory.

Read on to see how the Orchestration approach differs from the traditional ETL approach.
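
As a rough sketch of the distinction (with illustrative function names of my own, not Martin's), the orchestration approach lands the data and sequences the work, then pushes transformation down to the target platform:

```python
def clean(row: dict) -> dict:
    return {k: str(v).strip() for k, v in row.items()}


def load(rows: list) -> None:
    print(f"loaded {len(rows)} transformed rows")


def load_raw(rows: list) -> None:
    print(f"landed {len(rows)} raw rows")


def run_sql(statement: str) -> None:
    print(f"submitted to target platform: {statement}")


def traditional_etl(rows: list) -> None:
    # Extract, transform in transit/in memory, then load -- all in the tool.
    load([clean(r) for r in rows])


def orchestrated_elt(rows: list) -> None:
    # The tool only extracts and controls the order of execution...
    load_raw(rows)
    # ...and the target platform does the "T", e.g. via a stored procedure.
    run_sql("EXEC stg.usp_transform_orders;")


rows = [{"id": 1, "name": " Alice "}]
traditional_etl(rows)
orchestrated_elt(rows)
```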


Metadata-Driven Pipelines in Microsoft Fabric

John Miner returns to the old ways:

What is a metadata-driven pipeline? Wikipedia defines metadata as “data that provides information about other data”. As developers, we can create a non-parameterized pipeline and/or notebook to solve a business problem. However, if we have to solve the same problem a hundred times, the amount of code can get unwieldy. A better way to solve this problem is to store metadata in the delta lake. This data will drive how the Azure Data Factory and Spark Notebooks execute.

Read on to see how you can accomplish this task.
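
As a minimal sketch of the pattern, assuming a Fabric or Spark notebook where `spark` is already defined, and with table and column names that are my assumptions rather than John's actual schema:

```python
# Read the driver metadata from a delta table in the lakehouse.
meta = spark.read.format("delta").load("Tables/pipeline_metadata")

# One generic, parameterized loop replaces a hundred near-identical pipelines.
for row in meta.filter("enabled = true").collect():
    df = (spark.read
          .format(row["source_format"])   # e.g. "csv" or "parquet"
          .load(row["source_path"]))
    (df.write
       .format("delta")
       .mode(row["write_mode"])           # e.g. "overwrite" or "append"
       .save(row["target_path"]))
```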


Full and Incremental Loads in Microsoft Fabric

John Miner continues a series on data engineering in Microsoft Fabric:

In a data lake, we have a bronze quality zone that is supposed to represent the raw data in a delta file format. This might include versions of the files for auditing. In the silver quality zone, we have a single version of truth. The data is de-duplicated and cleaned up. How can we achieve these goals using the Apache Spark engine in Microsoft Fabric?

Read on for John’s take on the answer. I’ve found that I have a fairly good answer for smaller datasets, though the larger the data gets, the less I like the answers for the raw layer.
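
For what it's worth, a common way to produce that single version of truth with Spark is a windowed de-duplication from bronze to silver. This is a minimal sketch, assuming `spark` is defined in the notebook and using hypothetical key and timestamp columns, not John's exact code:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

bronze = spark.read.format("delta").load("Tables/bronze_customers")

# Keep only the most recent version of each business key: the silver zone's
# de-duplicated "single version of truth".
latest = Window.partitionBy("customer_id").orderBy(F.col("load_ts").desc())

silver = (bronze
          .withColumn("_rn", F.row_number().over(latest))
          .filter("_rn = 1")
          .drop("_rn"))

silver.write.format("delta").mode("overwrite").save("Tables/silver_customers")
```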


Finally Blocks and Error Handling in Data Factory

Chen Hirsh doesn’t let failure get in the way of doing work:

Today I stumbled upon a weird behavior in Azure Data Factory (ADF) error handling.

ADF lets us add error handling in the control flow. In this example, I’m trying to copy some data and, if that fails, go to the on-failure branch (red line). If the activity succeeds, go to the on-success branch (green line).

These work great (if you can call a failure great…).

Let’s take another step. What if I want to run another activity at the end, no matter if the copy succeeded or failed?

The behavior is a bit weird, as it doesn’t work quite the way you’d expect. Chen, however, shows us how to do it.
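
The root of the weirdness, as I understand it, is that when an activity has multiple upstream dependencies, ADF requires all of them to be satisfied, and the branch that isn't taken ends up "Skipped". This little Python model (my own illustration, not ADF code or Chen's solution) shows why a naively wired final activity never runs, and how also accepting "Skipped" on each edge changes that:

```python
def final_runs(upstream_states: dict[str, str],
               dependencies: dict[str, set[str]]) -> bool:
    # dependencies maps an upstream activity to the set of states that
    # satisfy its edge; ADF ANDs the edges together.
    return all(upstream_states[activity] in accepted
               for activity, accepted in dependencies.items())


# The copy succeeded, so the failure branch was skipped.
states = {"on_success_step": "Succeeded", "on_failure_step": "Skipped"}

# Naive wiring: the final activity needs both branches to succeed -> skipped.
print(final_runs(states, {"on_success_step": {"Succeeded"},
                          "on_failure_step": {"Succeeded"}}))   # False

# Accept "Skipped" as well and the final activity runs either way.
print(final_runs(states, {"on_success_step": {"Succeeded", "Skipped"},
                          "on_failure_step": {"Succeeded", "Skipped"}}))  # True
```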


Loading Data from Statistics Denmark into Power BI

Erik Svensen goes over an oldie:

It turns out that the blog post I wrote 10 years ago about getting data from Statistics Denmark into Power BI with Power Query is still being used – link.

But as the API has changed a bit since then, I was asked to update the blog post – so here is how you can get the population of Denmark imported into Power BI with Power Query.

Read on to see the right way to do it today.
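
If you'd rather pull the same data in a script, Statistics Denmark exposes a public StatBank API that the Power Query solution calls. A hedged Python sketch follows; the endpoint shape, the FOLK1A table id, and the query body are my assumptions based on that API's public conventions, so check Erik's post for the authoritative Power Query version:

```python
import requests

# Request the population table as CSV from the StatBank data endpoint.
resp = requests.post(
    "https://api.statbank.dk/v1/data",
    json={
        "table": "FOLK1A",   # population of Denmark (assumed table id)
        "format": "CSV",
        "variables": [{"code": "Tid", "values": ["*"]}],  # all periods
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.text[:500])  # first rows of the CSV payload
```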
