Press "Enter" to skip to content

Category: ETL / ELT

Metadata Tables and Azure Data Factory

Martin Schoombee brings back metadata tables:

The metadata that drives the execution within a framework is probably the most critical part. Going back to our analogy of building a house, the metadata would be the foundation. It is here where you are going to make some architectural decisions outside of which the framework cannot operate.

One such decision is how configurable or flexible you’d like the framework to be. In other words, how many attributes would you like to be dynamic and/or have the option to change during execution. It seems like an easy choice, and most engineers would lean towards “everything” or “as much as possible” as an answer. In reality, however, the trade-off is complexity: the more dynamic you make the framework, the more complicated it becomes. And you pay for that complexity later, when you need to maintain it or add new functionality.

Read on to see how it all fits together.
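To make the trade-off concrete, here is a minimal Python sketch of the idea: a metadata table with only a few dynamic attributes per source, each row feeding one parameterized pipeline run. The column names and the dispatch loop are illustrative assumptions, not Martin’s schema:

```python
# Hypothetical metadata rows: each one drives a single parameterized pipeline run.
# Column names are illustrative assumptions, not taken from the source post.
metadata = [
    {"source_name": "sales",  "source_query": "SELECT * FROM dbo.Sales",
     "target_table": "stg.Sales",  "is_enabled": True},
    {"source_name": "orders", "source_query": "SELECT * FROM dbo.Orders",
     "target_table": "stg.Orders", "is_enabled": False},
]

def build_pipeline_parameters(row: dict) -> dict:
    """Map one metadata row to the parameters a generic copy pipeline would receive."""
    return {"query": row["source_query"], "sink": row["target_table"]}

for row in metadata:
    # is_enabled is one cheap piece of flexibility; every extra dynamic
    # attribute beyond this adds the maintenance cost Martin warns about.
    if not row["is_enabled"]:
        continue
    print(f"Would trigger copy pipeline for {row['source_name']} "
          f"with {build_pipeline_parameters(row)}")
```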


Tips for Configuring Alerts for Azure Data Factory

Teo Lachev shares some advice:

Alerting is an important monitoring task for any ETL process. Azure Data Factory can integrate with a generic Azure event framework (Azure Monitor), which makes it somewhat unintuitive for ETL monitoring. You can set up and change the alerts using the ADF Monitoring hub.

Read on for five pieces of advice, covering in particular how to set up one of these alerts.


A Scaffolding Design Pattern for Microsoft Fabric Pipelines

Andy Leonard shares some thoughts on design:

When assigned a project, it’s tempting – and dangerous – to Just Start Coding. If you suffer from the urge to develop first and design later, you are not alone (there’s at least one other developer like you and he’s typing this post). Do yourself a favor and…

Read on for more information on Andy’s design-first mentality and a sample of how you might lay out that initial design.


The Importance of Orchestration in E(L)TL Processes

Martin Schoombee begins a new series:

In the context of what we’re talking about throughout this series – facilitating the execution of an ETL process in a platform like Azure Data Factory – orchestration means that we’re using the ETL tool primarily for the “E” (Extract) part of the process. In addition to that, most people I know would also use the ETL tool to facilitate the workflow, in other words, the order of execution and any constraints that go along with it.

In what I’d like to call the “traditional” approach for lack of a better term, all parts of the ETL process are performed natively by the tool (image below), using whatever built-in tasks are available and of course accounting for any nuances. With this approach, transformations are typically performed in transit and in memory.

Read on to see how the Orchestration approach differs from the traditional ETL approach.
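As a rough sketch of the distinction, assuming a SQL engine that owns the transformations: in the orchestration approach the tool extracts and sequences, while the “T” is pushed down to the database rather than performed in transit and in memory. The function names and the stored procedure below are hypothetical placeholders, not ADF activities:

```python
# A minimal sketch of the orchestration approach; function bodies are
# placeholders standing in for pipeline activities, not real ADF calls.

def extract(source: str) -> str:
    """The 'E': land raw data in a staging area (blob storage, lake, staging table)."""
    print(f"Extracting {source} into staging")
    return f"staging.{source}"

def transform_in_database(staging_table: str) -> None:
    """Push the 'T' down to the engine holding the data, e.g. via a stored
    procedure, instead of transforming in transit and in memory."""
    print(f"EXEC dbo.usp_Transform @table = '{staging_table}'")  # hypothetical proc

# The orchestrator's real job is the workflow: order of execution and constraints.
for source in ["sales", "orders"]:
    staged = extract(source)          # extract must complete before transform starts
    transform_in_database(staged)
```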


Metadata-Driven Pipelines in Microsoft Fabric

John Miner returns to the old ways:

What is a metadata-driven pipeline? Wikipedia defines metadata as “data that provides information about other data”. As developers, we can create a non-parameterized pipeline and/or notebook to solve a business problem. However, if we have to solve the same problem a hundred times, the amount of code can get unwieldy. A better way to solve this problem is to store metadata in the delta lake. This data will drive how the Azure Data Factory pipelines and Spark notebooks execute.

Read on to see how you can accomplish this task.
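The shape of the approach, in a rough PySpark sketch: read a metadata table from the lakehouse and let each row drive a generic load. This assumes a Fabric notebook where the `spark` session is predefined; the table paths and column names are hypothetical:

```python
# Assumes a Fabric/Spark environment where `spark` already exists.
# The metadata table path and its columns are hypothetical examples.
meta = spark.read.format("delta").load("Tables/pipeline_metadata")

# One generic notebook handles a hundred sources: each metadata row says
# what to read, in which format, and where to write it.
for row in meta.where("is_enabled = true").collect():
    df = spark.read.format(row["source_format"]).load(row["source_path"])
    df.write.format("delta").mode("append").save(row["target_path"])
```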


Full and Incremental Loads in Microsoft Fabric

John Miner continues a series on data engineering in Microsoft Fabric:

In a data lake, we have a bronze quality zone that is supposed to represent the raw data in a delta file format. This might include versions of the files for auditing. In the silver quality zone, we have a single version of the truth. The data is de-duplicated and cleaned up. How can we achieve these goals using the Apache Spark engine in Microsoft Fabric?

Read on for John’s take on the answer. I’ve found that I have a fairly good answer for smaller datasets, though the larger the data gets, the less I like the answers for the raw layer.
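For a sense of the mechanics, here is a minimal PySpark sketch (not John’s code) of the bronze/silver split, assuming an existing `spark` session and hypothetical lakehouse paths: bronze appends raw versions for auditing, silver keeps the latest de-duplicated row per business key.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Bronze: append raw data as-is, stamping each load so old versions survive for auditing.
raw = (spark.read.format("csv").option("header", "true")
            .load("Files/landing/customers"))                 # hypothetical path
raw = raw.withColumn("_load_ts", F.current_timestamp())
raw.write.format("delta").mode("append").save("Tables/bronze_customers")

# Silver: a single version of the truth - keep only the latest row per business key.
bronze = spark.read.format("delta").load("Tables/bronze_customers")
latest = Window.partitionBy("customer_id").orderBy(F.col("_load_ts").desc())
silver = (bronze.withColumn("_rn", F.row_number().over(latest))
                .where("_rn = 1")
                .drop("_rn"))
silver.write.format("delta").mode("overwrite").save("Tables/silver_customers")
```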


Finally Blocks and Error Handling in Data Factory

Chen Hirsh doesn’t let failure get in the way of doing work:

Today I stumbled upon a weird behavior in Azure Data Factory (ADF) error handling.

ADF lets us add error handling in the flow control. In this example, I’m trying to copy some data; if that fails, we go to the on-failure branch (red line). If the activity succeeds, we go to the on-success branch (green line).

These work great (if you can call a failure great…).

Let’s take another step. What if I want to run another activity at the end, no matter if the copy succeeded or failed?

The behavior is a bit weird, as it doesn’t work quite the way you’d expect. Chen, however, shows us how to do it.
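In code terms, what’s being asked for is try/finally semantics: the last activity should run whether the copy succeeds or fails. Here is a plain Python illustration of that behavior; in ADF itself, the closest built-in analogue is wiring the final activity with the “Completion” dependency condition rather than Success or Failure.

```python
def copy_data() -> None:
    raise RuntimeError("copy activity failed")    # simulate the failing copy

try:
    copy_data()
    print("on-success branch (green line)")
except RuntimeError as err:
    print(f"on-failure branch (red line): {err}")
finally:
    print("final activity: runs regardless of success or failure")
```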


Loading Data from Statistics Denmark into Power BI

Erik Svensen goes over an oldie:

It turns out that the blog post I wrote 10 years ago about getting data from Statistics Denmark into Power BI with Power Query is still being used – link.

But as the API has changed a bit since then, I was asked to update the blog post – so here is how you can get the population of Denmark imported into Power BI with Power Query.

Read on to see the right way to do it today.
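For readers outside Power Query, here is a rough Python sketch of the same kind of call against Statistics Denmark’s public StatBank API. The endpoint shape, the folk1a table id, and the variable code are assumptions about that API, not code from Erik’s post:

```python
import requests

# Assumed endpoint and payload shape for the StatBank API (api.statbank.dk);
# "folk1a" (population) and the "Tid" time variable are assumptions to verify.
response = requests.post(
    "https://api.statbank.dk/v1/data",
    json={
        "table": "folk1a",
        "format": "CSV",
        "variables": [{"code": "Tid", "values": ["*"]}],  # all available periods
    },
    timeout=30,
)
response.raise_for_status()
print(response.text[:500])   # first CSV rows: Denmark's population over time
```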


Notebooks versus Dataflow Gen2 in Microsoft Fabric

Gilbert Quevauvilliers takes us through a comparison:

In this blog post I am going to compare Dataflow Gen2 vs Notebook in terms of how much the workload costs. I will also compare usability, as Dataflow Gen2 currently has a lot of built-in features which make it easier to use.

The goal of this blog post is to understand which option is, in my opinion, cheaper and easier to use; that option will then be the focus of future blog posts on what I’ve learned along the way, which will hopefully assist you too.

To compare the two workloads, I am going to use the same source file and perform the same transformations, which will produce the same result.

Read on for a surprising difference in cost.
