Press "Enter" to skip to content

Category: ETL / ELT

Executing SQL Statements in Azure Data Factory

Abhishek Narain announces a pretty nice improvement to Azure Data Factory and Synapse Pipelines:

We are introducing a Script activity in pipelines that provides the ability to execute single or multiple SQL statements.

Using the Script activity, you can execute common operations with Data Manipulation Language (DML) and Data Definition Language (DDL). DML statements like SELECT, INSERT, UPDATE, and DELETE let users retrieve, insert, modify, and delete data in the database. DDL statements like CREATE, ALTER, and DROP allow a database manager to create, modify, and remove database objects such as tables, indexes, and users.

Be sure to read the limitations at the bottom, however.
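To make the DML/DDL distinction concrete, here's a minimal sketch of the kinds of statements such an activity might run. I'm executing them via pyodbc purely for illustration (a Script activity would go through a linked service instead), and the server, database, and table names are all made up:

```python
import pyodbc

# Hypothetical connection; a Script activity would instead reach the
# database through a linked service.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:myserver.database.windows.net,1433;"
    "Database=mydb;Authentication=ActiveDirectoryInteractive;",
    autocommit=True,
)
cursor = conn.cursor()

# DDL: create a staging table.
cursor.execute("""
    CREATE TABLE dbo.StagingOrders (
        OrderId INT NOT NULL,
        Amount  DECIMAL(18, 2) NOT NULL
    );
""")

# DML: insert a row, then read it back.
cursor.execute("INSERT INTO dbo.StagingOrders VALUES (1, 42.50);")
cursor.execute("SELECT OrderId, Amount FROM dbo.StagingOrders;")
for row in cursor.fetchall():
    print(row.OrderId, row.Amount)

conn.close()
```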


Creating an Azure Integration Runtime

Andy Leonard builds out an Azure Integration Runtime:

Many Azure Data Factory developers recommend creating an Azure Integration Runtime for use with Mapping Data Flows. Why? One reason is you cannot configure all the options in the default AutoResolveIntegrationRuntime supplied when an Azure Data Factory instance is provisioned.

At the time of this writing, it’s not obvious how one creates an Azure Integration Runtime. You would think creating an integration runtime would begin with…

It turns out to be a little trickier than you might first expect.
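If you'd rather script it than hunt through the portal, here's a sketch using the azure-mgmt-datafactory Python SDK. The resource names are placeholders and the compute settings are illustrative defaults, so treat this as a starting point rather than Andy's method:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    ManagedIntegrationRuntime,
    IntegrationRuntimeComputeProperties,
    IntegrationRuntimeDataFlowProperties,
)

# Placeholder subscription ID.
client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# An Azure (managed) IR with data flow compute properties, which is what
# the default AutoResolveIntegrationRuntime won't let you fully configure.
ir = IntegrationRuntimeResource(
    properties=ManagedIntegrationRuntime(
        compute_properties=IntegrationRuntimeComputeProperties(
            location="AutoResolve",
            data_flow_properties=IntegrationRuntimeDataFlowProperties(
                compute_type="General",
                core_count=8,
                time_to_live=10,  # minutes the Spark cluster stays warm
            ),
        )
    )
)

# Positional args: resource group, factory name, IR name, IR definition.
client.integration_runtimes.create_or_update(
    "rg-adf", "my-data-factory", "AzureIR-DataFlows", ir
)
```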


From Cosmos DB to Dedicated SQL Pools via Synapse Link

Jovan Popovic shows off Azure Synapse Link:

At the time of writing this article, the dedicated SQL pool doesn’t have the ability to read data from CosmosDB/Dataverse using the Synapse link. There are scenarios where you would need to use CosmosDB data in your dedicated SQL pool, so you need to find another way to load the data. In theory, you could create an ADF pipeline that reads data from CosmosDB or Dataverse and stores it in the dedicated SQL pool as a target. This might be a problem if your pipeline reads directly from the CosmosDB account, because it might impact both operational workload performance and cost. The analytical store is the recommended location from which to fetch all data from CosmosDB/Dataverse.

In this post, I will describe how to use a two-step approach where you export your data using the serverless SQL pool via Synapse link into Azure Data Lake storage, and then load the data into the dedicated SQL pool table. This process is shown in a figure in the original post.

A couple of weeks back, I wrote about another method of doing this through the Spark pool. Now you have two options.
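To make the two steps concrete, here's a rough sketch: a CETAS statement on the serverless SQL pool exports the Cosmos DB analytical store to the lake, and a COPY INTO statement on the dedicated SQL pool loads the resulting Parquet files. This is not Jovan's exact code; every name (workspace, account, external data source, file format, tables) is a placeholder, and the external objects are assumed to already exist:

```python
import pyodbc

# Step 1: on the serverless SQL pool, export the Cosmos DB analytical
# store to the data lake with CETAS. The external data source (MyDataLake)
# and file format (ParquetFormat) are assumed to exist already.
serverless = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"
    "Database=mydb;Authentication=ActiveDirectoryInteractive;",
    autocommit=True,
)
serverless.execute("""
    CREATE EXTERNAL TABLE exported.Orders
    WITH (LOCATION = 'orders/', DATA_SOURCE = MyDataLake, FILE_FORMAT = ParquetFormat)
    AS SELECT *
    FROM OPENROWSET(
        'CosmosDB',
        'Account=myaccount;Database=mydb;Key=<account-key>',
        Orders
    ) AS docs;
""")

# Step 2: on the dedicated SQL pool, load the exported Parquet files with
# COPY INTO (dbo.Orders must already exist with a matching schema).
dedicated = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myworkspace.sql.azuresynapse.net;"
    "Database=mypool;Authentication=ActiveDirectoryInteractive;",
    autocommit=True,
)
dedicated.execute("""
    COPY INTO dbo.Orders
    FROM 'https://mystorage.dfs.core.windows.net/export/orders/*.parquet'
    WITH (FILE_TYPE = 'PARQUET', CREDENTIAL = (IDENTITY = 'Managed Identity'));
""")
```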


Simple Mapping Data Flows in Synapse

Joshuha Owen announces a new feature:

This week, we are excited to announce the public preview for Map Data, a new feature for Azure Synapse Analytics and Database Templates! The Map Data tool is a guided process to help users create ETL mappings and mapping data flows from their source data to Synapse lake database tables without writing code. This experience will help you get started with transformations into your Synapse lake database quickly but still give you the power of Mapping Data Flows.

This process starts with the user choosing the destination tables in Synapse lake databases and then mapping their source data into these tables. We will be following up with a demo video shortly.

Click through for more details on how it works.


Streaming Data to Event Hubs via Kafka Connect and Debezium

Niels Berglund starts off a two-part sub-series within a series:

This post is the first of two looking at whether and how we can stream data to Event Hubs from Debezium. Initially I had planned only one post covering this, but it turned out that the post would be too long, so I split it in two.

It started with the post, How to Use Kafka Client with Azure Event Hubs. In that post, I looked at how the Kafka client can publish messages not only to Apache Kafka but also to Azure Event Hubs. In the post, I said something like:

An interesting point here is that it is not only your Kafka applications that can publish to Event Hubs but any application that uses Kafka Client 1.0+, like Kafka Connect connectors!

Click through for the first part of this pairing.
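To give a flavor of that point, here's a minimal sketch of a Kafka client publishing to Event Hubs rather than to a Kafka broker. Event Hubs exposes a Kafka-compatible endpoint on port 9093 and accepts SASL PLAIN with the namespace connection string; the namespace, event hub (topic) name, and connection string below are placeholders:

```python
from confluent_kafka import Producer

conf = {
    # Event Hubs' Kafka-compatible endpoint.
    "bootstrap.servers": "my-namespace.servicebus.windows.net:9093",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    # The username is literally "$ConnectionString"; the password is the
    # Event Hubs namespace connection string.
    "sasl.username": "$ConnectionString",
    "sasl.password": "Endpoint=sb://my-namespace.servicebus.windows.net/;<rest-of-connection-string>",
}

producer = Producer(conf)
# The event hub plays the role of the Kafka topic.
producer.produce("my-event-hub", value=b'{"id": 1, "op": "insert"}')
producer.flush()
```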


Azure Data Factory Activity Queue Times

Meagan Longoria waits in line:

I’ve been working on a project to populate an Operational Data Store using Azure Data Factory (ADF). We have been seeking to tune our pipelines so we can import data every 15 minutes. After tuning the queries and adding useful indexes to target databases, we turned our attention to the ADF activity durations and queue times.

Data Factory places the pipeline activities into a queue, where they wait until they can be executed. If your queue time is long, it can mean that the Integration Runtime on which the activity is executing is waiting on resources (CPU, memory, networking, or otherwise), or that you need to increase the concurrent job limit.

Click through to see how you can calculate queue times across activities, pipelines, and data factories.
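If you'd rather pull those numbers programmatically than read them off the monitoring UI, one option (not necessarily Meagan's method) is to query activity runs with the azure-mgmt-datafactory SDK; for Copy activities, a queuing duration appears in the activity output. The subscription, resource names, and run ID below are placeholders:

```python
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

now = datetime.now(timezone.utc)
filter_params = RunFilterParameters(
    last_updated_after=now - timedelta(hours=24),
    last_updated_before=now,
)

# Positional args: resource group, factory name, pipeline run ID, filter.
runs = client.activity_runs.query_by_pipeline_run(
    "rg-adf", "my-data-factory", "<pipeline-run-id>", filter_params
)

for activity in runs.value:
    output = activity.output or {}
    # Copy activity output carries detailed durations, including the
    # time spent queuing; other activity types may not populate this.
    details = (output.get("executionDetails") or [{}])[0]
    durations = details.get("detailedDurations", {})
    print(activity.activity_name, durations.get("queuingDuration"))
```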


Preventing Concurrent Pipeline Execution in Azure Data Factory

Dave Ruijter and Laura de Bruin want to prevent concurrent runs of a pipeline:

For scheduled triggers, there is nothing out of the box that can help you prevent concurrent pipeline runs. For tumbling window triggers, there is a maxConcurrency property, but keep in mind that this will create a queue/backlog of pipeline runs; it will not cancel any pipeline runs. Whether you really want that behavior depends on your use case.

Instead, the two look at a pair of designs; this post is all about the first one.
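As a hedged sketch of how such a guard might look from the outside (this may well differ from the authors' design), you can query the factory for in-progress runs of the same pipeline via the Python SDK and stop if one already exists. All names below are placeholders:

```python
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters, RunQueryFilter

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

now = datetime.now(timezone.utc)
# Positional args: resource group, factory name, filter parameters.
response = client.pipeline_runs.query_by_factory(
    "rg-adf",
    "my-data-factory",
    RunFilterParameters(
        last_updated_after=now - timedelta(days=1),
        last_updated_before=now,
        filters=[
            RunQueryFilter(operand="PipelineName", operator="Equals",
                           values=["MyPipeline"]),
            RunQueryFilter(operand="Status", operator="Equals",
                           values=["InProgress"]),
        ],
    ),
)

# Ignore the run doing the checking; if anything else is in flight, bail.
in_progress = [r for r in response.value if r.run_id != "<current-run-id>"]
if in_progress:
    raise SystemExit("Another run of MyPipeline is already in progress.")
```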


Deploying dbt on Databricks

Dave Eyler, et al., have a great announcement:

At Databricks, nothing makes us happier than making our users more productive, which is why we are delighted to announce a native adapter for dbt. It’s now easier than ever to develop robust data pipelines on Databricks using SQL.

dbt is a popular open source tool that lets a new breed of ‘analytics engineer’ build data pipelines using simple SQL. Everything is organized within directories, as plain text, making version control, deployment, and testability simple.

Click through for more information on how this works and how you can get the native adapter.
