ETL / ELT – Page 9 – Curated SQL

Metadata-Driven Pipelines in Microsoft Fabric

Published 2024-02-27 by Kevin Feasel

What is a metadata driven pipeline? Wikipedia defines metadata as “data that provides information about other data”. As a developer, we can create a non parameterized pipeline and/or notebook to solve a business problem. However, if we have to solve the same problem a hundred times, the amount of code can get unwieldly. A better way to solve this problem is to store metadata in the delta lake. This data will drive how the Azure Data Factory and Spark Notebooks execute.

Read on to see how you can accomplish this task.

Comments closed

Full and Incremental Loads in Microsoft Fabric

Published 2024-02-22 by Kevin Feasel

John Miner continues a series on data engineering in Microsoft Fabric:

In a data lake, we have a bronze quality zone that supposed to represent the raw data in a delta file format. This might include versions of the files for auditing. In the silver quality zone, we have a single version of truth. The data is de-duplicated and cleaned up. How can we achieve these goals using the Apache Spark engine in Microsoft Fabric?

Read on for John’s take on the answer. I’ve found that I have a fairly good answer for smaller datasets, though as the size of the data gets larger, the less I like answers for the raw layer.

Comments closed

Finally Blocks and Error Handling in Data Factory

Published 2024-02-07 by Kevin Feasel

Chen Hirsh doesn’t let failure get in the way of doing work:

Today I stumbled upon a weird behavior in Azure Data Factory (ADF) error handling.

ADF lets us add error handling in the flow control, In this example, I’m trying to copy some data, and if that fails go to on failure branch (red line). If the activity succeeded, go to on success branch (green line)

These work great (If you can call a failure great…).

Let’s take another step. What if I want to run another activity at the end, no matter if the copy succeeded or failed?

The behavior is a bit weird, as it doesn’t work quite the way you’d expect. Chen, however, shows us how to do it.

Comments closed

Loading Data from Statistics Denmark into Power BI

Published 2024-01-26 by Kevin Feasel

Erik Svensen goes over an oldie:

It turns out that the blogpost I wrote 10 years ago about getting data from Statistics Denmark into Power BI with Power Query still is being used – link.

But as the API has changed a bit since then I was asked to do an update of the blogpost – so here is how you can get the population of Denmark imported into Power BI with Power Query.

Read on to see the right way to do it today.

Comments closed

Notebooks versus Dataflow Gen2 in Microsoft Fabric

Published 2024-01-25 by Kevin Feasel

Gilbert Quevauvilliers takes us through a comparison:

In this blog post I am going to compare Dataflow Gen2 vs Notebook in terms of how much it costs for the workload. I will also compare usability as currently the dataflow gen2 has got a lot of built in features which makes it easier to use.

The goal of this blog post is to understand which in my opinion is cheaper and easier to use, which will then be the focus for future blog posts with regards to what I’ve learned along the way, which will hopefully assist you too.

To compare between the two workloads, I am going to be using the same source file as well as do the same transformations which will result in the same result.

Read on for a surprising difference in cost.

Comments closed

Fabric Data Pipeline for Blob Storage CSV into Azure SQL DB

Published 2024-01-02 by Kevin Feasel

Andy Leonard loads some data:

In November 2023, I shared how to start learning Microsoft Fabric in a post titled Start a Fabric Free Trial. In December 2023, I shared how to Create a Workspace in Fabric. In this post, I document one way to create a pipeline to load data from a CSV file stored in Azure Blob Storage to Azure SQL Database in your new Fabric workspace.

Click through for some key assumptions, as well as the process.

Comments closed

Fabric F2 Performance

Published 2023-12-26 by Kevin Feasel

Teo Lachev has started a new series. We begin with warehouse ETL:

As inspired by Amir Netz‘s encouragement to partners to test the Fabric F2 capacity performance, I got on a quest to test what it would do to ETL loads for Fabric Warehouse. I must admit that I was skeptical that a quarter of a core would take a warehouse off the ground, but as usual, life proved me wrong and “wrong” is a big understatement of what happened.

After provisioning a Fabric F2 capacity and a warehouse, I settled on the Retail Data Model for World Wide Importers sample star schema dataset consisting of five dimension tables and one fact table. In terms of performance, I was mostly interested in how long it would take for the ADF copy activity to insert all the data (50 million rows) in the fact table. Granted, it’s a limited test but enough to rule out the technology for real-life projects. Then, I compared the performance against Azure SQL Database Serverless running on up to 2 cores and provisioned by the free trial offer that Microsoft has on Azure. To exclude impact on data transfer between regions, both technologies were provisioned on East US 2 data region, which is the region where my Power BI tenant is hosted on.

Then we have report load time:

What a better way to spend a lazy holiday afternoon than to do more Fabric performance testing? In my previous post, I shared my results from a single-threaded ETL load test to gauge the F2 ingest performance and F2 did pretty well (or at least outperformed Azure SQL DB). Will F2 hold as parallelism increases? Throughput testing is especially important for report loads because parallel tasks can run within a report, such as visuals executing DAX queries in parallel, and across reports, such as when concurrent report requests overlap.

I’m legitimately surprised at the results. I expected F2 to be barely sufficient for testing purposes. Read both posts to see how it performs and some caveats around performance.

Comments closed

Trying to Load a Table in Microsoft Fabric

Published 2023-12-19 by Kevin Feasel

Eugene Meidinger walks onto a field of rakes:

Last week, I struggled to load the data into Fabric, but finally got it into a Lakehouse. I was starting to run into a lot of frustration, and so it seemed like a good time to back up and get more oriented about the different pieces of Fabric and how they fit together. In my experience, it’s often most effective to try to do something, review some learning, and alternate. Without a particular pain point, it’s hard for the information to stick.

Read on for some thoughts on andragogy, learning paths, and travails loading data.

Comments closed

Warehousing and Power BI in Microsoft Fabric

Published 2023-12-18 by Kevin Feasel

Tomaz Kastrun continues a series on Microsoft Fabric. Day 15 covers building a warehouse:

I have named my as “Advent2023_DWH”.

You can create a warehouse using T-SQL scripts, from data flow gen2, from data pipelines and from the sample data. Let’s select the sample data and grab a coffee.

Day 16 looks at data pipelines:

With the Fabric warehouse created and explored, let’s see, how we can use pipelines to get the data into Fabric warehouse.

In the existing data warehouse, we will introduce new data. By clicking “new data”, two options will be available; pipelines and dataflows. Select the pipelines and give it a name.

And Day 17 provides a primer on how Power BI can read Fabric assets:

Within the Power BI in Fabric, you will find many of the components, that can be used to create a final report. And here are the components:

Comments closed

Deactivating Pipeline Activities in Microsoft Fabric

Published 2023-11-14 by Kevin Feasel

Koen Verbeeck shows us a convenient action you can perform in Microsoft Fabric pipelines:

A while ago I had a little blog post series about cool stuff in Snowflake. I’m doing a similar series now, but this time for Microsoft Fabric. I’m not going to cover the basic of Fabric, hundreds of bloggers have already done that. I’m going to cover little bits & pieces that I find interesting, that are similar to Snowflake features or something that is an improvement over the “regular” SQL Server or related products.

In this blog post I’m highlighting the fact we can now deactivate activities in a pipeline

Read on to see how you can do this and what the implications of the action are.

Comments closed

Category: ETL / ELT