Press "Enter" to skip to content

Category: ETL / ELT

Metadata Tables and Azure Data Factory

Martin Schoombee brings back metadata tables:

The metadata that drives the execution within a framework is probably the most critical part. Going back to our analogy of building a house, the metadata would be the foundation. It is here where you are going to make some architectural decisions outside of which the framework cannot operate.

One such decision is how configurable or flexible you’d like the framework to be. In other words, how many attributes would you like to be dynamic and/or have the option to change during execution. It seems like an easy choice, and most engineers would lean towards “everything” or “as much as possible” as an answer. In reality, however, the trade-off is complexity: the more dynamic you make the framework, the more complicated it becomes. And you pay for that complexity later, when you need to maintain it or add new functionality.

Read on to see how it all fits together.
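To make the trade-off concrete, here is a minimal Python sketch of the idea: a metadata table with only a few dynamic attributes per source, each row feeding one parameterized pipeline run. The column names and the dispatch loop are illustrative assumptions, not Martin’s schema:

```python
# Hypothetical metadata rows: each one drives a single parameterized pipeline run.
# Column names are illustrative assumptions, not taken from the source post.
metadata = [
    {"source_name": "sales",  "source_query": "SELECT * FROM dbo.Sales",
     "target_table": "stg.Sales",  "is_enabled": True},
    {"source_name": "orders", "source_query": "SELECT * FROM dbo.Orders",
     "target_table": "stg.Orders", "is_enabled": False},
]

def build_pipeline_parameters(row: dict) -> dict:
    """Map one metadata row to the parameters a generic copy pipeline would receive."""
    return {"query": row["source_query"], "sink": row["target_table"]}

for row in metadata:
    # is_enabled is one cheap piece of flexibility; every extra dynamic
    # attribute beyond this adds the maintenance cost Martin warns about.
    if not row["is_enabled"]:
        continue
    print(f"Would trigger copy pipeline for {row['source_name']} "
          f"with {build_pipeline_parameters(row)}")
```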


Tips for Configuring Alerts for Azure Data Factory

Teo Lachev shares some advice:

Alerting is an important monitoring task for any ETL process. Azure Data Factory can integrate with a generic Azure event framework (Azure Monitor), which makes it somewhat unintuitive for ETL monitoring. You can set up and change the alerts using the ADF Monitoring hub.

Read on for five pieces of advice, covering in particular how to set up one of these alerts.


A Scaffolding Design Pattern for Microsoft Fabric Pipelines

Andy Leonard shares some thoughts on design:

When assigned a project, it’s tempting – and dangerous – to Just Start Coding. If you suffer from the urge to develop first and design later, you are not alone (there’s at least one other developer like you and he’s typing this post). Do yourself a favor and…

Read on for more information on Andy’s design-first mentality and a sample of how you might lay out that initial design.


The Importance of Orchestration in E(L)TL Processes

Martin Schoombee begins a new series:

In the context of what we’re talking about throughout this series – facilitating the execution of an ETL process in a platform like Azure Data Factory – orchestration means that we’re using the ETL tool primarily for the “E” (Extract) part of the process. In addition to that, most people I know would also use the ETL tool to facilitate the workflow, in other words, the order of execution and any constraints that go along with it.

In what I’d like to call the “traditional” approach for lack of a better term, all parts of the ETL process are performed natively by the tool (image below), using whatever built-in tasks are available and of course accounting for any nuances. With this approach, transformations are typically performed in transit and in memory.

Read on to see how the Orchestration approach differs from the traditional ETL approach.
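As a rough sketch of the distinction, assuming a SQL engine that owns the transformations: in the orchestration approach the tool extracts and sequences, while the “T” is pushed down to the database rather than performed in transit and in memory. The function names and the stored procedure below are hypothetical placeholders, not ADF activities:

```python
# A minimal sketch of the orchestration approach; function bodies are
# placeholders standing in for pipeline activities, not real ADF calls.

def extract(source: str) -> str:
    """The 'E': land raw data in a staging area (blob storage, lake, staging table)."""
    print(f"Extracting {source} into staging")
    return f"staging.{source}"

def transform_in_database(staging_table: str) -> None:
    """Push the 'T' down to the engine holding the data, e.g. via a stored
    procedure, instead of transforming in transit and in memory."""
    print(f"EXEC dbo.usp_Transform @table = '{staging_table}'")  # hypothetical proc

# The orchestrator's real job is the workflow: order of execution and constraints.
for source in ["sales", "orders"]:
    staged = extract(source)          # extract must complete before transform starts
    transform_in_database(staged)
```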


Metadata-Driven Pipelines in Microsoft Fabric

John Miner returns to the old ways:

What is a metadata-driven pipeline? Wikipedia defines metadata as “data that provides information about other data”. As developers, we can create a non-parameterized pipeline and/or notebook to solve a business problem. However, if we have to solve the same problem a hundred times, the amount of code can get unwieldy. A better way to solve this problem is to store metadata in the delta lake. This data will drive how the Azure Data Factory pipelines and Spark notebooks execute.

Read on to see how you can accomplish this task.
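The shape of the approach, in a rough PySpark sketch: read a metadata table from the lakehouse and let each row drive a generic load. This assumes a Fabric notebook where the `spark` session is predefined; the table paths and column names are hypothetical:

```python
# Assumes a Fabric/Spark environment where `spark` already exists.
# The metadata table path and its columns are hypothetical examples.
meta = spark.read.format("delta").load("Tables/pipeline_metadata")

# One generic notebook handles a hundred sources: each metadata row says
# what to read, in which format, and where to write it.
for row in meta.where("is_enabled = true").collect():
    df = spark.read.format(row["source_format"]).load(row["source_path"])
    df.write.format("delta").mode("append").save(row["target_path"])
```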


Full and Incremental Loads in Microsoft Fabric

John Miner continues a series on data engineering in Microsoft Fabric:

In a data lake, we have a bronze quality zone that is supposed to represent the raw data in a delta file format. This might include versions of the files for auditing. In the silver quality zone, we have a single version of the truth. The data is de-duplicated and cleaned up. How can we achieve these goals using the Apache Spark engine in Microsoft Fabric?

Read on for John’s take on the answer. I’ve found that I have a fairly good answer for smaller datasets, though the larger the data gets, the less I like the answers for the raw layer.
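For a sense of the mechanics, here is a minimal PySpark sketch (not John’s code) of the bronze/silver split, assuming an existing `spark` session and hypothetical lakehouse paths: bronze appends raw versions for auditing, silver keeps the latest de-duplicated row per business key.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Bronze: append raw data as-is, stamping each load so old versions survive for auditing.
raw = (spark.read.format("csv").option("header", "true")
            .load("Files/landing/customers"))                 # hypothetical path
raw = raw.withColumn("_load_ts", F.current_timestamp())
raw.write.format("delta").mode("append").save("Tables/bronze_customers")

# Silver: a single version of the truth - keep only the latest row per business key.
bronze = spark.read.format("delta").load("Tables/bronze_customers")
latest = Window.partitionBy("customer_id").orderBy(F.col("_load_ts").desc())
silver = (bronze.withColumn("_rn", F.row_number().over(latest))
                .where("_rn = 1")
                .drop("_rn"))
silver.write.format("delta").mode("overwrite").save("Tables/silver_customers")
```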


Finally Blocks and Error Handling in Data Factory

Chen Hirsh doesn’t let failure get in the way of doing work:

Today I stumbled upon a weird behavior in Azure Data Factory (ADF) error handling.

ADF lets us add error handling in the flow control. In this example, I’m trying to copy some data; if that fails, we go to the on-failure branch (red line). If the activity succeeds, we go to the on-success branch (green line).

These work great (if you can call a failure great…).

Let’s take another step. What if I want to run another activity at the end, no matter if the copy succeeded or failed?

The behavior is a bit weird, as it doesn’t work quite the way you’d expect. Chen, however, shows us how to do it.
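In code terms, what’s being asked for is try/finally semantics: the last activity should run whether the copy succeeds or fails. Here is a plain Python illustration of that behavior; in ADF itself, the closest built-in analogue is wiring the final activity with the “Completion” dependency condition rather than Success or Failure.

```python
def copy_data() -> None:
    raise RuntimeError("copy activity failed")    # simulate the failing copy

try:
    copy_data()
    print("on-success branch (green line)")
except RuntimeError as err:
    print(f"on-failure branch (red line): {err}")
finally:
    print("final activity: runs regardless of success or failure")
```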


Loading Data from Statistics Denmark into Power BI

Erik Svensen goes over an oldie:

It turns out that the blog post I wrote 10 years ago about getting data from Statistics Denmark into Power BI with Power Query is still being used – link.

But as the API has changed a bit since then, I was asked to update the blog post – so here is how you can get the population of Denmark imported into Power BI with Power Query.

Read on to see the right way to do it today.
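For readers outside Power Query, here is a rough Python sketch of the same kind of call against Statistics Denmark’s public StatBank API. The endpoint shape, the folk1a table id, and the variable code are assumptions about that API, not code from Erik’s post:

```python
import requests

# Assumed endpoint and payload shape for the StatBank API (api.statbank.dk);
# "folk1a" (population) and the "Tid" time variable are assumptions to verify.
response = requests.post(
    "https://api.statbank.dk/v1/data",
    json={
        "table": "folk1a",
        "format": "CSV",
        "variables": [{"code": "Tid", "values": ["*"]}],  # all available periods
    },
    timeout=30,
)
response.raise_for_status()
print(response.text[:500])   # first CSV rows: Denmark's population over time
```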


Notebooks versus Dataflow Gen2 in Microsoft Fabric

Gilbert Quevauvilliers takes us through a comparison:

In this blog post I am going to compare Dataflow Gen2 vs Notebook in terms of how much the workload costs. I will also compare usability, as Dataflow Gen2 currently has a lot of built-in features which make it easier to use.

The goal of this blog post is to understand which option is, in my opinion, cheaper and easier to use; that option will then be the focus of future blog posts on what I’ve learned along the way, which will hopefully assist you too.

To compare the two workloads, I am going to use the same source file and perform the same transformations, which will produce the same result.

Read on for a surprising difference in cost.
