In our next step, we loop through all tables in that database (feel free to limit the results by playing with GetDatabaseSchema) and create a FlatFileFormat for each of them. We will include all columns except those with datatype Binary or Object. As flat files don’t really care about actual data formats, we will just define every column as a string with maximum length. We will also add an annotation with the table’s original name, the list of columns, and a list of primary keys (we’ll need the latter for a later step :)):
Like most Biml-related things, it’s not that many lines of code, so check it out.
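If you just want the gist of the logic, here’s my own rough approximation in Python rather than the post’s BimlScript – the connection string, the excluded-type list, and the output structure are all placeholders, so treat it as a sketch of the idea rather than the actual code:

```python
# Sketch only: one flat-file definition per table, every column typed as a max-length
# string, with primary keys captured for a later step. Connection string is a placeholder.
import pyodbc

EXCLUDED_TYPES = {"binary", "varbinary", "image", "sql_variant"}  # rough stand-in for Binary/Object

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;DATABASE=SourceDb;Trusted_Connection=yes")
cur = conn.cursor()

# Columns per table, skipping types we can't sensibly dump to a flat file
cur.execute("""
    SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE
    FROM INFORMATION_SCHEMA.COLUMNS
    ORDER BY TABLE_SCHEMA, TABLE_NAME, ORDINAL_POSITION
""")
formats = {}
for schema, table, column, data_type in cur.fetchall():
    if data_type.lower() in EXCLUDED_TYPES:
        continue
    fmt = formats.setdefault((schema, table), {"columns": [], "primary_keys": []})
    # Flat files don't care about types, so everything becomes a maximum-length string
    fmt["columns"].append({"name": column, "type": "String", "length": -1})

# Primary key columns go into the per-table annotation for later use
cur.execute("""
    SELECT tc.TABLE_SCHEMA, tc.TABLE_NAME, kcu.COLUMN_NAME
    FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS tc
    JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcu ON kcu.CONSTRAINT_NAME = tc.CONSTRAINT_NAME
    WHERE tc.CONSTRAINT_TYPE = 'PRIMARY KEY'
""")
for schema, table, column in cur.fetchall():
    if (schema, table) in formats:
        formats[(schema, table)]["primary_keys"].append(column)

for (schema, table), fmt in formats.items():
    print(f"FlatFileFormat for {schema}.{table}: {len(fmt['columns'])} columns, PK = {fmt['primary_keys']}")
```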
The bulk-load tools have been in the product for a long time and they are showing their age. When they work for you, they are powerful. But you need to understand that these tools are binary at heart, and they have no built-in rule that says that each line in a file is a record – they don’t even think in lines. You also need to understand that there are file formats they are not able to handle.
I have tried to arrange the material in this article so that if you have a simple problem, you only need to read the first two chapters after the introduction. I first introduce you to their mindset, which is likely to be different from yours. Next I cover the basic options to use for everyday work. If you have a more complex file, you will need to use a format file, and the next three chapters are for you. I first describe how format files work as such, and the next two chapters show how to use format files for common cases for import and export respectively. This is followed by a chapter about Unicode files, including files encoded in UTF‑8. Then comes a chapter about “advanced” options, including how to load explicit values into an IDENTITY column. A short chapter covers permissions. The last chapter discusses XML format files, and I am not sorry at all if you give this chapter a blind eye – I find XML format files to be of dubious value.
I haven’t had a chance to read this yet, but because I have never had good luck with bcp and BULK INSERT, it’s on my to-read list.
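The “they don’t think in lines” point is the one that trips people up: you tell bcp what the field and row terminators are, and it takes you literally. As a quick illustration (server, table, and file path are placeholders, and this is just one way to drive it), shelling out to bcp with explicit terminators looks like this:

```python
# Minimal sketch: invoking bcp with explicit field and row terminators.
# Server, database, table, and file path below are placeholders.
import subprocess

cmd = [
    "bcp", "SalesDb.dbo.Orders", "in", r"C:\loads\orders.dat",
    "-S", "localhost",     # server
    "-T",                  # trusted (Windows) authentication
    "-c",                  # character mode
    "-t", "|",             # field terminator: stated, not assumed
    "-r", r"\r\n",         # row terminator: bcp has no built-in notion of "a line is a record"
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
if result.returncode != 0:
    print(result.stderr)
```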
Ben Weissman has a two-part series on loading a set of tables based on foreign key constraints. Part 1 is linear loads:
All our previous posts were running data loads in parallel, ignoring potential foreign key constraints. But in real-life scenarios, your data warehouse may actually have tables referring to each other through such constraints, meaning that it is crucial to create and populate them in the right order.
In this blog post, we’ll actually solve two issues at once: we’ll provide a list of tables, then identify any tables that our listed tables rely on (recursively), and then create and load them in the right order.
In this sample, we’ll use AdventureWorksDW2014 as our source and transfer the FactInternetSales table as well as all tables it is connected to through foreign key constraints. We will create all these tables, including the constraints, in a new database, AdventureWorksDW2014_SalesOnly (sorting them so we get no foreign key violations), and eventually populate them with data.
After the first excitement about how easy it actually was to take care of that topology, you might ask yourself: why does it have to run linearly? That takes way too long. And you’re right – it doesn’t have to.
All we need to do is:
– Create a list of all the tables that we’ve already loaded (which will be empty at that point)
– Identify all tables that do not reference any other tables
– Load these tables, each followed by all tables that only reference this single table – recursively – and add them to the list of loaded tables
– Once that is done, load all tables that are referencing multiple tables where all required tables have been loaded before – and again, add them to the list
– Repeat this until no table is left to load (or for a maximum of 10 times in this example)
– If, for whatever reason, any tables are left, load them sequentially using the TopoSort function:
This is a very interesting way of using Biml to traverse the foreign key tree. I’ve normally used recursive CTEs in T-SQL to do the same, but I’ll have to play around with this method.
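To make the approach concrete, the dependency-ordered load is essentially a topological sort over the foreign key graph, with a pass limit and a sequential fallback. Here’s my own rough Python sketch of those steps (the dependency map and load_table function are made-up stand-ins, not the Biml from the post):

```python
# Rough sketch of loading tables in foreign-key dependency order.
# deps maps each table to the set of tables it references; load_table is a stand-in.

def load_table(table):
    print(f"loading {table}")  # placeholder for the real create-and-populate step

deps = {
    "DimDate": set(),
    "DimGeography": set(),
    "DimCustomer": {"DimGeography"},
    "FactInternetSales": {"DimDate", "DimCustomer"},
}

def topo_load(deps):
    loaded = set()                  # tables already loaded (empty at the start)
    remaining = dict(deps)
    for _ in range(10):             # cap the number of passes, as in the post
        ready = [t for t, d in remaining.items() if d <= loaded]
        if not ready:
            break
        for table in ready:         # tables within a pass could be loaded in parallel
            load_table(table)
            loaded.add(table)
            del remaining[table]
    for table in remaining:         # whatever is left falls back to a sequential load
        load_table(table)
        loaded.add(table)
    return loaded

topo_load(deps)
```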
The main things to configure are the DBCPConnection Pool and Maximum-value Columns.
Please choose this to be the date-time stamp column that can serve as a cumulative change-management column.
This is the only limitation with this processor: it is not true CDC and relies on one column. If data is reloaded into the table with older values in that column, it will not be replicated into HDFS or any other destination.
This processor does not rely on transaction logs or redo logs the way Attunity or Oracle GoldenGate do. For a complete CDC solution, please use Attunity or Oracle GoldenGate.
That last paragraph in the snippet is key: it’s not a true replacement for CDC-friendly products. It is, however, a good example for showing how to use NiFi to connect to a relational database and pump data out of it.
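The limitation makes more sense once you see that the processor’s incremental pull is just a high-water mark on one column. A hypothetical Python equivalent (table, column, and connection details are invented) shows why rows that come back with older timestamps never get picked up:

```python
# Hypothetical illustration of max-value-column incremental extraction,
# similar in spirit to NiFi's QueryDatabaseTable. Connection details are placeholders.
import pyodbc

last_max_seen = None  # NiFi keeps this high-water mark in processor state

def pull_new_rows(conn):
    global last_max_seen
    cur = conn.cursor()
    if last_max_seen is None:
        cur.execute("SELECT * FROM dbo.Orders ORDER BY ModifiedDate")
    else:
        # Only rows beyond the stored maximum are fetched; rows reloaded with
        # older ModifiedDate values are silently skipped.
        cur.execute(
            "SELECT * FROM dbo.Orders WHERE ModifiedDate > ? ORDER BY ModifiedDate",
            last_max_seen,
        )
    rows = cur.fetchall()
    if rows:
        last_max_seen = max(row.ModifiedDate for row in rows)
    return rows

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;DATABASE=SalesDb;Trusted_Connection=yes")
for row in pull_new_rows(conn):
    pass  # ship the row to HDFS, Kafka, or wherever it needs to go
```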
Unfortunately, while developing in Data Factory I became very familiar with errors. All of the errors show up at the end and provide very little insight into what part of the process failed. Here’s an example.
Database operation failed on server ‘Sink:DBName01.database.windows.net’ with SQL Error Number ‘40197’. Error message from database execution : The service has encountered an error processing your request. Please try again. Error code 4815. A severe error occurred on the current command. The results, if any, should be discarded.
This sounds like classic Microsoft error messaging: “An error occurred. Here is a code you can put into Google and hope desperately that someone has already figured out the answer. Good luck!”
Processes that publish messages to a Kafka topic are called “producers.” “Topics” are feeds of messages in categories that Kafka maintains. The transactions from the RDBMS will be converted to Kafka topics. For this example, let’s consider a database for a sales team from which transactions are published as Kafka topics. The following steps are required to set up the Kafka producer:
I’d call this a non-trivial but still straightforward exercise. Step 1 from the SQL Server side could be reading from transaction logs (which would be the least-intrusive), but you could also set up something like change tracking and fire off messages when important tables’ records change.
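For a sense of what the producer side looks like, here’s a minimal sketch using the kafka-python package – broker address, topic name, and the change payload are all placeholders:

```python
# Minimal Kafka producer sketch using kafka-python.
# Broker address, topic, and payload are placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Imagine this record came from change tracking (or the transaction log) on the sales database
change = {"table": "Sales.Orders", "op": "INSERT", "order_id": 12345, "amount": 99.90}

producer.send("sales-transactions", value=change)
producer.flush()  # make sure the message actually leaves the client buffer
```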
There are instances where data resides in Azure Blob Storage and the data is needed in a SQL database. For example, if one ran a Machine Learning experiment in Data Factory, the results would be stored in Azure Blob Storage, and for analysis purposes, it may make a lot more sense to move the data to a SQL database. Moving data around in Data Factory means writing JSON. In this example we will be using an Azure SQL DB, but it is not essential that the data be stored in Azure. An on-premises SQL Server could also be used; as long as a gateway is added for the connection, the other steps would be the same. There are five different Data Factory elements required to move data from an Azure blob to a database: a pipeline for the data, a data set containing the definition for the blob, a linked service for the blob, a data set containing a definition for the SQL data, and a linked service to connect to the SQL database.
There’s a lot of JSON ahead.
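To give a feel for the JSON involved, here’s a heavily trimmed sketch of the SQL linked service and the copy pipeline, written as Python dicts so they can be dumped to JSON – all names and connection strings are placeholders, and the real definitions in the post carry more properties:

```python
# Rough shape of the Data Factory (v1) JSON for a blob-to-SQL copy, built as Python dicts.
# Names and connection strings are placeholders, not the post's actual values.
import json

sql_linked_service = {
    "name": "AzureSqlLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:<server>.database.windows.net;Database=<db>;User ID=<user>;Password=<password>"
        },
    },
}

copy_pipeline = {
    "name": "BlobToSqlPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyBlobToSql",
                "type": "Copy",
                "inputs": [{"name": "BlobInputDataset"}],    # data set describing the blob
                "outputs": [{"name": "SqlOutputDataset"}],   # data set describing the target table
                "typeProperties": {
                    "source": {"type": "BlobSource"},
                    "sink": {"type": "SqlSink"},
                },
            }
        ]
    },
}

print(json.dumps(copy_pipeline, indent=2))
```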
The Linked Service for ML is going to need some information from the web service: the URL and the API key. Chances are neither of these has been committed to memory; instead, open up Azure ML, go to Web Services, and copy them. For the URL, look under the API Help Page grid, where there are two options, Request/Response and Batch Execution. Clicking on Batch Execution loads a new page, Batch Execution API Document. The URL can be found under Request URI. When copying the URL, you do not need to include any text after the word “jobs”; leave off the rest of the URL, “?api-version=2.0”, as copying the entire URL will cause an error. Going back to the Web Services page, the API key appears on the dashboard section of Azure ML, and there is a convenient button for copying it. Using these two pieces of information, it is now possible to create the Data Factory Linked Service to make the connection to the web service, which here I called AzureMLLinkedService.
Read the whole thing.
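For reference, the AzureML linked service itself ends up fairly small – roughly the shape below, shown as a Python dict. The endpoint and key are placeholders, and the property names are my recollection of the v1 schema, so check them against the documentation:

```python
# Approximate shape of the AzureML linked service described above.
# Endpoint URL and API key are placeholders copied from the ML web service pages.
import json

azure_ml_linked_service = {
    "name": "AzureMLLinkedService",
    "properties": {
        "type": "AzureML",
        "typeProperties": {
            # Batch Execution request URI, truncated after ".../jobs" as the post describes
            "mlEndpoint": "https://<region>.services.azureml.net/workspaces/<workspace>/services/<service>/jobs",
            "apiKey": "<api-key-from-the-dashboard>",
        },
    },
}

print(json.dumps(azure_ml_linked_service, indent=2))
```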
In this case, the error message is quite clear: there is more than one row in the source (staging) that matches a single row in the target (data warehouse). When we are warehousing data, we set up key fields that allow us to match a record in staging to a record in the data warehouse. In most systems, you can use the source system’s primary key to accomplish this. After all, most systems use an RDBMS of some sort to store data. However, in this case the source data is from a SharePoint list, and the only source key available is a list item ID.
So why are we not using that? There is a very simple answer: end users delete old data from the list, which can lead to SharePoint recycling ID values. If an ID gets recycled, then the data warehouse will improperly overwrite data in the fact table or discard the new row as a duplicate, depending on how we configure the extract routine.
Figuring out the cause of the problem is a multi-step process, as Jesse shows.
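If you want a quick first check against your own staging data, grouping by the match key and looking for counts above one will surface recycled IDs – the table and column names here are invented for illustration:

```python
# Hypothetical check for recycled or duplicated source keys in a staging table.
# Database, table, and column names are made up for illustration.
import pyodbc

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;DATABASE=DW;Trusted_Connection=yes")
cur = conn.cursor()
cur.execute("""
    SELECT ListItemID, COUNT(*) AS MatchCount
    FROM staging.SharePointList
    GROUP BY ListItemID
    HAVING COUNT(*) > 1
""")
for list_item_id, match_count in cur.fetchall():
    print(f"ListItemID {list_item_id} appears {match_count} times in staging")
```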
Because we need a connection to a database and an Azure blob, two Linked Services are required, one for each type. Prior to completing this step, create an Azure Blob Storage account by clicking on Add on All Resources. Create the second Linked Service like the first: click on New data store, then select Azure Storage. Using the template for an Azure Blob Storage linked service, I have modified it below, adding the “hubName” element, as it is required.
There’s a lot of JSON to write here, if you’re into that sort of thing.
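The storage linked service itself is short; once the “hubName” is added it is roughly this shape (shown as a Python dict, with every value a placeholder):

```python
# Approximate Azure Blob Storage linked service with the required "hubName" added.
# Account name, key, and hub name are placeholders.
blob_linked_service = {
    "name": "BlobStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "hubName": "<data-factory-hub-name>",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        },
    },
}
```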