Category: ETL / ELT

Deactivating Pipeline Activities in Microsoft Fabric

Koen Verbeeck shows us a convenient action you can perform in Microsoft Fabric pipelines:

A while ago I had a little blog post series about cool stuff in Snowflake. I’m doing a similar series now, but this time for Microsoft Fabric. I’m not going to cover the basics of Fabric; hundreds of bloggers have already done that. I’m going to cover little bits & pieces that I find interesting, that are similar to Snowflake features, or that are an improvement over the “regular” SQL Server or related products.

In this blog post I’m highlighting the fact that we can now deactivate activities in a pipeline.

Read on to see how you can do this and what the implications of the action are.

An Overview of Data Lake Operations with Apache NiFi

Lav Kumar gives us a 10,000-foot view:

In the world of data-driven decision-making, ETL (Extract, Transform, Load) processes play a pivotal role. The effective management and transformation of data are essential to ensure that businesses can make informed choices based on accurate and relevant information. Data lakes have emerged as a powerful way to store and analyze massive amounts of data, and Apache NiFi is a robust tool for streamlining ETL processes in a data lake environment.

Read on for a brief primer on NiFi and how some of its capabilities can assist in ETL and ELT processing.

Loading Data from SharePoint Lists into Microsoft Fabric

Stepan Resl loads some data:

In the era of Fabric, it’s worth pointing out our three options for data ingestion:

  • Data Pipelines with Copy Activity
  • Dataflows Gen 2
  • Notebooks

We must compare them to understand what each can offer us from different perspectives. To compare them thoroughly, there are some guardrails we need to set so that each option is tested the same way.

My biggest takeaway from this is: don’t load important business data into SharePoint lists to begin with.
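
If you do need to pull from a list and go the notebook route, here is a rough sketch of the idea (my own illustration, not Stepan’s code). It assumes you already have an Entra access token with rights to the site; the site URL and list name are placeholders:

```python
import requests

# Hypothetical values -- replace with your own site, list, and token acquisition.
site_url = "https://contoso.sharepoint.com/sites/Finance"
list_name = "Inventory"
access_token = "<token acquired via MSAL or a service principal>"

# SharePoint's REST endpoint for list items; ask for lightweight JSON.
endpoint = f"{site_url}/_api/web/lists/getbytitle('{list_name}')/items"
headers = {
    "Authorization": f"Bearer {access_token}",
    "Accept": "application/json;odata=nometadata",
}

rows = []
while endpoint:
    response = requests.get(endpoint, headers=headers)
    response.raise_for_status()
    payload = response.json()
    rows.extend(payload["value"])
    # SharePoint pages results; follow the continuation link if present.
    endpoint = payload.get("odata.nextLink")

# In a Fabric notebook you could now land this in the lakehouse, e.g.:
# spark.createDataFrame(rows).write.mode("overwrite").saveAsTable("sharepoint_inventory")
print(f"Retrieved {len(rows)} list items")
```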

SQL Server Data Import and Export via File

Ed Pollack opens an import-export business:

For the purposes of this article, we will focus solely on the task of moving a data set from one server to another. Topics such as ETL, ELT, data warehousing, data lakes, etc., are important and relevant to data movement, but out of scope for a focused discussion such as this.

Ed touches on why you might want to use files and then shares his recommendations for generating files from SQL Server data as well as importing data from flat files into SQL Server.
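
As one self-contained illustration of the file-based approach (a sketch of my own, not necessarily the tooling Ed recommends), the following Python example exports a table to a delimited file with pyodbc and the csv module, then loads a flat file back in with BULK INSERT. The server, database, table, and file paths are all placeholders:

```python
import csv
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver;DATABASE=Sales;Trusted_Connection=yes;Encrypt=yes;"
)
cursor = conn.cursor()

# Export: stream a table out to a comma-delimited file with a header row.
cursor.execute("SELECT OrderID, CustomerID, OrderDate, TotalDue FROM dbo.Orders;")
with open(r"C:\exports\orders.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow([column[0] for column in cursor.description])
    for row in cursor:
        writer.writerow(row)

# Import: load a flat file into a staging table on the target server.
# BULK INSERT reads the file from the SQL Server machine's file system,
# so the path must be visible to the server, not to this client.
cursor.execute("""
    BULK INSERT dbo.Orders_Staging
    FROM 'C:\\imports\\orders.csv'
    WITH (FORMAT = 'CSV', FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '0x0a');
""")
conn.commit()
conn.close()
```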

A Critique (and Defense) of Generic Programming Languages for ETL/ELT

Teo Lachev doesn’t like general programming languages for ETL and ELT operations:

Someone asked the other day for my opinion about the open-source dbt tool for ETL. I hadn’t heard about it. The next thing I noticed was that Fabric Warehouse added support for it, so I got inspired to take a first look. Seems like an ELT-oriented tool. Good, I’m a big fan of the ELT pattern whose virtues I extolled many times here. But a Python-based tool that requires writing custom code for orchestration in a dev environment, such as Visual Studio Code? Yuck!

My reasoning is simple: complexity. Bespoke ETL/ELT tools like SQL Server Integration Services, Informatica, Azure Data Factory, Airflow, and the like are good when you fit into their primary use cases: moving data from a few data sources into a destination, perhaps with some level of transformation in between.

But here are areas off the top of my head where I’ve seen these tools not work well:

  • Wide scale. In one environment, we had to move contents from a couple thousand databases (with identical schemas) across 50-60 instances of SQL Server into a warehouse, including some facts and dimensions we needed within a minute or two. Even assuming those packages don’t change frequently (not a reasonable assumption), the pains of orchestrating that would be enormous. I don’t think we could have used a metadata-driven approach and foreach loops in our ADF workflows, either, as that would not satisfy the time requirements. There are also resource constraints on the other side: you don’t want to overwhelm the warehouse by trying to process a couple thousand clients’ worth of data all at once, so you’ve got to stagger this work using an orchestration engine with enough smarts to limit concurrent processes (a minimal sketch of that idea follows this list).
  • Limiting copy-paste efforts and drudgery. Going back to SSIS, it sucks having to maintain dozens of packages, especially when there are common components you need to update in each one. I got to be pretty good at Biml, but a) that has its limits, and b) that’s C# development with SSIS packages as an output, so I’m claiming that for the generic programming languages side of the argument.
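
To make that concurrency point concrete, here is a minimal sketch of staggering a large number of per-database loads behind a bounded worker pool. It is purely illustrative: load_one_database is a stand-in for the real extract-and-load work, and the source list is invented.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical list of a couple thousand source databases spread across instances.
sources = [f"Server{i % 60:02d}.Client{i:04d}" for i in range(2000)]

def load_one_database(source: str) -> str:
    # Placeholder for the real work: extract from the source database
    # and land the rows in the warehouse's staging area.
    return f"{source}: loaded"

# The executor caps concurrency, so the warehouse never sees more than
# max_workers simultaneous loads no matter how many sources are queued.
with ThreadPoolExecutor(max_workers=16) as pool:
    futures = {pool.submit(load_one_database, s): s for s in sources}
    for future in as_completed(futures):
        print(future.result())  # re-raises any worker exception here
```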

RCSI and ID-Driven ETL

Michael J. Swart shares a warning:

Yesterday, Kendra Little talked a bit about Lost Updates under RCSI. It’s a minor issue that can pop up after turning on RCSI as the default behavior for the Read Committed isolation level. But she doesn’t want to dissuade you from considering the option and I agree with that advice.

In fact, even though we turned RCSI on years ago, by a bizarre coincidence, we only came across our first RCSI-related issue very recently. But it wasn’t update-related. Instead, it has to do with an ETL process. To explain it better, consider this demo:

Michael has one example solution. I could also see a “windback” run where, instead of starting right at the watermark from the previous ETL run, you start a few hundred rows earlier. That way, you can pick up any stragglers that committed late. It would add some overhead to the ETL task, but given that ETL jobs should be idempotent, it shouldn’t affect the end results.
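
Sketching out that windback idea under a few assumptions (an identity-based watermark, an idempotent upsert on the destination side, and invented server, table, and column names):

```python
import pyodbc

WINDBACK_ROWS = 500  # how far behind the last watermark to restart

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=source;DATABASE=AppDb;Trusted_Connection=yes;Encrypt=yes;"
)
cursor = conn.cursor()

# last_loaded_id would normally come from a control table on the destination.
last_loaded_id = 1_250_000

# Start the extract a few hundred IDs early so rows that committed late
# (and were invisible to the previous run under RCSI) still get picked up.
start_id = max(0, last_loaded_id - WINDBACK_ROWS)
cursor.execute(
    "SELECT OrderId, CustomerId, Amount, ModifiedDate "
    "FROM dbo.Orders WHERE OrderId > ? ORDER BY OrderId;",
    start_id,
)
rows = cursor.fetchall()

# Because the window overlaps the previous run, the load step must be
# idempotent -- e.g. a MERGE / upsert keyed on OrderId on the destination.
print(f"Extracted {len(rows)} rows starting after OrderId {start_id}")
```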

Metadata-Driven Pipelines for Azure Data Factory Loads

Marc Bushong doesn’t want to copy and paste:

Developing ETLs/ELTs can be a complex process when you add in business logic, large amounts of data, and the high volume of table data that needs to be moved from source to target. This is especially true in analytical workloads involving Azure SQL when there is a need to either fully reload a table or incrementally update a table. In order to handle the logic to incrementally update a table or fully reload a table in Azure SQL (or Azure Synapse), we will need to create the following assets:

  • Metadata table in Azure SQL
    • This will contain the configurations needed to load each table end to end
  • Metadata driven pipelines
    • Parent and child pipeline templates that will orchestrate and execute the ETL/ELT end to end
  • Custom SQL logic for incremental processing
    • Dynamic SQL to perform the delete and insert based on criteria the user provides in the metadata table

Read on for the demonstration, which reads from one Azure SQL DB into another.
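
As a loose illustration of the dynamic SQL piece (my own sketch, not Marc’s code; in ADF this logic would typically live in a stored procedure or a pipeline expression), a single row from the metadata table might drive a statement like this. Table and column names are invented:

```python
# One row from a hypothetical metadata table driving a single table load.
config = {
    "source_schema": "stg",
    "source_table": "Sales",
    "target_schema": "dbo",
    "target_table": "Sales",
    "watermark_column": "SaleDate",
    "load_type": "incremental",   # or "full"
}

def build_load_sql(cfg: dict) -> str:
    src = f"[{cfg['source_schema']}].[{cfg['source_table']}]"
    tgt = f"[{cfg['target_schema']}].[{cfg['target_table']}]"
    if cfg["load_type"] == "full":
        # Full reload: empty the target and repopulate it from staging.
        return f"TRUNCATE TABLE {tgt}; INSERT INTO {tgt} SELECT * FROM {src};"
    # Incremental: delete the overlapping slice, then re-insert it from staging.
    wm = f"[{cfg['watermark_column']}]"
    return (
        f"DELETE t FROM {tgt} AS t "
        f"WHERE t.{wm} >= (SELECT MIN(s.{wm}) FROM {src} AS s); "
        f"INSERT INTO {tgt} SELECT * FROM {src};"
    )

print(build_load_sql(config))
```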

Exporting Dynamics 365 Data into Delta Lake via Synapse Link

Jose Mendes performs a data migration:

It’s fair to say there have been some considerable changes in the Azure landscape over recent years.

This blog will show you how to configure Synapse Link to export D365 data in the Delta Lake format – an open-source data and transaction storage file format used in Lakehouse implementations.

Before you start considering using this approach, you will need to ensure you meet the following prerequisites (Microsoft documentation).

Read on for those prerequisites as well as a step-by-step guide on how to do it.
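
Once Synapse Link has landed a D365 table in Delta format, reading it back from a Synapse Spark notebook is short work. A minimal sketch, with a placeholder storage path and columns:

```python
# In a Synapse (or Fabric) Spark notebook, `spark` is already provided and
# Delta support is preconfigured, so no extra setup is needed here.
# The ADLS path is a placeholder for wherever your Synapse Link profile writes.
delta_path = "abfss://dataverse@mystorageaccount.dfs.core.windows.net/deltalake/account"

df = spark.read.format("delta").load(delta_path)

df.select("accountid", "name", "modifiedon").show(10, truncate=False)
```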

Generating Tables from Files in Microsoft Fabric via Notebook

Dennes Torres performs a bit of ELT:

When Microsoft Fabric was born, the only method to convert files to tables was using notebooks. Nowadays we have an easy-to-use UI feature for the conversion.

As I explained in the article about lakehouses and ETL, there are some scenarios where we still need to use notebooks for the conversion. One of these scenarios is when we need table partitioning.

In this blog post, let’s go step by step through how to use notebooks and table partitioning.

Click through to see how it all works.
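
For a flavor of what that looks like, here is a minimal PySpark sketch (not Dennes’s exact code) running in a Fabric notebook, where spark is already available. The file path, table name, and partition column are placeholders:

```python
# Read raw CSV files from the lakehouse Files area (placeholder path).
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("Files/raw/sales/*.csv")
)

# Write a managed Delta table partitioned by year, so queries that filter
# on SaleYear only touch the relevant folders.
(
    df.write
    .mode("overwrite")
    .format("delta")
    .partitionBy("SaleYear")
    .saveAsTable("sales_partitioned")
)
```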
