Press "Enter" to skip to content

Category: ETL / ELT

RCSI and ID-Driven ETL

Michael J. Swart shares a warning:

Yesterday, Kendra Little talked a bit about Lost Updates under RCSI. It’s a minor issue that can pop up after turning on RCSI as the default behavior for the Read Committed isolation level. But she doesn’t want to dissuade you from considering the option and I agree with that advice.

In fact, even though we turned RCSI on years ago, by a bizarre coincidence, we only came across our first RCSI-related issue very recently. But it wasn’t update-related. Instead, it has to do with an ETL process. To explain it better, consider this demo:

Michael has one example solution. I could also see a “windback” run, where, instead of starting at the very end of the line for ETL, you start a few hundred rows earlier. That way, you can pick up any stragglers. It would add some overhead to the ETL task, but given that ETL jobs should be idempotent, it shouldn’t affect the end results.
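
To make that concrete, here is a minimal Python sketch of the windback idea, assuming a hypothetical dbo.Orders source keyed by an ever-increasing OrderId and an etl.Watermark table that records the last ID loaded; none of these names come from Michael’s demo.

import pyodbc

WINDBACK_ROWS = 500  # re-read this many IDs behind the stored watermark

def extract_new_rows(conn_str: str):
    with pyodbc.connect(conn_str) as conn:
        cur = conn.cursor()

        # Last watermark recorded by the previous ETL run.
        cur.execute("SELECT LastOrderId FROM etl.Watermark WHERE JobName = ?", "Orders")
        last_id = cur.fetchone()[0]

        # Wind the starting point back so any stragglers that committed behind
        # the watermark under RCSI get picked up on this run.
        start_id = max(0, last_id - WINDBACK_ROWS)

        cur.execute(
            "SELECT OrderId, CustomerId, Amount FROM dbo.Orders "
            "WHERE OrderId > ? ORDER BY OrderId",
            start_id,
        )
        return cur.fetchall()

Because the load step is an idempotent upsert keyed on OrderId, re-reading the overlap window adds a little work but does not change the final results.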


Metadata-Driven Pipelines for Azure Data Factory Loads

Marc Bushong doesn’t want to copy and paste:

Developing ETLs/ELTs can be a complex process when you add in business logic, large amounts of data, and the high volume of table data that needs to be moved from source to target. This is especially true in analytical workloads involving Azure SQL when there is a need to either fully reload a table or incrementally update a table. In order to handle the logic to incrementally update a table or fully reload a table in Azure SQL (or Azure Synapse), we will need to create the following assets:

  • Metadata table in Azure SQL
    • This will contain the configurations needed to load each table end to end
  • Metadata driven pipelines
    • Parent and child pipeline templates that will orchestrate and execute the ETL/ELT end to end
  • Custom SQL logic for incremental processing
    • Dynamic SQL to perform the delete and insert based on criteria the user provides in the metadata table

Read on for the demonstration, which reads from one Azure SQL DB into another.
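
To make the third asset a little more concrete, here is a rough Python sketch of generating the delete-and-insert statements for one table from a metadata row; the dictionary keys and the SQL shape are hypothetical stand-ins, not Marc’s actual schema.

# One row from the (hypothetical) metadata table, represented as a dictionary.
meta = {
    "source_schema": "stg",
    "source_table": "Sales",
    "target_schema": "dbo",
    "target_table": "Sales",
    "watermark_column": "ModifiedDate",
    "load_type": "incremental",   # or "full"
}

def build_load_sql(m: dict, watermark_value: str) -> str:
    src = f"[{m['source_schema']}].[{m['source_table']}]"
    tgt = f"[{m['target_schema']}].[{m['target_table']}]"

    if m["load_type"] == "full":
        # Full reload: clear the target and reinsert everything.
        return f"TRUNCATE TABLE {tgt}; INSERT INTO {tgt} SELECT * FROM {src};"

    # Incremental: delete the slice being reloaded, then insert it again.
    # A real pipeline would validate these values rather than trusting them blindly.
    return (
        f"DELETE FROM {tgt} WHERE {m['watermark_column']} >= '{watermark_value}'; "
        f"INSERT INTO {tgt} SELECT * FROM {src} "
        f"WHERE {m['watermark_column']} >= '{watermark_value}';"
    )

print(build_load_sql(meta, "2024-01-01"))

A parent pipeline would then loop over the metadata rows and hand each generated statement to something like a Script or Stored Procedure activity.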


Exporting Dynamics 365 Data into Delta Lake via Synapse Link

Jose Mendes performs a data migration:

It’s fair to say there have been some considerable changes in the Azure landscape over recent years.

This blog will show you how to configure Synapse Link to export D365 data in the Delta Lake format – an open-source data and transaction storage file format used in Lakehouse implementations.

Before you start considering using this approach, you will need to ensure you meet the following prerequisites (Microsoft documentation).

Read on for those prerequisites as well as a step-by-step guide on how to do it.
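
Once the export is running, the Delta output can be read straight from a Synapse Spark notebook; a minimal sketch follows, with the storage path and filter column as placeholders rather than values from Jose’s walkthrough.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder path to one exported D365 table in the linked storage account.
path = "abfss://<container>@<storageaccount>.dfs.core.windows.net/deltalake/account"

df = spark.read.format("delta").load(path)  # Delta carries the schema and transaction log
df.filter("statecode = 0").show(10)         # statecode = 0 marks active rows in many D365 tables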


Generating Tables from Files in Microsoft Fabric via Notebook

Dennes Torres performs a bit of ELT:

When Microsoft Fabric was born, the only method to convert files to tables was using notebooks. Nowadays we have an easy-to-use UI feature for the conversion.

As I explained in the article about lakehouse and ETL, there are some scenarios where we still need to use notebooks for the conversion. One of these scenarios is when we need table partitioning.

In this blog post, let’s walk step by step through how to use notebooks and table partitioning.

Click through to see how it all works.
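
For a flavor of the notebook route, here is a minimal PySpark sketch that turns a file into a partitioned Delta table; the file path, table name, and partition column are illustrative only.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # in a Fabric notebook, spark is already provided

# Read a file from the attached lakehouse's Files area.
df = spark.read.csv("Files/sales/sales.csv", header=True, inferSchema=True)

# Write it back as a Delta table, partitioned by a column of our choosing.
(df.write
   .format("delta")
   .mode("overwrite")
   .partitionBy("SaleYear")
   .saveAsTable("sales_partitioned"))

Partitioning is exactly the sort of option the UI conversion does not cover, which is why the notebook path is still worth knowing.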


Enabling Staging for Microsoft Fabric Dataflows

Chris Webb shares some thoughts:

If you read this post that was published on the Fabric blog back in July, you’ll know that each Power Query query in a Fabric Gen2 dataflow has a property that determines whether its output is staged or not – where “staged” means that the output is written to the (soon-to-be hidden) Lakehouse linked to the dataflow, regardless of whether you have set a destination for the query output to be written to. Turning this on or off can have a big impact on your refresh times, making them a lot faster or a lot slower.

Chris shares a simple example of when staging might not be reasonable. This is going to be the less common scenario, however.


Storing Log Analytics Data in the Microsoft Fabric Lakehouse

Gilbert Quevauvilliers needs a place to store this data:

Following on in my series, in this blog post I am going to use a Dataflow Gen2 in Microsoft Fabric to load the data into a lakehouse table.

Doing this allows me to store the data in a Delta Lake table.

In this series I am going to show you all the steps I did to have the successful outcome I had with my client.

Click through for links to the first two parts of the series, as well as a step-by-step guide for part 3.


A Brief Overview of 21 ETL Tools in Python

Adron Hall makes a list:

Here are summaries of each of the tools you’ve mentioned along with examples of how to implement the ETL (Extract, Transform, Load) process using each tool within a Python workflow:

  1. Apache Spark: Apache Spark is a powerful open-source cluster-computing framework that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It’s commonly used for processing large-scale data and running complex ETL pipelines. Example Implementation:

Read on for summaries and samples for each of the 21 options.
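
For a sense of what those samples look like, here is a short PySpark extract-transform-load sketch in the spirit of the first entry; the paths and column names are invented for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("simple-etl").getOrCreate()

# Extract: read the raw source file.
raw = spark.read.csv("/data/raw/orders.csv", header=True, inferSchema=True)

# Transform: drop incomplete rows and aggregate to one row per day.
daily = (raw.dropna(subset=["order_id"])
            .withColumn("order_date", F.to_date("order_ts"))
            .groupBy("order_date")
            .agg(F.sum("amount").alias("daily_amount")))

# Load: write the result to the curated zone as Parquet.
daily.write.mode("overwrite").parquet("/data/curated/daily_orders")

spark.stop()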


Storing Log Analytics Queries in Azure Blob Storage

Gilbert Quevauvilliers wants some long-term storage:

Following on in my series, in this blog post I am going to demonstrate how to store Log Analytics Queries in Blob Storage.

This allows me to store the Power BI queries outside of Log Analytics and gives me an easy way to get the data into my Fabric Lakehouse in later steps. To do this I am going to use a Logic App in Azure.

In this series I am going to show you all the steps I did to have the successful outcome I had with my client.

Read on to see what Gilbert used for the task.
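
Gilbert does this with a Logic App; purely as an illustration of the same idea in code, here is a Python sketch using the azure-monitor-query and azure-storage-blob SDKs, with the workspace ID, KQL query, container, and connection string all placeholders.

import json
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient
from azure.storage.blob import BlobServiceClient

WORKSPACE_ID = "<log-analytics-workspace-id>"
KQL = "PowerBIDatasetsWorkspace | take 100"   # placeholder Power BI log query

# Run the query against the Log Analytics workspace.
logs = LogsQueryClient(DefaultAzureCredential())
response = logs.query_workspace(WORKSPACE_ID, KQL, timespan=timedelta(days=1))

# Flatten the first result table into JSON.
table = response.tables[0]
payload = json.dumps([[str(value) for value in row] for row in table.rows])

# Land the result in blob storage for the lakehouse to pick up in a later step.
blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = blob_service.get_blob_client(container="loganalytics", blob="queries/latest.json")
blob.upload_blob(payload, overwrite=True)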


Lessons Learned from Azure Data Factory Integrating with DB/2 on Mainframe

Teo Lachev shares some thoughts:

I’ve done a few BI integration projects extracting data from ERPs running on IBM Db2. Most of the implementations would use a hybrid architecture where the ERP would be running on an on-prem mainframe while the data was loaded in Microsoft Azure. Here are a few tips if you’re facing this challenge:

Click through for five major points. Surprisingly, one of them isn’t “Avoid DB/2 like the plague.”


Diving into the Microsoft Fabric Copy Activity

Reza Rad does more than copies:

Copy Activity is one of the most commonly used activities in Microsoft Fabric’s Data Factory pipelines. The Copy Activity copies data from a source to a destination. However, there is more to it than a simple copy. In this article, you will learn what Copy Activity is, its rationale, how it works, and its configuration options.

Reza has a video, as well as a demo-heavy full-length article on the topic.
