ETL / ELT – Page 4 – Curated SQL

Three Incremental Load Patterns with Azure Data Factory

Published 2025-01-22 by Kevin Feasel

This article is divided into three major sections, each showing different capabilities and use cases for performing incremental loads with Azure Data Factory. The same process can also be done in an Azure Synapse pipeline or a Fabric pipeline.

The document contains the following:

Section 1: Copy Data Based on Last Modified Date or Latest File

Section 2: Incremental Copy Using Dataflow

Section 3: Incremental Copy Using Lookup and Stored Procedure Activities

Click through for each of these three patterns, with plenty of screenshots and step-by-step instructions.
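
As a rough illustration of what an incremental load does, here is a minimal Python sketch of the high-water-mark idea: copy only the rows modified since the last run, then advance the stored watermark. It is not taken from the linked article and does not use Azure Data Factory. The `orders` table, its columns, and the `incremental_copy` helper are hypothetical, and in-memory SQLite stands in for the real source and sink.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical `orders` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT, modified_at TEXT)"
    )
    return conn


def incremental_copy(source, target, watermark):
    """Copy rows modified after `watermark`; return the new watermark."""
    rows = source.execute(
        "SELECT id, payload, modified_at FROM orders "
        "WHERE modified_at > ? ORDER BY modified_at",
        (watermark,),
    ).fetchall()
    # Upsert so that a re-modified row replaces its earlier copy.
    target.executemany(
        "INSERT OR REPLACE INTO orders (id, payload, modified_at) VALUES (?, ?, ?)",
        rows,
    )
    target.commit()
    # Advance the watermark only as far as the data actually copied.
    return rows[-1][2] if rows else watermark


source, target = make_db(), make_db()
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "a", "2025-01-01T00:00:00"), (2, "b", "2025-01-02T00:00:00")],
)
source.commit()

# First run: everything is newer than the initial watermark.
wm = incremental_copy(source, target, "1900-01-01T00:00:00")

# Later, one row changes and one row is added at the source.
source.execute(
    "UPDATE orders SET payload = 'b2', modified_at = '2025-01-03T00:00:00' WHERE id = 2"
)
source.execute("INSERT INTO orders VALUES (3, 'c', '2025-01-04T00:00:00')")
source.commit()

# Second run: only the two changed rows are read and copied.
wm2 = incremental_copy(source, target, wm)
```

ISO 8601 timestamps stored as text compare correctly as strings, which is why the sketch can use a plain `>` on `modified_at`. A real pipeline would also persist the watermark somewhere durable between runs.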

Methods to Copy On-Premises SQL Server Data into Microsoft Fabric

Published 2025-01-09 by Kevin Feasel

Gilbert Quevauvilliers runs a test:

In this blog post I am going to determine which item workload uses the least amount of Capacity Units when copying the same data from an On-Premises SQL Server.

The item workloads that I can use to copy data are Dataflow Gen1, Dataflow Gen2 and Pipelines.

Read on for the results, as well as one caveat about them.

Fabric Benchmarking: Moving CSV Files

Published 2024-12-17 by Kevin Feasel

Eugene Meidinger breaks out the abacus:

First, a disclaimer: I am not a data engineer, and I have never worked with Fabric in a professional capacity. With the announcement of Fabric SQL DBs, there’s been some discussion on whether they are better for Power BI import than Lakehouses. I was hoping to do some tests, but along the way I ended up on an extensive Yak Shaving expedition.

I have likely done some of these tests inefficiently. I have posted as much detail and source code as I can and if there is a better way for any of these, I’m happy to redo the tests and update the results.

Part one focuses on loading CSV files to the files portion of a lakehouse. Future benchmarks look at CSV to delta and PBI imports.

I think Eugene did a fine job documenting everything in the process, and it was interesting to see relative price differences between different techniques for uploading a very large CSV file.

The Showdown: Spark vs DuckDB vs Polars in Microsoft Fabric

Published 2024-12-13 by Kevin Feasel

Miles Cole puts together a benchmark:

There’s been a lot of excitement lately about single-machine compute engines like DuckDB and Polars. With the recent release of pure Python Notebooks in Microsoft Fabric, the excitement about these lightweight native engines has risen to a new high. Out with Spark and in with the new and cool animal-themed engines: is it time to finally migrate your small and medium workloads off of Spark?

Before writing this blog post, honestly, I couldn’t have answered with anything besides a gut feeling largely based on having a confirmation bias towards Spark. With recent folks in the community posting their own benchmarks highlighting the power of these lightweight engines, I felt it was finally time to pull up my sleeves and explore whether or not I should abandon everything I know and become a DuckDB and/or Polars convert.

Read on for the method and results from several thoughtful tests.

Ways to Land Data into Microsoft Fabric OneLake

Published 2024-12-13 by Kevin Feasel

James Serra puts on a cape and takes on an iconic laugh:

Microsoft Fabric is rapidly gaining popularity as a unified data platform, leveraging OneLake as its central data storage hub for all Fabric-integrated products. A variety of tools and methods are available for copying data into OneLake, catering to diverse data ingestion needs. Below is an overview of what I believe are the key options:

Read on for a baker’s dozen methods.

A Gentle Introduction to dbt

Published 2024-12-12 by Kevin Feasel

Koen Verbeeck makes a change:

In a typical data warehouse project, data is loaded, transformed and stored into a data store using an ETL (extract – transform – load) process. Recently, there has been a shift to ELT processing, where data is first loaded into a data store (this can be a database, but also a data lake or something similar) and then transformed in-place.

Over the past years, the tool dbt – short for data build tool – has become quite popular in the data engineering world for handling such an ELT process. dbt takes on the role of the “T”, meaning it’s responsible for transforming the data in a certain data store. dbt is not meant for transferring data from one store to another; you’ll need another tool for this, such as Azure Data Factory.

Read on to see how it works.
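
To make the ELT ordering concrete, here is a small Python sketch that lands raw rows first and only then transforms them inside the data store with SQL. It does not use dbt itself. In-memory SQLite stands in for the warehouse, and the `raw_sales` and `daily_sales` tables are hypothetical. The transformation is a single `SELECT` materialized as a table, which is roughly the shape of work a dbt model describes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# "E" and "L": land the source rows untouched in a raw table.
conn.execute("CREATE TABLE raw_sales (sale_date TEXT, amount_text TEXT)")
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?)",
    [
        ("2024-12-01", "10.50"),
        ("2024-12-01", "4.50"),
        ("2024-12-02", "7.00"),
    ],
)

# "T": transform in place, inside the data store, by materializing a SELECT.
conn.execute(
    """
    CREATE TABLE daily_sales AS
    SELECT sale_date,
           SUM(CAST(amount_text AS REAL)) AS total_amount
    FROM raw_sales
    GROUP BY sale_date
    """
)
conn.commit()

daily = conn.execute(
    "SELECT sale_date, total_amount FROM daily_sales ORDER BY sale_date"
).fetchall()
```

The raw table stays available after the transform, so the cleaning logic can be changed and rerun without extracting from the source again.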

Data Transformation with Dataflows Gen2

Published 2024-12-10 by Kevin Feasel

Boniface Muchendu provides an overview of Dataflows Gen2 in Microsoft Fabric:

Welcome to a journey into the world of data automation! Imagine working in an organization bustling with data scientists and analysts. In such an environment, you often need to gather and combine data from various sources for further analysis. You could do this manually, but why not leverage automation? In this blog, we’ll explore how to apply automation on data transformations using Dataflows Gen2 in Microsoft Fabric.

Admitting that I am not the primary audience for Dataflows Gen2, I’d still much rather write a Spark notebook and call it a day.

Metadata-Driven Spark Clusters in Azure Databricks

Published 2024-12-10 by Kevin Feasel

Matt Collins ties the room together with a bit of metadata:

In this article, we will discuss some options for improving interoperability between Azure Orchestration tools, like Data Factory, and Databricks Spark Compute. By using some simple metadata, we will show how to dynamically configure pipelines with appropriately sized clusters for all your orchestration and transformation needs as part of a data analytics platform.

Click through for an explanation of the challenge, followed by the how-to.
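
The general idea of letting metadata pick a cluster size can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the approach from the linked article. The pipeline names, the size tiers, and the `cluster_spec_for` helper are all hypothetical, and the `node_type_id` and `num_workers` keys are only modeled on the fields of a Databricks cluster definition.

```python
# Hypothetical metadata: one entry per pipeline, naming a size tier.
PIPELINE_METADATA = {
    "daily_sales": {"size": "small"},
    "full_history_rebuild": {"size": "large"},
}

# Hypothetical tiers; the VM sizes and worker counts are illustrative only.
CLUSTER_SIZES = {
    "small": {"node_type_id": "Standard_DS3_v2", "num_workers": 2},
    "large": {"node_type_id": "Standard_DS5_v2", "num_workers": 8},
}


def cluster_spec_for(pipeline_name, default_size="small"):
    """Look up a pipeline's size tier and return a copy of its cluster settings."""
    size = PIPELINE_METADATA.get(pipeline_name, {}).get("size", default_size)
    # Return a copy so callers cannot mutate the shared tier definition.
    return dict(CLUSTER_SIZES[size])


small_spec = cluster_spec_for("daily_sales")
large_spec = cluster_spec_for("full_history_rebuild")
fallback_spec = cluster_spec_for("pipeline_not_in_metadata")
```

An orchestrator could pass the returned dictionary along as parameters when it starts the job. Resizing a workload then means editing one metadata entry instead of a pipeline definition.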

Mounting ADF Instances in Microsoft Fabric

Published 2024-12-06 by Kevin Feasel

Koen Verbeeck has an existing Azure Data Factory:

We recently started using Microsoft Fabric for our cloud data platform. However, we already have quite an estate of Azure data services running in our company, including a huge number of Azure Data Factory (ADF) pipelines. It seems cumbersome to migrate all those pipelines to Microsoft Fabric, especially because some features are not supported yet and ADF is the mature choice at the moment. We like the concept of Microsoft Fabric’s centralization, where everything is managed in one platform. Is there an option to manage ADF in Fabric?

Read on for the answer, but make sure to check out its limitations as well.

AWS DMS and a LOB Bug

Published 2024-11-25 by Kevin Feasel

Richard O’Riordan fixes an issue:

The table over in our Postgres cluster is similar except for the data type “text” being used instead of “varchar”. All kind of boring so far, but what we noticed was that on some very rare occasions the “largevalue” column was empty over in the PostgreSQL database even though for that row it was populated in SQL Server.

This seemed odd to me. If there were some error inserting the row on the PostgreSQL side, you would expect that, since it is all done within a transaction, it would either all succeed or all fail. How would the row be partially inserted, i.e., missing this text value?

Read on for the story and several tests of how things work.
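
The all-or-nothing expectation in that quote can be shown with a short Python sketch. It uses in-memory SQLite as a stand-in and says nothing about how AWS DMS, SQL Server, or PostgreSQL behave in the linked post. The table `t` and its `NOT NULL` constraint are hypothetical, chosen only to force a failure partway through a transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, largevalue TEXT NOT NULL)")

try:
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if it raises.
    with conn:
        conn.execute("INSERT INTO t VALUES (1, 'first row')")
        conn.execute("INSERT INTO t VALUES (2, NULL)")  # violates NOT NULL
except sqlite3.IntegrityError:
    pass

# The first insert was rolled back together with the failed one.
row_count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```

A single failed transaction leaves nothing behind. That is why a row that arrives with only some of its columns populated is surprising.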

Category: ETL / ELT