ETL / ELT – Page 7 – Curated SQL

Data Transformation with Dataflows Gen2

Welcome to a journey into the world of data automation! Imagine working in an organization bustling with data scientists and analysts. In such an environment, you often need to gather and combine data from various sources for further analysis. You could do this manually, but why not leverage automation? In this blog, we’ll explore how to apply automation on data transformations using Dataflows Gen2 in Microsoft Fabric.

Admitting that I am not the primary audience for Dataflows Gen2, I’d still much rather write a Spark notebook and call it a day.

Metadata-Driven Spark Clusters in Azure Databricks

Published 2024-12-10 by Kevin Feasel

Matt Collins ties the room together with a bit of metadata:

In this article, we will discuss some options for improving interoperability between Azure Orchestration tools, like Data Factory, and Databricks Spark Compute. By using some simple metadata, we will show how to dynamically configure pipelines with appropriately sized clusters for all your orchestration and transformation needs as part of a data analytics platform.

Click through for an explanation of the challenge, followed by the how-to.
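
To make the idea concrete, here is a minimal sketch of metadata-driven cluster sizing. It is not Matt's implementation: the workload names, the size catalogue, and the helper function are all hypothetical, and the node type and runtime strings are placeholders. The only assumption about Databricks itself is that a cluster specification carries the fields spark_version, node_type_id, and num_workers.

```python
# Hypothetical metadata: each workload is tagged with a T-shirt size.
WORKLOAD_SIZES = {
    "ingest_small_files": "small",
    "transform_sales": "medium",
    "rebuild_history": "large",
}

# Hypothetical size catalogue. Node types and worker counts are
# illustrative placeholders, not sizing recommendations.
CLUSTER_SIZES = {
    "small": {"node_type_id": "Standard_DS3_v2", "num_workers": 1},
    "medium": {"node_type_id": "Standard_DS4_v2", "num_workers": 4},
    "large": {"node_type_id": "Standard_DS5_v2", "num_workers": 8},
}


def build_cluster_spec(workload, spark_version="15.4.x-scala2.12"):
    """Look up a workload's size and return a cluster spec dictionary.

    Unknown workloads fall back to the smallest size so that a missing
    metadata row never provisions an oversized cluster.
    """
    size = WORKLOAD_SIZES.get(workload, "small")
    spec = dict(CLUSTER_SIZES[size])  # copy so the catalogue is not mutated
    spec["spark_version"] = spark_version  # placeholder runtime string
    return spec


if __name__ == "__main__":
    print(build_cluster_spec("transform_sales"))
```

An orchestrator could pass the resulting dictionary as a pipeline parameter, so sizing decisions live in metadata rather than in each pipeline definition.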

Mounting ADF Instances in Microsoft Fabric

Published 2024-12-06 by Kevin Feasel

Koen Verbeeck has an existing Azure Data Factory:

We recently started using Microsoft Fabric for our cloud data platform. However, we already have quite an estate of Azure data services running in our company, including a huge number of Azure Data Factory (ADF) pipelines. It seems cumbersome to migrate all those pipelines to Microsoft Fabric, especially because some features are not supported yet and ADF is the mature choice at the moment. We like the concept of Microsoft Fabric’s centralization, where everything is managed in one platform. Is there an option to manage ADF in Fabric?

Read on for the answer, but make sure to check out its limitations as well.

AWS DMS and a LOB Bug

Published 2024-11-25 by Kevin Feasel

Richard O’Riordan fixes an issue:

The table over in our Postgres cluster is similar except for the data type “text” being used instead of “varchar”. All kind of boring so far, but we noticed that on some very rare occasions the “largevalue” column was empty over in the PostgreSQL database even though it was populated for that row in SQL Server.

This seemed odd to me. If there were an error inserting the row on the PostgreSQL side, you would expect that, since it is all done within a transaction, it would either all succeed or all fail. How could the row be partially inserted, i.e. missing this text value?

Read on for the story and several tests of how things work.

Setting a Default Destination for Fabric Dataflows Gen2

Published 2024-11-21 by Kevin Feasel

Jon Voge wants to spend less time copying and pasting:

Ever had a Dataflow Gen2 in which you needed to map the output of several queries to the same Warehouse or Lakehouse? Takes a while to set up, right?

If you wish to add a Default Destination to your Dataflow, all you need to do is to create the Dataflow from inside your desired destination. This works for Warehouses, Lakehouses, and KQL Databases.

Click through for an example of how it works.

Execute a Collection of Child Pipelines from Metadata in Data Factory

Published 2024-11-12 by Kevin Feasel

Andy Leonard continues a series on design patterns:

In this post, I clone and modify the dynamic parent pipeline from the previous post to retrieve metadata from an Azure SQL database table for several child pipelines, and then call each child pipeline from a parent pipeline.

When we’re done, this pipeline will:

Read pipeline metadata from a table in an Azure SQL database

Store some of the metadata (a collection of pipelineID values) in the (existing) pipelineIdArray variable

Iterate the pipelineIdArray variable’s collection of pipelineID values

Execute each child pipeline represented by each pipelineID value stored in the pipelineIdArray variable

Read on to learn how.
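
The four steps above can be sketched outside of Data Factory as well. The following is a rough illustration only, not Andy's pipeline: an in-memory SQLite table stands in for the Azure SQL database, and the table name pipeline_metadata, its columns, and the execute callback are all hypothetical.

```python
import sqlite3


def make_demo_metadata():
    """Create a stand-in metadata table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pipeline_metadata "
        "(pipelineId TEXT, run_order INTEGER, enabled INTEGER)"
    )
    conn.executemany(
        "INSERT INTO pipeline_metadata VALUES (?, ?, ?)",
        [("child-b", 2, 1), ("child-a", 1, 1), ("child-c", 3, 0)],
    )
    return conn


def load_pipeline_ids(conn):
    """Read enabled pipelineId values, in run order, into a list.

    The list plays the role of the pipelineIdArray variable.
    """
    rows = conn.execute(
        "SELECT pipelineId FROM pipeline_metadata "
        "WHERE enabled = 1 ORDER BY run_order"
    ).fetchall()
    return [row[0] for row in rows]


def run_children(pipeline_ids, execute):
    """Iterate the ids and call the supplied executor once per child."""
    return {pipeline_id: execute(pipeline_id) for pipeline_id in pipeline_ids}


if __name__ == "__main__":
    ids = load_pipeline_ids(make_demo_metadata())
    # The lambda is a placeholder for whatever actually starts a pipeline.
    print(run_children(ids, lambda pipeline_id: f"started {pipeline_id}"))
```

Keeping the executor as a parameter means the iteration logic can be tested without touching a real pipeline service.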

Move Data between Lakehouses and Workspaces in Microsoft Fabric

Published 2024-11-06 by Kevin Feasel

Gilbert Quevauvilliers performs an exfiltration:

With the new schemas in a Lakehouse, it is now possible to read from Lakehouse A (in Workspace A) and write to Lakehouse B (in Workspace B).

Here are more details about the Schema preview: Lakehouse schemas (Preview) – Microsoft Fabric | Microsoft Learn

This opens a whole new world of possibilities.

I also really like the fact that I can simply use the names, and I do not need to get the actual GUIDs!

For example, I can use the following format: WorkspaceName.LakehouseName.SchemaName.TableName

Click through to see it in action.
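
As a small illustration of the four-part naming format, the helper below assembles WorkspaceName.LakehouseName.SchemaName.TableName from its parts. The function name and the sample values are hypothetical, and the backtick quoting is an assumption borrowed from Spark SQL's identifier syntax (useful when a name contains spaces), not something taken from Gilbert's post.

```python
def quote_identifier(name):
    """Wrap one name in backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def qualified_table_name(workspace, lakehouse, schema, table):
    """Build a four-part name: workspace.lakehouse.schema.table."""
    parts = [workspace, lakehouse, schema, table]
    if any(not part for part in parts):
        raise ValueError("all four name parts are required")
    return ".".join(quote_identifier(part) for part in parts)


if __name__ == "__main__":
    # Sample names are placeholders, not real workspaces or lakehouses.
    print(qualified_table_name("Workspace B", "LakehouseB", "dbo", "Sales"))
```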

Dynamically Start a Collection of Child Pipelines in Fabric Data Factory

Published 2024-10-24 by Kevin Feasel

Andy Leonard continues a series on Microsoft Fabric Data Factory:

In this post, I modify the dynamic parent pipeline from the previous post to explore calling several child pipelines from a single parent pipeline. In this post, we will:

Clone the child pipeline (twice)

Copy the cloned child pipeline id values

Clone the dynamic parent pipeline from the previous post

Add and configure a pipeline variable for an array of child pipeline ids

Add and configure a ForEach

Move the “Invoke Pipeline (Preview)” activity

Configure the “ForEach”

Configure the “Invoke Pipeline (Preview)” Activity to Use “ForEach” Items

Test the execution of a dynamic collection of child pipelines

Andy’s got quite a bit in this post, so check it out.

Dynamically Start a Child Pipeline in Fabric Data Factory

Published 2024-10-21 by Kevin Feasel

Andy Leonard continues a series on Fabric Data Factory design patterns:

In an earlier post titled Fabric Data Factory Design Pattern – Basic Parent-Child, I demonstrated one way to build a basic parent-child design pattern in Fabric Data Factory by calling one pipeline (child) from another pipeline (parent). In a later post titled Fabric Data Factory Design Pattern – Parent-Child with Parameters, I modified the parent and child pipelines to demonstrate passing a parameter value from a parent pipeline when calling a child pipeline that contains a parameter.

In this post, I modify a parent pipeline to explore parameterizing which child pipeline will be called by the parent pipeline. In this post, we will:

Copy the child pipeline id

Clone a parent pipeline

Add and configure a pipeline variable for the child pipeline id

Test the dynamic pipeline id

Read on to see how.

Restarting Failed Control Flows in Azure Data Factory

Published 2024-10-15 by Kevin Feasel

Meagan Longoria doesn’t want to repeat good work:

I presented at SQL Saturday Pittsburgh this past weekend about populating your data warehouse with a metadata-driven, pattern-based approach. One of the benefits I mentioned is that it’s easy to employ this pattern for restartability.

For instance, let’s say I am loading data from 30 tables and 5 files into the staging area of my data mart or data warehouse, and one of the table loads fails. I don’t want to reload the other tables I just loaded. I want to load the ones that have not been recently loaded. Or let’s say I have 5 dimensions and 4 facts, and I had a failure loading a fact table. I don’t want to reload my dimensions, and I only want to reload the failed facts. How do we accomplish this?

Read on to learn how.
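
One way to picture the restartability idea is a filter over a load log: anything without a recent successful load gets queued again, and everything else is skipped. This is a hedged sketch, not Meagan's solution. The log structure, the load names, and the 12-hour window are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def loads_to_run(last_success_by_load, now, window=timedelta(hours=12)):
    """Return the loads that have no successful run within the window.

    last_success_by_load maps a load name to the datetime of its last
    successful run, or None if it has never succeeded (or just failed).
    """
    pending = []
    for name, last_success in last_success_by_load.items():
        if last_success is None or now - last_success > window:
            pending.append(name)
    return pending


if __name__ == "__main__":
    now = datetime(2024, 10, 15, 8, 0)
    # Hypothetical log: one dimension loaded recently, one fact failed.
    log = {
        "DimCustomer": datetime(2024, 10, 15, 6, 0),
        "FactSales": None,
    }
    print(loads_to_run(log, now))
```

On a rerun after a failure, only the loads returned by the filter would be executed again, which is the behavior the quoted scenario asks for.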
