Press "Enter" to skip to content

Category: ETL / ELT

Updates to Change Data Capture in ADF

Chen Hirsh looks at some updates:

A few months ago I wrote a post about the new feature of change data capture (CDC) on Azure data factory (ADF) – https://www.madeiradata.com/post/the-wind-of-change-change-data-capture-in-data-factory

Change data capture, as the name suggests, gets the data changes on one system, and replicates them to another. Since this is a task that data engineers do a lot, this was a very welcome addition to ADF.

In this post, we’ll explore what is new on this front.

Click through for what’s new, though do be cognizant of which items are in GA and which are still in preview.

Comments closed

Using the Azure Data Factory Self-Hosted Integration Runtime

Chen Hirsh hosts a runtime:

In Azure data factory (ADF), An integration runtime is a compute resource to run your pipelines on. When you run an application on your computer, it uses the computer resources, such as CPU and memory, to run its tasks. When you run activities in a pipeline in ADF, they also need resources to do their job, like copying data or writing a file, and these are provided by the integration runtime.

When you create an instance of ADF, you get a default integration runtime, hosted in the same region that you created ADF in. If you need, you can add your own integration runtimes, either on Azure, or you can download and install a self-hosted integration runtime (SHIR) on your own server.

Read on to understand when you would want to use a self-hosted integration runtime and the process to do so. This SHIR also applies to Synapse pipelines and is one of the few ways to move data out of a Synapse workspace with data exfiltration protection enabled.

Comments closed

Fabric Data Integration

Teo Lachev reviews the primary methods for data ingestion in Microsoft Fabric:

Fabric supports three options for automated data integration: Data Pipeline (Azure Data Factory pipeline), Dataflow Gen2 (Power BI dataflow), and Notebook (Spark). I summarize these three options in the following table, which loosely resembles the Microsoft comparison table but with my take on it.

Read on for Teo’s thoughts on the matter.

Comments closed

Listing Available Properties in Azure Data Factory

Andy Leonard builds a list:

Did you know Azure Data Factory (ADF) will actually list available properties? It will. One of the things I cover in my ADF training titled Master the Fundamentals of Azure Data Factory is this handy troubleshooting tip.

Read on to see how, though I’d personally like something which is a bit faster than waiting for the thing to execute and getting back what my choices are.

Comments closed

Loading Data from On-Premises SQL Server into Microsoft Fabric

Reitse Eskens spends an hour or so:

In my previous blogs, I’ve written about Fabric and all the cool things it can do. Thing is, my load tests were based on files. Either CSV or Delta. But in reality, a lot of data comes from an on-premises database server. In reality, you might connect to a SQL 2008 instance or maybe even older. Truth be told, I haven’t got an instance in that version/edition around anymore. So I had to use SQL Server 2019, a version I’m encountering more often nowadays.

For this blog, it won’t make much sense to create a humongous database and try to get all the data in. Fabric will cope, the major issue (in my experience) is the internet connection between my local database and the Fabric environment. One thing I’m really curious about is if Fabric will have the Link capability that was introduced for Synapse Analytics and SQL Server 2022.

There’s no Link capability currently available, so Reitse does the next-best thing and uses Fabric pipelines.

Comments closed

A Complex Example of ADF Pipeline Return Value

Andy Leonard goes beyond the simple example:

In this post, I demonstrate one way to create a child pipeline that returns the SubscriptionId for a data factory. I then call the child pipeline from a parent package.

To build this demonstration, please follow the instructions that follow.

This is definitely more complicated than Andy’s simple example, but there are plenty of screenshots to take you through the process.

Comments closed

A Simple Example of ADF Pipeline Return Value

Andy Leonard starts easy:

I want to develop an Azure Data Factory (ADF) design pattern for calling focused, unit-of-work, function-y ADF pipelines that perform focused tasks. Some of these “worker pipelines” will need to return values to the calling pipeline.

In this example, I started by reading Mark Kromer‘s (excellent) article titled You can now customize the return value from your pipeline! I then crafted the simple example shown in this post to make sure I understood the principles involved before using pipeline return value (preview) functionality in more robust ADF patterns.

Follow the steps I outline below to build a simple example for an ADF pipeline that returns a value!

Click through to follow those steps.

Comments closed

Data Pipelines and Dataflows in Fabric Data Factory

Reza Rad has two videos and posts for us. First up is a primer on data pipelines in Microsoft Fabric Data Factory:

The Pipeline comes from Azure Data Factory. A Pipeline is a group of activities bundled together into a workflow. For example, a Pipeline can generate a process around the Dataflow. For example, you may want to run a Dataflow in a loop until something happens, and with the failure or success of each execution, you want to perform a task such as sending out an email, copying data somewhere, running a stored procedure, etc.

Reza then gets into Dataflows:

Through the years, the Data Transformation engines evolved. In the past, much coding was involved, and the user interface was not the best experience. These days, most actions can be done through pre-built transformations; less coding is needed, and a hardcore developer is not needed for preliminary tasks. This enables citizen data engineers to work with these tools.

Power Query is the data transformation engine of the new generation of Microsoft Data Integration tools and services. Power Query is the data transformation engine used in Power BI. However, Power Query can be used as a standalone cloud-based data transformation service when it is used as Dataflow. Dataflow is the ETL in the cloud offered by Microsoft, which uses the Power Query engine.

Comments closed

Fixing ORA-26086 in Azure Data Factory Pipelines

Emanuele Meazzo fixes a problem:

Turns out, ADF is rightfully trying to insert the rows in bulk, but Oracle doesn’t like it when the sink table has triggers, falling back to row by row insertion seems it’s too much to ask, so you end up with the error.
Searching on the good ‘ol wide web you’ll encounter this solution, that basically tells you to disable bulkload for the whole Oracle connection by setting EnableBulkLoad=0 in the connection string.
That wouldn’t work for me, because I’m bulk loading just fine everywhere else, so either I had to suffer slower performance on all the other inserts by disabling the bulk insert, or I had to create an additional and separate Linked Service to Oracle with the bulk insert disabled: doable but adding maintenance overhead when things change (two secrets to manage instead of one).

My solution? Super dumb.

Read on for a copy of the error message and Emanuele’s solution. There are two philosophies with regard to dumb solutions:

  • If a solution is dumb and it works, it isn’t dumb.
  • If a solution is dumb and it works, it’s still a dumb solution and you just got lucky this time.

I’m not quite sure under which category this falls.

Comments closed