Press "Enter" to skip to content

Category: ETL / ELT

Loading Data from On-Premises SQL Server into Microsoft Fabric

Reitse Eskens spends an hour or so:

In my previous blogs, I’ve written about Fabric and all the cool things it can do. Thing is, my load tests were based on files, either CSV or Delta. But in reality, a lot of data comes from an on-premises database server, and you might connect to a SQL Server 2008 instance or maybe even older. Truth be told, I haven’t got an instance in that version around anymore, so I had to use SQL Server 2019, a version I’m encountering more often nowadays.

For this blog, it won’t make much sense to create a humongous database and try to get all the data in. Fabric will cope; the major issue (in my experience) is the internet connection between my local database and the Fabric environment. One thing I’m really curious about is whether Fabric will have the Link capability that was introduced for Synapse Analytics and SQL Server 2022.

There’s no Link capability currently available, so Reitse does the next-best thing and uses Fabric pipelines.

Comments closed

A Complex Example of ADF Pipeline Return Value

Andy Leonard goes beyond the simple example:

In this post, I demonstrate one way to create a child pipeline that returns the SubscriptionId for a data factory. I then call the child pipeline from a parent package.

To build this demonstration, please follow the instructions below.

This is definitely more complicated than Andy’s simple example, but there are plenty of screenshots to take you through the process.

Comments closed

A Simple Example of ADF Pipeline Return Value

Andy Leonard starts easy:

I want to develop an Azure Data Factory (ADF) design pattern for calling focused, unit-of-work, function-y ADF pipelines that perform focused tasks. Some of these “worker pipelines” will need to return values to the calling pipeline.

In this example, I started by reading Mark Kromer‘s (excellent) article titled You can now customize the return value from your pipeline! I then crafted the simple example shown in this post to make sure I understood the principles involved before using pipeline return value (preview) functionality in more robust ADF patterns.

Follow the steps I outline below to build a simple example for an ADF pipeline that returns a value!

Click through to follow those steps.

Comments closed

Data Pipelines and Dataflows in Fabric Data Factory

Reza Rad has two videos and posts for us. First up is a primer on data pipelines in Microsoft Fabric Data Factory:

The Pipeline comes from Azure Data Factory. A Pipeline is a group of activities bundled together into a workflow. For example, a Pipeline can build a process around a Dataflow: you may want to run the Dataflow in a loop until something happens, and with the failure or success of each execution, perform a task such as sending out an email, copying data somewhere, running a stored procedure, etc.

Reza then gets into Dataflows:

Through the years, the Data Transformation engines evolved. In the past, much coding was involved, and the user interface was not the best experience. These days, most actions can be done through pre-built transformations; less coding is needed, and a hardcore developer is not needed for preliminary tasks. This enables citizen data engineers to work with these tools.

Power Query is the data transformation engine of the new generation of Microsoft Data Integration tools and services. Power Query is the data transformation engine used in Power BI. However, Power Query can be used as a standalone cloud-based data transformation service when it is used as Dataflow. Dataflow is the ETL in the cloud offered by Microsoft, which uses the Power Query engine.

Comments closed

Fixing ORA-26086 in Azure Data Factory Pipelines

Emanuele Meazzo fixes a problem:

Turns out, ADF is rightfully trying to insert the rows in bulk, but Oracle doesn’t like it when the sink table has triggers; falling back to row-by-row insertion seems to be too much to ask, so you end up with the error.

Searching on the good ol’ wide web you’ll encounter this solution, which basically tells you to disable bulk load for the whole Oracle connection by setting EnableBulkLoad=0 in the connection string.

That wouldn’t work for me, because I’m bulk loading just fine everywhere else, so either I had to suffer slower performance on all the other inserts by disabling bulk insert, or I had to create an additional and separate Linked Service to Oracle with bulk insert disabled: doable, but adding maintenance overhead when things change (two secrets to manage instead of one).

My solution? Super dumb.

Read on for a copy of the error message and Emanuele’s solution. There are two philosophies with regard to dumb solutions:

  • If a solution is dumb and it works, it isn’t dumb.
  • If a solution is dumb and it works, it’s still a dumb solution and you just got lucky this time.

I’m not quite sure under which category this falls.

Comments closed

Adding Microseconds to ADF Timestamps

Rayis Imayev can’t wait for the next second:

The current addToTime function (https://learn.microsoft.com/en-us/azure/data-factory/control-flow-expression-language-functions#addToTime) in Azure Data Factory (ADF) only supports a specific set of time units ranging from Year to Seconds. Since I needed to increment a timestamp by microseconds, I had to find an alternative solution in ADF. Here are my findings on how to do this using an alternate approach.

Click through for Rayis’s solution to the problem.
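As a rough illustration of the gap (and not Rayis’s ADF solution itself), here is what the operation looks like in plain Python, where timedelta supports microseconds directly; the timestamp values are arbitrary:

```python
from datetime import datetime, timedelta

# ADF's addToTime stops at seconds; in Python the same increment is trivial.
ts = datetime.fromisoformat("2023-06-01T10:15:30.000123")
ts_plus = ts + timedelta(microseconds=250)

print(ts_plus.isoformat())  # 2023-06-01T10:15:30.000373
```

In ADF you have to reconstruct that behavior yourself, which is exactly the kind of workaround Rayis walks through.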

Comments closed

Unrolling Multiple Arrays in Azure Data Factory

Mark Kromer puts us in disarray:

ADF and Synapse data flows have had a Flatten transformation to make it easy to unroll an array as part of your data transformation pipelines. We’ve updated the Flatten transformation to now allow for multiple arrays that can be unrolled in a single transformation step. This will make your ETL jobs much simpler with fewer transformation steps.

Click through for screenshots showing how to use this feature.
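The Flatten transformation lives in the mapping data flow designer, so there is no code to copy from the post, but the idea of unrolling two parallel arrays in one step maps cleanly onto Spark. Here is a hedged PySpark sketch of the same concept; the DataFrame and column names are made up for illustration:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: one row per order, carrying two parallel arrays.
df = spark.createDataFrame(
    [("order-1", ["widget", "gadget"], [2, 5])],
    ["order_id", "items", "quantities"],
)

# Zip the arrays positionally, then explode once to unroll both together,
# producing one output row per element pair.
unrolled = (
    df.withColumn("pair", F.explode(F.arrays_zip("items", "quantities")))
      .select(
          "order_id",
          F.col("pair.items").alias("item"),
          F.col("pair.quantities").alias("quantity"),
      )
)
unrolled.show()
```

The single explode over the zipped array is roughly what the updated Flatten transformation does when it handles multiple arrays in one transformation step.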

Comments closed

Spark ELT in Synapse Notebooks

Liliam Leme performs some data movement:

I often receive various requests from customers while working on FastTrack projects, and I have compiled some examples to help you build your solution on top of a data lake using useful tips. Most of the examples in this post use pandas, and I hope they will be helpful for you as they were for me.

Please note that all examples in this post use pyspark.

In my scenario, I exported multiple tables from SQLDB to a folder using a notebook and ran the requests in parallel.

Read on for the examples and some of the things you can do with Spark notebooks in Azure Synapse Analytics.
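Liliam’s post has the full examples; as a minimal sketch of the parallel-export scenario she describes (multiple SQL DB tables landed in a lake folder from a notebook), something along these lines works in a Synapse PySpark notebook. The connection details, table list, and storage path below are all placeholders, not code from the post:

```python
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder connection details -- substitute your own server, database, and credentials.
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<database>"
jdbc_props = {
    "user": "<user>",
    "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}
tables = ["dbo.Customers", "dbo.Orders", "dbo.OrderLines"]
lake_root = "abfss://<container>@<storageaccount>.dfs.core.windows.net/export"

def export_table(table: str) -> None:
    # Read one table over JDBC and land it in the data lake as Parquet.
    df = spark.read.jdbc(url=jdbc_url, table=table, properties=jdbc_props)
    df.write.mode("overwrite").parquet(f"{lake_root}/{table.replace('.', '_')}")

# Kick off the exports from a thread pool so the Spark jobs run concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(export_table, tables))
```

Whether you parallelize with threads, notebook utilities, or separate pipeline activities is a design choice; the point is that each table export is an independent read-then-write job.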

Comments closed

Automating Self-Hosted Integration Runtime Deployment

Jonathan D’Aloia doesn’t want to click next-next-next:

Welcome to my blog on how to fully automate the deployment of a Self-Hosted Integration Runtime using Terraform!

The title of this blog is very much self-explanatory, but I hope you find the contents useful and are able to apply this to your own projects in some way.

Click through for a brief overview of self-hosted integration runtimes, the process to follow, and a link to the repo.

Comments closed

Data Pipelines and Data Mesh

Jean-Georges Perrin answers a burning question:

I keep having questions about data pipelines. Data pipelines in Data Mesh is a topic I should tackle. So… Is the data pipeline the root of all evil?

Jean-Georges’s answer is quite in line with one of my favorite phrases: “Short answer: no, with an ‘if’; long answer: yes, with a ‘but.'” Read on for some thoughts on data pipelines and what the data mesh concept does to minimize harm.

Comments closed