Press "Enter" to skip to content

Category: ETL / ELT

Durable Azure Functions and Azure Data Factory

Rayis Imayev wants to use Azure Functions with Azure Data Factory:

Ok, here is my problem: I have an Azure Data Factory (ADF) workflow that includes an Azure Function call to perform external operations and return an output result, which in turn is used further down my ADF pipeline. My ADF workflow (1) depends on the output result of the Azure Function call; and (2) the execution time of the Azure Function call is another factor to consider: if it hits 230 seconds or more, the ADF Azure Function activity will fail with a time-out error message and my workflow is screwed.

This gave Rayis the impetus to try out durable functions. Read on to see how that worked out.
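For readers who haven't used them, the durable functions pattern splits the work into an HTTP starter that returns immediately and an orchestrator that runs the long job, so ADF can fire the call and poll for status instead of waiting out the 230-second limit. Here is a minimal Python sketch of that shape, condensed into one snippet for brevity; the function and activity names are invented, not taken from Rayis's post, and the function.json bindings are omitted:

    import azure.functions as func
    import azure.durable_functions as df

    # HTTP starter: ADF calls this endpoint; it kicks off the orchestration and
    # returns right away with status-check URLs instead of blocking.
    async def start_orchestration(req: func.HttpRequest, starter: str) -> func.HttpResponse:
        client = df.DurableOrchestrationClient(starter)
        instance_id = await client.start_new("long_running_orchestrator", None, req.get_json())
        return client.create_check_status_response(req, instance_id)

    # Orchestrator: coordinates the long-running work, free of the short HTTP
    # timeout that a plain Azure Function activity hits in ADF.
    def long_running_orchestrator(context: df.DurableOrchestrationContext):
        result = yield context.call_activity("do_external_operation", context.get_input())
        return result

    main = df.Orchestrator.create(long_running_orchestrator)

On the ADF side, the usual arrangement is a Web activity that calls the starter, followed by an Until loop that polls the returned statusQueryGetUri until the orchestration reports completion.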


Metadata-Driven ADF Pipelines for Synapse

Hope Foley is back:

So what’s the hard thing I want to help make easier in this post? Metadata-driven pipelines in Azure Data Factory! I had the opportunity a while back to work with a customer who was pulling data out of large SQL Servers to eventually land data into Azure Synapse SQL pools, back when they were still Azure SQL DW. We created a couple of load-pattern pipelines that used metadata in Azure SQL DB to load Synapse SQL pool tables from parquet files in Azure Data Lake Storage (ADLS) Gen 2.

Not gonna lie, the pipelines weren’t easy for me to learn to set up initially. Big thanks to Catherine for your blog, which was a life preserver in the hardest parts! So I wanted to see if I could automate it in my old friend PowerShell.
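Stripped of the ADF specifics, the core idea is that a control table drives the loads instead of a hand-built pipeline per table. A rough Python sketch of that loop, with invented metadata rows and a simplified Synapse COPY INTO statement:

    # Hypothetical control rows, the kind of thing stored in an Azure SQL DB metadata table
    load_metadata = [
        {"source_path": "raw/sales/orders/*.parquet", "target_table": "dbo.Orders"},
        {"source_path": "raw/sales/customers/*.parquet", "target_table": "dbo.Customers"},
    ]

    def build_copy_statement(entry: dict) -> str:
        """Turn one metadata row into a Synapse COPY INTO statement (simplified)."""
        return (
            f"COPY INTO {entry['target_table']} "
            f"FROM 'https://mylake.dfs.core.windows.net/{entry['source_path']}' "
            f"WITH (FILE_TYPE = 'PARQUET')"
        )

    # Adding a table to the load becomes a metadata insert rather than a pipeline change
    for entry in load_metadata:
        print(build_copy_statement(entry))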

It would also be worth looking at some of the work Paul Andrew has done around ADF.procfwk for another approach to the problem.


Adding Row Numbers to ADF Data Flows

Rayis Imayev shows two methods of generating unique, ascending row numbers in Azure Data Factory data flows:

Adding a row number to your dataset could be a trivial task. Both ANSI and Spark SQL have the row_number() window function that can enrich your data with a unique number for your whole or partitioned data recordset.

Recently I had a case of creating a data flow in Azure Data Factory (ADF) where there was a need to add a row number.
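For reference, this is roughly what the window function Rayis mentions looks like in Spark's Python API. The data flow itself is built visually in ADF; this sketch, with made-up column names, just shows the underlying idea:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, row_number
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()

    # Made-up sample data standing in for whatever the data flow reads
    df = spark.createDataFrame(
        [("Alice", 10), ("Bob", 20), ("Carol", 30)],
        ["name", "amount"],
    )

    # row_number() requires a window; ordering the window gives a deterministic sequence
    w = Window.orderBy(col("name"))
    numbered = df.withColumn("row_num", row_number().over(w))
    numbered.show()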

Read on for a couple attempts which didn’t work, followed by two that do, including an assist from Joseph Edwards.


Recursive Metadata Discovery in Azure Data Factory

Richard Swinbank gives us one method to perform recursive metadata discovery in Azure Data Factory:

Azure Data Factory’s Get Metadata activity returns metadata properties for a specified dataset. In the case of a blob storage or data lake folder, this can include the childItems array – the list of files and folders contained in the required folder. If you want all the files contained at any level of a nested folder subtree, Get Metadata won’t help you – it doesn’t support recursive tree traversal. In this post I try to build an alternative using just ADF.
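To make the gap concrete, here is the shape of the traversal Richard has to rebuild inside ADF, sketched in Python with a hypothetical get_child_items helper standing in for the Get Metadata activity's one-level listing:

    from typing import Callable, Dict, List

    # Hypothetical stand-in for Get Metadata's childItems output: given a folder,
    # return a list of {"name": ..., "type": "File" | "Folder"} entries one level deep.
    ChildLister = Callable[[str], List[Dict[str, str]]]

    def list_files_recursively(folder: str, get_child_items: ChildLister) -> List[str]:
        """Collect every file under a folder tree using only a one-level listing."""
        files: List[str] = []
        for item in get_child_items(folder):
            path = f"{folder.rstrip('/')}/{item['name']}"
            if item["type"] == "Folder":
                # Recurse into subfolders -- the step Get Metadata won't do for you
                files.extend(list_files_recursively(path, get_child_items))
            else:
                files.append(path)
        return files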

But before you get too invested in this technique, please read Richard’s spoiler.


Passing an Array of Arrays as a Parameter in Azure Data Factory

Rayis Imayev has a list for us:

In my previous blog post – Setting default values for Array parameters/variables in Azure Data Factory, I had helped myself to remember that arrays could be passed as parameters to my Azure Data Factory (ADF) pipelines. This time I’m helping myself to remember that an array of other arrays can also exist as ADF pipeline parameters’ values.
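As a quick flavor of what that can look like, here is a hypothetical nested-array parameter value (the names are invented); inside the pipeline, an expression such as @pipeline().parameters.TableBatches[0][1] would then pull out a single inner element:

    import json

    # Hypothetical "array of arrays" value for an ADF pipeline parameter of type Array.
    # Each inner array could describe one unit of work, e.g. [schema, table].
    table_batches = [
        ["dbo", "Customers"],
        ["dbo", "Orders"],
        ["sales", "Invoices"],
    ]

    # This JSON is what you would paste as the parameter's default value in the ADF UI,
    # or send in the "parameters" body when triggering a pipeline run.
    print(json.dumps({"TableBatches": table_batches}, indent=2))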

Read on for the example.


Automation Tips for ETL

Richard Swinbank shares some tips around automating ETL processes:

I spent a good part of my career in the UK National Health Service (NHS). We often talk about data engineering problems in terms of handling large volumes of data, but the challenge in the NHS was frequently complexity – even a small acute hospital might have a dozen or more separate systems, running on various different DBMS backends, with many tables of interest for management or central reporting.

You might need to extract data from hundreds of (mostly small) tables to populate your data warehouse – my challenge was to make this as quick and easy to implement as possible.

Click through for the tips.


Data Lineage and SSIS

Aveek Das has a two-parter. First up is a discussion of data lineage:

In this article, I am going to explain what data lineage in ETL is and how to implement it. In this modern world, where companies are dealing with a humongous amount of data every day, there is also a challenge to efficiently manage and monitor this data. There are systems that generate data every second, and that data is processed into a final reporting or monitoring tool for analysis. In order to process this data, we use a variety of ETL tools, which in turn make the data transformation possible in a managed way.

While being transformed in the ETL pipeline, the data has to go through multiple steps of transformation in order to achieve the final result. For example, when the ETL receives the raw data from the source, there may be operations applied to it like filtering, sorting, merging, or splitting columns. There can also be aggregations or other calculations made on this raw data before it finally moves into a data warehouse or is prepared for reporting. In order to be able to detect what the source of a particular record is, we need to implement something known as data lineage. It is a simple piece of metadata that helps us detect gaps in the data processing pipeline and enables us to fix issues later.
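One common way to make that traceability concrete is to stamp each record with a few lineage columns as it passes through the pipeline. A minimal Python sketch, with illustrative column names rather than any standard:

    import uuid
    from datetime import datetime, timezone

    def tag_with_lineage(rows, source_system: str):
        """Attach simple lineage metadata to each record before loading it downstream."""
        load_id = str(uuid.uuid4())                         # one identifier per batch/run
        loaded_at = datetime.now(timezone.utc).isoformat()  # when this batch was processed
        for row in rows:
            yield {
                **row,
                "lineage_source_system": source_system,
                "lineage_load_id": load_id,
                "lineage_loaded_at": loaded_at,
            }

    batch = [{"order_id": 1, "amount": 42.0}, {"order_id": 2, "amount": 13.5}]
    for tagged in tag_with_lineage(batch, source_system="webshop_db"):
        print(tagged)

With columns like these in place, a suspect row in the warehouse can be traced back to the batch and source system that produced it.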

Part two covers data lineage with SQL Server Integration Services:

In this article, I am going to discuss SSIS data lineage concepts, which are often used while designing ETL workloads on a data warehouse. Although this article is focused on implementing data lineage using SSIS, the concepts are not confined to SSIS and apply to any ETL tool on the market that moves data from a source to a destination. In my previous article, Understanding Data Lineage in ETL, I have already discussed the general importance of data lineage concepts for any ETL tool. I would definitely suggest you have a look at it if you want to understand in general how data lineage helps to track the source of a single record in the warehouse.

If you’re fairly new to this world, it’s a good introduction to an important topic.


Data Importation and Exportation with dbatools

Mikey Bronowski continues a series on dbatools functionality vis-a-vis SQL Server Management Studio:

SSMS offers to script out lots of SQL Server objects; however, it can be limited in some areas. Using Get-Dba* commands and piping them into Export-DbaScript may add a few more options. For example, SQL Agent jobs:

Click through for just shy of a dozen cmdlets to help you run your data import-export business.


ADF Data Flows and Joins Failing During Debugging

Mark Kromer clears up some issues around debugging in Azure Data Factory:

One of the important features built into ADF is the ability to quickly preview your data while designing your data flows and to execute the finished product against a sampling of data prior to finalizing and operationalizing your pipelines.

However, there are a few fundamentals of working with joins that you should keep in mind, and a few details below are important to understand at design time and while debugging/testing.

The answer makes sense but it would not have been the first thing to come to mind for me.


Transforming JSON to CSV: ADF vs Databricks

Rayis Imayev compares two methods of transforming a JSON-structured data set into a CSV:

There is a well-known and broadly advertised message from Microsoft that Azure Data Factory (ADF) is a code-free environment to help you create your data integration solutions – https://azure.microsoft.com/en-us/resources/videos/microsoft-azure-data-factory-code-free-cloud-data-integration-at-scale/. I agree with and support this approach of using a drag-and-drop visual UI to build and automate data pipelines without writing code. However, I’m also interested to see if I can recreate certain ADF operations by writing code, just out of curiosity.

Rayis includes a link to the Azure Data Factory step-by-step demonstration and then kicks it up a notch with Databricks. Read on to see how the two compare.
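For a sense of what the code-first side of that comparison involves, here is a hedged PySpark sketch of flattening a nested JSON input and writing it out as CSV; the paths and column names are invented, not taken from Rayis's post:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical nested JSON input with an "items" array on each order document
    raw = spark.read.json("/mnt/demo/input/orders.json")

    # Flatten the array so each item becomes its own row
    flattened = (
        raw.select(col("order_id"), explode(col("items")).alias("item"))
           .select("order_id", "item.sku", "item.quantity")
    )

    # Write the flattened result as CSV with a header row
    flattened.coalesce(1).write.mode("overwrite").option("header", True).csv("/mnt/demo/output/orders_csv")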
