Press "Enter" to skip to content

Category: ETL

The ETLT Pattern

Abe Dearmer bridges the gap:

Because ETL and ELT present different strengths and weaknesses, many organizations are using a hybrid “ETLT” approach to get the best of both worlds. In this guide, we’ll help you understand the “why, what, and how” of ETLT, so you can determine if it’s right for your use-case. 

The idea can certainly be useful.

Comments closed

Check if an ADF Pipeline is Already Running

Paul Andrew has a scenario for us:

Scenario: I want to trigger a Data Factory pipeline, but when I do I want the pipeline to know if it’s already running. If it is already running, stop the new run.

Sounds simple enough right?

Wrong!

But, now simple for you, because I’ve done it for you, yay! 🙂

I thought it was simple, but it wasn’t simple, but now it’s simple, but is it really simple? Click through to find out.

Comments closed

Auto-Checking Azure Data Factory Setup

Paul Andrew is at it again:

Building on the work done and detailed in my previous blog post (Best Practices for Implementing Azure Data Factory) I was tasked by my delightful boss to turn this content into a simple check list of what/why that others could use…. I slightly reluctantly did so. However, I wanted to do something better than simply transcribe the previous blog post into a check list. I therefore decided to breakout the Shell of Power and attempt to automate said check list.

Sure, a check list could be picked up and used by anyone – with answers manually provided by the person doing the inspection of a given ADF resource. But what if there was a way to have the results given to you a plate and inferring things that aren’t always easy to spot via the Data Factory UI.

Paul uses an ARM template rather than hitting your Data Factory directly, so there’s a little bit more work for you the user, but Paul explains why it’s both necessary and proper.

Comments closed

Automating Hadoop Workflows with Spark and Oozie

Prashanth Jayaram walks us through automating a sample data transfer with tools like Sqoop, Spark, and Oozie:

In the process of building a data product one would end-up applying many resource-intensive analytical operations on a medium to large data-set in an efficient way. Apache Spark is the bet in this scenario to perform faster job execution by caching data in memory and enabling parallelism in a distributed data environments.

Components involved in Spark implementation:

1. Initialize spark session using scala program
2. Ingest data from data lake through hive queries
3. Apply business logic using scala constructs or hive queries
4. Load data into HDFS or Hive targets
5. Execute spark programs through spark submit

Read on for a sample flow.

Comments closed

Migrating SSIS to Azure Data Factory

Koen Verbeeck has some articles for us:

For quite some time now, there’s been the possibility to lift-and-shift your on-premises SSIS project to Azure Data Factory. There, they run in an Integration Runtime, a cluster of virtual machines that will execute your SSIS packages. In the beginning, you only had the option to use the project deployment model and host your SSIS catalog in either an Azure SQL DB, or in a SQL Server Managed Instance.

But over time, features were added and now the package deployment model has been supported for quite some time as well. Even more, the “legacy SSIS package store” is also supported. For those who still remember this, it’s the SSIS service where you can log into with SSMS and see which packages are stored in the service (either the file system or the MSDB database) and which are currently running.

Read on for much more detail on the topic.

Comments closed

Durable Azure Functions and Azure Data Factory

Rayis Imayev wants to use Azure Functions with Azure Data Factory:

Ok, here is my problem: I have an Azure Data Factory (ADF) workflow that includes an Azure Function call to perform external operations and returns output result, which in return is used further down my ADF pipeline. My ADF workflow (1) depends on the output result of the Azure Function call; (2) plus a time efficiency of the Azure Function call is another factor to consider, if its time execution hits 230 seconds or more, ADF Azure Function will fail with a time-out error message and my workflow is screwed.

This gave Rayis the impetus to try out durable functions. Read on to see how that worked out.

Comments closed

Metadata-Driven ADF Pipelines for Synapse

Hope Foley is back:

So what’s the hard thing I want to help make easier in this post?  Metadata driven pipelines in Azure Data Factory!  I had the opportunity awhile back to work with a customer who was pulling data out of large SQL Servers to eventually land data into Azure Synapse SQL pools back when they were still Azure SQL DW.  We created a couple load pattern pipelines that used metadata in Azure SQL DB to load Synapse sql pool tables from parquet files in Azure Data Lake Storage (ADLS) Gen 2. 

Not gonna lie, the pipelines weren’t easy for me to learn to setup initially.  Big thanks to Catherine for your blog which was a life preserver in the hardest parts!  So I wanted to see if I could automate it in my old friend PowerShell. 

It would also be worth looking at some of the work Paul Andrew has done around ADF.procfwk for another approach to the problem.

Comments closed

Adding Row Numbers to ADF Data Flows

Rayis Imayev shows two methods of generating unique, ascending row numbers in Azure Data Factory data flows:

Adding a row number to your dataset could a trivial task. Both ANSI and Spark SQL have the row_number() window function that can enrich your data with a unique number whole for your whole or partitioned data recordset. 

Recently I had a case of creating a data flow in Azure Data Factory (ADF) where there was a need to add a row number.

Read on for a couple attempts which didn’t work, followed by two that do, including an assist from Joseph Edwards.

Comments closed

Recursive Metadata Discovery in Azure Data Factory

Richard Swinbank gives us one method to perform recursive metadata discovery in Azure Data Factory:

Azure Data Factory’s Get Metadata activity returns metadata properties for a specified dataset. In the case of a blob storage or data lake folder, this can include childItems array – the list of files and folders contained in the required folder. If you want all the files contained at any level of a nested a folder subtree, Get Metadata won’t help you – it doesn’t support recursive tree traversal. In this post I try to build an alternative using just ADF.

But before you get too invested in this technique, please read Richard’s spoiler.

Comments closed