Press "Enter" to skip to content

Category: ETL / ELT

When to Have Multiple Azure Data Factories

Paul Andrew explains how to become a factory mogul:

The obvious and easy reason for having multiple Data Factory’s could be that you simply want to separate your business processes. Maybe they all have separate data delivery requirements and it just makes management of data flows easier to handle. For example:

– Sales
– Finance
– HR

They could have different data delivery deadlines, they process on different schedules and don’t share any underlying connections.

You may also have multiple projects underway that mean you want to keep teams isolated.

But that’s not the only reason, so click through to learn several other reasons why you might have multiple Azure Data Factory instances running.

Comments closed

Using the SSIS Hadoop Components

Hadi Fadlallah walks us through the HDFS file source and destination components:

To test these components, we will create an SSIS package and add three connection managers:

1. Hadoop Connection Manager: to connect with the Hadoop cluster (check the previous article)
2. OLE DB Connection Manager: to connect to SQL Server instance where AdventureWorks2017 database is stored
3. Flat File Connection Manager: We will use it to export data from HDFS Source:

I wonder if they ever fixed the 4K screen resolution problem (kind of tells you how often I use SSIS anymore…). That was one of the things which made these components unusable for me on any modern screen.

Comments closed

Replicating SQL’s IN Operator with Azure Data Factory

Rayis Imayev shows how we can find values in a group using Azure Data Factory:

However only this use-case for the OR function with 2 condition could be possible:or(equals(variables(‘var1’), ‘A’), equals(variables(‘var1’), ‘B’)) – limit of two conditions

But what if we have an ability to check if a particular element variable/parameter/other ADF object value belongs to a range of values (array of value), similarly to what we can do with the IN operator in SQL language, this would definitely solve our problem and remove the limitation of logical conditions to check.

Click through for the answer.

Comments closed

Scheduling SSIS Packages in Azure

Magi Naumova takes us through the process of running SSIS in Azure Data Factory, including the scheduling of jobs to run our SSIS packages:

The main purpose of these tools is to force the Lift and Shift approach of migrating and running existing SSIS Packages in Azure. I wouldn’t say that this is the most effective approach of transferring the ETL to Azure, but it could be a good start on a road of a Modern Azure Datawarehouse Architecture. If you have already deployed SSIS packages in Azure SSIS Catalog, then SSMS 18 helps you to put them on schedule very quickly.

Running SSIS Packages in Azure requires provisioning of SSIS Runtime Engine, an Azure Data Factory instance and a SQL Database which hosts the SSIS catalog. Scheduling SSIS Packages in Azure requires creating a data flow pipeline in ADF which has a trigger defined for scheduled execution. While describing all those concepts is far above the scope of this chapter, a short description would be useful.

Read on for a good amount of detail and a demo which walks through the process.

Comments closed

Unioning Datasets in Azure Data Factory

Rayis Imayev takes us through the Union Transformation in Mapping Data Flows:

Recently I had a chance to work on a special case of creating a file where the first two rows were static and contained output table metadata (columns names and column types), assuming that further reading process would be able to explicitly identify each column data type and consume new file correctly in a single run.

Read on to see how Rayis solved this with Azure Data Factory.

Comments closed

Using Azure Functions Inside Azure Data Factory

Rayis Imayev shows how you can call an Azure Function from inside your Azure Data Factory Pipeline:

Creating a data solution with Azure Data Factory (ADF) may look like a straightforward process: you have incoming datasets, business rules of how to connect and change them and a final destination environment to save this transformed data. Very often your data transformation may require more complex business logic that can only be developed externally (scripts, functions, web-services, databricks notebooks, etc.).

In this blog post, I will try to share my experience of using Azure Functions in my Data Factory workflows: my highs and lows of using them, my victories and struggles to make them work.

This includes a description of the options, a demo function, and additional notes for each technique.

Comments closed

Metadata Integrity Checks in ADF.ProcFwk

Paul Andrew has another update to the ADF metadata-driven processing framework:

With this release of the framework I wanted to take the opportunity to harden the database and add some more integrity (intelligence) to the metadata, things that go beyond the existing database PK/FK constraints. After all, this metadata drives everything that Azure Data Factory does/is about to do – so it needs to be correct. These new integrity checks take two main forms:

1. Establishing a minimum set of criteria within the metadata before the core Data Factory processing starts and creates an execution run.
2. Establishing a logical chain of pipeline dependencies across processing stages. Then providing a set of advisory checks for area’s of conflict and/or improvement.

More details on both are included against the actual stored procedure in the database changes section below.

In addition to database hardening, I’ve added a few other bits to the solution, including a PowerShell script for ADF deployments and a Data Studio Notebook to make the developer experience of implementing this code project a little nicer.

Read on to see what’s in version 1.3. Check it out on GitHub as well.

Comments closed

Checking JSON Structure with ADF

Rayis Imayev takes us through the solution of a tricky problem in Azure Data Factory:

Within my “ForEach” container I have also placed a Stored Procedure task and set 4 data elements from my incoming data stream as values for corresponding parameters.

However this approach will not work for all my incoming JSON events, it actually failed for the last one, since it didn’t have both “stop_time” and “last_update” data elements.

An easy way to fix this problem is to add missing data elements with empty values for the last event record, however, when we don’t have control over incoming data, we need to adjust our data processing steps.

Read on to see how Rayis solves this problem.

Comments closed