ETL / ELT – Page 26 – Curated SQL

Case-Insensitive Power Query Merges

Published 2020-04-28 by Kevin Feasel

Ed Hansberry takes us through a pain point in Power Query:

Power Query is notorious for being case sensitive. Even its language is case sensitive. Often though you get data from users where they are using different cases for the same data. Some never use the shift key, and others CAPSLAP everything.

Click through for your two options, plus a bonus option which might work.

Comments closed

Scheduling SSIS Packages in Azure

Published 2020-04-28 by Kevin Feasel

Magi Naumova takes us through the process of running SSIS in Azure Data Factory, including the scheduling of jobs to run our SSIS packages:

The main purpose of these tools is to force the Lift and Shift approach of migrating and running existing SSIS Packages in Azure. I wouldn’t say that this is the most effective approach of transferring the ETL to Azure, but it could be a good start on a road of a Modern Azure Datawarehouse Architecture. If you have already deployed SSIS packages in Azure SSIS Catalog, then SSMS 18 helps you to put them on schedule very quickly.
Running SSIS Packages in Azure requires provisioning of SSIS Runtime Engine, an Azure Data Factory instance and a SQL Database which hosts the SSIS catalog. Scheduling SSIS Packages in Azure requires creating a data flow pipeline in ADF which has a trigger defined for scheduled execution. While describing all those concepts is far above the scope of this chapter, a short description would be useful.

Read on for a good amount of detail and a demo which walks through the process.

Comments closed

Unioning Datasets in Azure Data Factory

Published 2020-04-24 by Kevin Feasel

Rayis Imayev takes us through the Union Transformation in Mapping Data Flows:

Recently I had a chance to work on a special case of creating a file where the first two rows were static and contained output table metadata (columns names and column types), assuming that further reading process would be able to explicitly identify each column data type and consume new file correctly in a single run.

Read on to see how Rayis solved this with Azure Data Factory.

Comments closed

Using Azure Functions Inside Azure Data Factory

Published 2020-04-20 by Kevin Feasel

Rayis Imayev shows how you can call an Azure Function from inside your Azure Data Factory Pipeline:

Creating a data solution with Azure Data Factory (ADF) may look like a straightforward process: you have incoming datasets, business rules of how to connect and change them and a final destination environment to save this transformed data. Very often your data transformation may require more complex business logic that can only be developed externally (scripts, functions, web-services, databricks notebooks, etc.).

In this blog post, I will try to share my experience of using Azure Functions in my Data Factory workflows: my highs and lows of using them, my victories and struggles to make them work.

This includes a description of the options, a demo function, and additional notes for each technique.

Comments closed

Metadata Integrity Checks in ADF.ProcFwk

Published 2020-04-10 by Kevin Feasel

Paul Andrew has another update to the ADF metadata-driven processing framework:

With this release of the framework I wanted to take the opportunity to harden the database and add some more integrity (intelligence) to the metadata, things that go beyond the existing database PK/FK constraints. After all, this metadata drives everything that Azure Data Factory does/is about to do – so it needs to be correct. These new integrity checks take two main forms:
1. Establishing a minimum set of criteria within the metadata before the core Data Factory processing starts and creates an execution run.
2. Establishing a logical chain of pipeline dependencies across processing stages. Then providing a set of advisory checks for area’s of conflict and/or improvement.
More details on both are included against the actual stored procedure in the database changes section below.
In addition to database hardening, I’ve added a few other bits to the solution, including a PowerShell script for ADF deployments and a Data Studio Notebook to make the developer experience of implementing this code project a little nicer.

Read on to see what’s in version 1.3. Check it out on GitHub as well.

Comments closed

Switching Data Factories in Azure Data Factory

Published 2020-04-08 by Kevin Feasel

Koen Verbeeck takes us through switching the current environment in Azure Data Factory:

When working with Azure Data Factory, it’s possible you have multiple ADF environments. For example, you can have one for dev, one for test and one for production. Unfortunately, the functionality to switch between them is a bit hidden in the user interface.

Read on to see how.

Comments closed

Checking JSON Structure with ADF

Published 2020-04-06 by Kevin Feasel

Rayis Imayev takes us through the solution of a tricky problem in Azure Data Factory:

Within my “ForEach” container I have also placed a Stored Procedure task and set 4 data elements from my incoming data stream as values for corresponding parameters.
However this approach will not work for all my incoming JSON events, it actually failed for the last one, since it didn’t have both “stop_time” and “last_update” data elements.
An easy way to fix this problem is to add missing data elements with empty values for the last event record, however, when we don’t have control over incoming data, we need to adjust our data processing steps.

Read on to see how Rayis solves this problem.

Comments closed

Tying Azure Data Factory to Source Control

Published 2020-04-06 by Kevin Feasel

Eddy Djaja explains why you really want to tie Azure Data Factory to your source control:

Azure Data Factory (ADF) is Microsoft’s ETL or more precise: ELT tool in the cloud. For more information of ADF, Microsoft puts the introduction of ADF in this link: https://docs.microsoft.com/en-us/azure/data-factory/introduction. As some have argued if ADF will replace or complement the “on-premise” SSIS, it is uncertain and only time can tell what will happen in the future.
Unlike SSIS, the authoring of ADF does not use Visual Studio. ADF authoring uses a web browser to create ADF components, such as pipelines, activities, datasets, etc. The simplicity of authoring ADF may confuse the novice developers on how ADF components are saved, stored and published. When logging to ADF for the first time after creating an ADF, the authoring is in the ADF mode. How do we know?

Click through for the explanation and some resources on how to do it.

Comments closed

Auto-Detecting Column Delimiters with Data Factory

Published 2020-04-02 by Kevin Feasel

Mark Kromer shows us a way of dynamically learning what the likely delimiter of a delimited file is:

Processing delimited text files in the data lake is one of the most popular uses of Azure Data Factory (ADF). To define the field delimiter, you set the column delimiter property in an ADF dataset.
The reality of data processing is that delimiter can change often. ADF provides a facility to account for this data drift via parameterization. However, this assumes that you know that the delimiter is changing and what it will change to.
I’m going to briefly describe a sample of how to auto-detect a file delimiter using ADF Data Flows.

Click through for the demo.

Comments closed

Transforming JSON to CSV with Azure Data Factory

Published 2020-03-27 by Kevin Feasel

Rayis Imayev shows how to use the Flatten task with Azure Data Factory wrangling data flows:

Last week I blogged about using Mapping Data Flows to flatten sourcing JSON file into a flat CSV dataset:
Part 1: Transforming JSON to CSV with the help of Flatten task in Azure Data Factory
Today I would like to explore the capabilities of the Wrangling Data Flows in ADF to flatten the very same sourcing JSON dataset.

Click through to see what’s different.

Comments closed

M	T	W	T	F	S	S
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30

Category: ETL / ELT