Category: ETL

Migrating From Cosmos DB to SQL Server

Eitan Blumin builds an app:

The general idea is this:

The app executes a Cosmos DB query and collects a number of records into its “buffer”.

Once that “buffer” reaches a certain number of records (configurable), it’s time to “flush” it into the SQL Server. That could be either a database table receiving a Bulk Copy stream, or a stored procedure receiving a table valued parameter (again, configurable).

After the buffer is flushed, we have the option to execute a “merge” procedure. This is a stored procedure that would implement an “upsert” logic from the “staging” table and into the actual destination table.
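
The buffer, flush, and merge steps above can be sketched in a few lines of Python. This is only an illustration of the pattern, not Eitan's implementation: `migrate`, `flush`, `merge`, and `batch_size` are hypothetical names, and the actual Cosmos DB query and SQL Server calls are left to the caller as plain callables.

```python
from typing import Callable, Iterable, List, Optional

Record = dict


def migrate(
    source: Iterable[Record],
    flush: Callable[[List[Record]], None],
    batch_size: int = 1000,
    merge: Optional[Callable[[], None]] = None,
) -> int:
    """Buffer records from `source` and flush them every `batch_size` rows.

    `flush` stands in for the bulk copy or table-valued-parameter call and
    `merge` for the optional upsert procedure; both are assumptions here.
    Returns the total number of records flushed.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    buffer: List[Record] = []
    total = 0

    def drain() -> None:
        nonlocal total
        flush(list(buffer))      # hand a copy of the batch to the sink
        total += len(buffer)
        buffer.clear()
        if merge is not None:
            merge()              # optional upsert from staging to destination

    for record in source:
        buffer.append(record)
        if len(buffer) >= batch_size:
            drain()

    if buffer:                   # flush the final, partial batch
        drain()
    return total
```

Passing the sink and the merge step in as callables keeps the batching logic testable without a database; a real run would wrap the driver calls in those two functions.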

Read on for more explanation and check out Eitan’s GitHub repo.

Deploying ADF via Azure DevOps

Kamil Nowinski has part two of a series about releasing Azure Data Factory code:

Struggling with #ADF deployment? adf_publish branch doesn’t suit your purposes? Don’t have skills with PowerShell? I have good news for you. There is a new tool in the market. It’s a task for Azure DevOps Release Pipeline to deploy whole ADF from code (JSON files) to ADF instance in Azure. Behind the scenes, it runs the PowerShell module which does all job for you.
Sounds unbelievable? But it’s real! Check it out for yourself.

Click through for the video.

ADF.Procfwk Version 1.8

Paul Andrew has been busy:

Following more great feedback from the Data Platform community the primary goal of this release was to further improve the resilience of the framework processing. These improvements included its restart clean up capabilities and introducing better dependency chain handling between Worker pipelines when failures occur. The latter builds on the existing restart functionality first introduced in release v1.2 and supplements the logic using a new set of pipeline dependency metadata. I’ve created the below visual to conceptually show the new dependency chain behaviour, should you wish to populate and make use of the new metadata handling.

Read on for the full changelog.

Publishing Azure Data Factory via Azure DevOps

Kamil Nowinski shares how to deploy Azure Data Factory flows via Azure DevOps:

Struggling with #ADF deployment? adf_publish branch doesn’t suit your purposes? Don’t have skills with PowerShell? I have good news for you. There is a new tool in the market. It’s a task for Azure DevOps Release Pipeline to deploy whole ADF from code (JSON files) to ADF instance in Azure. Behind the scenes, it runs the PowerShell module which does all job for you.
Sounds unbelievable? But it’s real! Check it out for yourself.

Click through for a video.

Custom Parameters in Azure Data Factory Deployments

Rayis Imayev shows us how to use custom parameters in ARM templates when deploying Azure Data Factory pipelines:

If I needed to visually explain how this custom parameterization works for Azure Data Factory resource, I would picture it this way. Before you solely relied on publishing your ADF code from your collaboration Git branch to the adf_publish branch where ARMTemplateForFactory.json and ARMTemplateParametersForFactory.json files live and get further deployed to other environments. You had some flexibility to parameterize your deployment or run some custom code to update ARM templates before they get deployed.

With the introduction of the ADF custom parameterization, you have an additional JSON file arm-template-parameters-definition.json that you can use to define rules to add supplementary parameters to the main ARMTemplateParametersForFactory.json file. There is a very important statement on Microsoft documentation site that explains how this new file operates, “A definition can’t be specific to a resource instance. Any definition applies to all resources of that type”. It’s like a garden rake that will collect all the leaves or none, i.e. if your rule defines a JSON property, let’s say “timeout” of your ForEach loop container, then all timeouts will be scooped into ARM template parameter file.
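
The "garden rake" behaviour can be illustrated with a toy model in Python. This is not how Azure Data Factory generates its templates, and the dictionaries below do not follow the real schema of arm-template-parameters-definition.json; `rake_parameters`, the resource names, and the rule format are all made up to show that a rule keyed on a resource type picks up every matching property of that type.

```python
from typing import Any, Dict, List


def rake_parameters(
    resources: List[Dict[str, Any]],
    rules: Dict[str, List[str]],
) -> Dict[str, Any]:
    """Collect every property named by a type-level rule.

    A rule cannot target one resource instance: it applies to all
    resources of the given type (simplified, hypothetical model).
    """
    parameters: Dict[str, Any] = {}
    for resource in resources:
        for prop in rules.get(resource["type"], []):
            if prop in resource["properties"]:
                key = f"{resource['name']}_{prop}"
                parameters[key] = resource["properties"][prop]
    return parameters


# Hypothetical resources: two pipelines and one dataset, each with a timeout.
resources = [
    {"name": "LoadSales", "type": "pipeline", "properties": {"timeout": "0.01:00:00"}},
    {"name": "LoadStock", "type": "pipeline", "properties": {"timeout": "0.02:00:00"}},
    {"name": "SqlSink", "type": "dataset", "properties": {"timeout": "0.00:10:00"}},
]

# One rule for the "pipeline" type scoops up both pipeline timeouts,
# while the dataset timeout is left alone because no rule covers its type.
params = rake_parameters(resources, {"pipeline": ["timeout"]})
```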

Read on for the full explanation as well as an example.

Calculating Test Coverage of Azure Data Factory Pipelines

Richard Swinbank wraps up a series on testing in Azure Data Factory:

To determine which activities have been executed by a test suite, I need to collect and aggregate activity run data from every pipeline execution triggered from any test fixture. In the previous post I developed components to retrieve and cache activities for a pipeline run – I’ll use those components here to collect data systematically.

I’m going to create a new helper class to contain functions specific to coverage measurement. It’s a subclass of the database helper because I want to exploit functionality from classes further up the hierarchy:

Read on for the code and process for measurement.
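
As a rough idea of the aggregation step Richard describes, here is a minimal Python sketch. It is not his helper class: the `coverage` function and the shape of the run records (a pipeline name plus a list of executed activity names) are assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, Set


def coverage(
    all_activities: Dict[str, Set[str]],
    runs: Iterable[Dict[str, Any]],
) -> Dict[str, float]:
    """Return the fraction of each pipeline's activities seen in any run.

    `all_activities` maps a pipeline name to its defined activity names;
    each run is a dict with "pipeline" and "activities" keys (hypothetical
    shapes, standing in for cached activity run data).
    """
    executed: Dict[str, Set[str]] = {name: set() for name in all_activities}
    for run in runs:
        pipeline = run["pipeline"]
        if pipeline in executed:
            executed[pipeline].update(run["activities"])

    return {
        # A pipeline with no activities has nothing to cover, so report 1.0.
        name: (len(executed[name] & acts) / len(acts) if acts else 1.0)
        for name, acts in all_activities.items()
    }
```

Aggregating across every run matters because a single test rarely exercises all branches; coverage is the union of what the whole suite touched.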

Building Data Pipelines with Apache NiFi

The Hadoop in Real World team takes a look at Apache NiFi:

NiFi is an easy to use tool which prefers configuration over coding.

However, NiFi is not limited to data ingestion only. NiFi can also perform data provenance, data cleaning, schema evolution, data aggregation, transformation, scheduling jobs and many others. We will discuss these in more detail in some other blog very soon with a real world data flow pipeline. 

Hence, we can say NiFi is a highly automated framework used for gathering, transporting, maintaining and aggregating data of various types from various sources to destination in a data flow pipeline.

Click through for an example with instructions. The feeling is pretty close to Informatica or SQL Server Integration Services, so if you’re an old hand at one of those, you’ll get into this pretty easily.

Breaking out of Azure Data Factory ForEach Activities

Andy Leonard is planning a jailbreak:

“What if something fails inside the ForEach activity’s inner activities, Andy?”

That is an excellent question! I’m glad you asked. The answer is: The ForEach activity continues iterating.

I can now hear some of you asking…

“What if I want the ForEach activity to fail when an inner activity fails, Andy?”

Another excellent question, and you’ve found a post discussing one way to “break out” of a ForEach activity’s iteration.
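
The difference between the two behaviours can be shown with a plain Python analogy. This is not Azure Data Factory code and not Andy's technique; `for_each_continue` and `for_each_break` are hypothetical helpers that only contrast "keep iterating after a failure" with "stop at the first failure".

```python
from typing import Callable, Iterable, List, Tuple


def for_each_continue(
    items: Iterable[str], action: Callable[[str], None]
) -> List[Tuple[str, str]]:
    """Run the action for every item, recording failures instead of stopping."""
    failures: List[Tuple[str, str]] = []
    for item in items:
        try:
            action(item)
        except Exception as exc:
            failures.append((item, str(exc)))  # note the failure, keep going
    return failures


def for_each_break(
    items: Iterable[str], action: Callable[[str], None]
) -> List[str]:
    """Stop at the first failing item and return the items that completed."""
    completed: List[str] = []
    for item in items:
        try:
            action(item)
        except Exception:
            break                              # abandon the remaining items
        completed.append(item)
    return completed
```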

Read on for the process.

Securing the Data Prep Area

Tim Mitchell explains why you should limit access to your staging area:

First things first, let’s define what a data prep area is. Data preparation (prep) is a common phase of extract, transform, and load (ETL) operations in which data is temporarily written for cleansing, deduplication, reshaping, or other data modifications. Also sometimes referred to as a landing area or a staging area, this is a common design pattern when moving data from a data store optimized for online transaction processing (OLTP) to a data model more friendly to analytics or reporting.

The data prep area really is a lot like a restaurant kitchen: it’s sometimes chaotic, it’s not consumer friendly, and there is a legitimate risk of consuming half-prepared goods.

Tim lays out why that is, so check it out.
