Press "Enter" to skip to content

Category: ETL / ELT

Building Data Pipelines with Apache NiFi

The Hadoop in Real World team takes a look at Apache NiFi:

NiFi is an easy-to-use tool that prefers configuration over coding.

However, NiFi is not limited to data ingestion. It can also handle data provenance, data cleaning, schema evolution, data aggregation, transformation, job scheduling, and more. We will discuss these in more detail, with a real-world data flow pipeline, in an upcoming post.

Hence, we can say NiFi is a highly automated framework for gathering, transporting, maintaining, and aggregating data of various types from various sources to a destination in a data flow pipeline.

Click through for an example with instructions. The feeling is pretty close to Informatica or SQL Server Integration Services, so if you’re an old hand at one of those, you’ll get into this pretty easily.
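
NiFi is driven through its UI, but everything you configure is also exposed over a REST API. As a small illustration, here is a minimal Python sketch, assuming an unsecured NiFi instance on localhost:8080 (host, port, and security settings will differ on your install), that calls two standard endpoints to confirm the instance is up and see what is queued in the flow:

```python
import requests

# Assumption: an unsecured NiFi instance on the default HTTP port.
# Secured installs need HTTPS plus token or certificate authentication.
base = "http://localhost:8080/nifi-api"

# /flow/about returns basic instance details, including the NiFi version.
about = requests.get(f"{base}/flow/about", timeout=10).json()
print("NiFi version:", about["about"]["version"])

# /flow/status returns controller-level counters for the running flow.
status = requests.get(f"{base}/flow/status", timeout=10).json()
print("Active threads:", status["controllerStatus"]["activeThreadCount"])
print("Queued:", status["controllerStatus"]["queued"])
```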

Breaking out of Azure Data Factory ForEach Activities

Andy Leonard is planning a jailbreak:

“What if something fails inside the ForEach activity’s inner activities, Andy?”

That is an excellent question! I’m glad you asked. The answer is: The ForEach activity continues iterating.

I can now hear some of you asking…

“What if I want the ForEach activity to fail when an inner activity fails, Andy?”

Another excellent question, and you’ve found a post discussing one way to “break out” of a ForEach activity’s iteration.

Read on for the process.
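
Andy's break-out technique lives inside the pipeline design itself, so read his post for that. Purely as a hedged illustration of the underlying problem, here is a Python sketch (not Andy's method) that uses the azure-mgmt-datafactory SDK to list which activities inside a given pipeline run failed, since the ForEach will have kept iterating past them; the subscription, resource group, factory, run ID, and time window are all placeholders:

```python
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

# Placeholders: substitute your own subscription, resource group, factory, and run ID.
client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

window = RunFilterParameters(
    last_updated_after=datetime.now(timezone.utc) - timedelta(days=1),
    last_updated_before=datetime.now(timezone.utc),
)

# Pull every activity run for the pipeline run, including the ForEach's inner activities.
activities = client.activity_runs.query_by_pipeline_run(
    "rg-data", "adf-dev", "<pipeline-run-id>", window
).value

failed = [a.activity_name for a in activities if a.status == "Failed"]
if failed:
    raise RuntimeError(f"Inner activities failed: {failed}")
```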

Securing the Data Prep Area

Tim Mitchell explains why you should limit access to your staging area:

First things first, let’s define what a data prep area is. Data preparation (prep) is a common phase of extract, transform, and load (ETL) operations in which data is temporarily written for cleansing, deduplication, reshaping, or other data modifications. Also sometimes referred to as a landing area or a staging area, this is a common design pattern when moving data from a data store optimized for online transaction processing (OLTP) to a data model more friendly to analytics or reporting.

The data prep area really is a lot like a restaurant kitchen: it’s sometimes chaotic, it’s not consumer friendly, and there is a legitimate risk of consuming half-prepared goods.

Tim lays out why that is, so check it out.

Including Headers in Zero-Row ADF Data Flows

Mark Kromer meets a challenge:

Today, we don’t have an option in data flows in ADF to include headers when your transformations result in zero rows. But you can build the logic to handle this scenario. So, until we add a checkbox feature to include headers, you can use the technique below to achieve it.

Click through for the explanation, as well as a completed version you can take for your own.

E-Mail Alerting in ADF.procfwk

Paul Andrew has an update to ADF.procfwk, his processing framework for Azure Data Factory:

The primary goal of this release was to implement email alerting within the existing processing framework, using existing metadata-driven practices to deliver it in an easy-to-control, flexible, and granular way. That said, the following statements have been met in terms of alerting capabilities and design.

Read on for the full change list.

Unit Testing Azure Data Factory Pipelines

Richard Swinbank walks us through what it takes to run a unit test against an Azure Data Factory pipeline:

In part three of this series I looked at functional tests for ADF pipelines: verifying, in isolation, that pipelines are “doing things right”. In this post I’ll be testing isolated pipelines to check that they’re “doing the right things” – this is one description of a unit test. In a general-purpose programming language, unit tests might be used to verify that an individual line of code is executed, or that it has a particular effect. In Azure Data Factory, the smallest unit of development – a “line of code” – is a pipeline activity. I will be writing tests to verify that specific activities are executed (or not) and to inspect their results.

There’s a fair bit involved in this sort of test.
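
Richard builds this on his own test helpers, so treat the following only as a hedged sketch of the idea in Python rather than his framework: a pytest-style test that triggers a run, waits for it to finish, and asserts that a specific (hypothetical) activity executed and succeeded. The subscription, resource group, factory, pipeline, and activity names are all placeholders:

```python
import time
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

# Placeholders: substitute your own names.
SUB, RG, ADF, PIPELINE = "<subscription-id>", "rg-data", "adf-dev", "PL_Load_Sales"


def test_copy_activity_executes():
    client = DataFactoryManagementClient(DefaultAzureCredential(), SUB)

    # Trigger the pipeline and poll until it reaches a terminal state.
    run_id = client.pipelines.create_run(RG, ADF, PIPELINE).run_id
    for _ in range(120):
        if client.pipeline_runs.get(RG, ADF, run_id).status not in ("Queued", "InProgress"):
            break
        time.sleep(15)

    # Fetch every activity run belonging to this pipeline run.
    window = RunFilterParameters(
        last_updated_after=datetime.now(timezone.utc) - timedelta(hours=2),
        last_updated_before=datetime.now(timezone.utc) + timedelta(hours=2),
    )
    activities = client.activity_runs.query_by_pipeline_run(RG, ADF, run_id, window).value

    # Assert the activity we care about actually ran, and ran successfully.
    copies = [a for a in activities if a.activity_name == "Copy Sales To Staging"]
    assert copies, "expected the copy activity to have executed"
    assert copies[0].status == "Succeeded"
```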

Bulk Loading SQL Server from .NET

Adrian Hills walks us through the SqlBulkCopy class:

Ever been in a situation where rumblings of “process X is too slow” suddenly build into a super-high priority ball of urgency when that next step up in data volume hits? Yeah, that can be fun. No, really, it can be fun because we have strategies to sort this stuff out, right?

In this blog post, I’m going to talk about one particular piece of functionality—SqlBulkCopy—that can help you with bulk data loading. If I had to single out my favorite .NET class, SqlBulkCopy would be at the top of the list. My goal is to introduce you to this class so that maybe it can become a part of your tool belt, too.

Click through to see how it works. If you’re familiar with SSIS, you’re already familiar with the concept if not the specifics.
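
SqlBulkCopy is a .NET class, so Adrian's examples are naturally C#. If you need the same "load lots of rows without row-by-row round trips" pattern from Python, one rough stand-in (a different mechanism, not SqlBulkCopy itself) is pyodbc with fast_executemany; the server, credentials, and table below are placeholders:

```python
import pyodbc

# Placeholders: point this at your own server, database, credentials, and table.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydb;UID=etl_user;PWD=<secret>"
)
cursor = conn.cursor()

# fast_executemany sends parameter batches instead of one round trip per row.
cursor.fast_executemany = True
rows = [(1, "alpha"), (2, "beta"), (3, "gamma")]
cursor.executemany("INSERT INTO dbo.StagingTable (Id, Name) VALUES (?, ?)", rows)
conn.commit()
```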

Feeding Databricks Output to Azure SQL Database

Arun Sirpal takes us through the process of moving data from Databricks into Azure SQL Database:

Recently I got to a stage where I had leveraged Databricks to the best of my ability to join a couple of CSV files together, play around with some aggregations, and then output the result back to a different mount point (based on Azure Storage) as a Parquet file. I then decided that I actually wanted to move this data into Azure SQL DB, which you may want to do one day.

This isn’t just dropping files into Blob Storage and picking them up, but rather a direct integration.
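
Arun covers the setup details in the post. As a hedged sketch of the final hop only (the mount path, JDBC URL, table name, and secret scope are placeholders, not his values), a Databricks notebook cell pushing the Parquet output into Azure SQL DB over JDBC might look like this:

```python
# Runs in a Databricks notebook, where `spark` and `dbutils` already exist.
df = spark.read.parquet("/mnt/processed/output.parquet")

jdbc_url = (
    "jdbc:sqlserver://myserver.database.windows.net:1433;"
    "database=mydb;encrypt=true;trustServerCertificate=false;"
)

# Append the DataFrame to an Azure SQL Database table over JDBC,
# pulling credentials from a Databricks secret scope rather than hard-coding them.
(df.write
   .format("jdbc")
   .option("url", jdbc_url)
   .option("dbtable", "dbo.DatabricksOutput")
   .option("user", dbutils.secrets.get("my-scope", "sql-user"))
   .option("password", dbutils.secrets.get("my-scope", "sql-password"))
   .mode("append")
   .save())
```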
