Press "Enter" to skip to content

Category: ETL / ELT

Deleting in Azure Data Factory

Meagan Longoria is happy that Azure Data Factory v2 now has a Delete activity:

It is a common practice to load data to blob storage or data lake storage before loading to a database, especially if your data is coming from outside of Azure. We often create a staging area in our data lakes to hold data until it has been loaded to its next destination. Then we delete the data in the staging area once our subsequent load is successful. But before February 2019, there was no Delete activity. We had to write an Azure Function or use a Logic App called by a Web Activity in order to delete a file. I imagine every person who started working with Data Factory had to go and look this up.

But now Data Factory V2 has a Delete activity.

Meagan shows how it works, what kinds of parameters you can set, and a couple of gotchas, so check it out.

Comments closed

Azure Data Factory Data Flows

Marlon Ribunal shows how we can perform some amount of data transformation in an Azure Data Factory V2 data flow:

Azure Data Factory (ADF) offers a convenient cloud-based platform for orchestrating data from and to on-premise, on-cloud, and hybrid sources and destinations. But it is not a full Extract, Transform, and Load (ETL) tool. For those who are well-versed with SQL Server Integration Services (SSIS), ADF would be the Control Flow portion.

You can scale out your SSIS implementation in Azure. In fact, there are two (2) options to do this: SSIS On-Premise using the SSIS runtime hosted by SQL Server or On Azure using the Azure-SSIS Integration Runtime.

Azure Data Factory is not quite an ETL tool as SSIS is. There is that transformation gap that needs to be filled for ADF to become a true On-Cloud ETL Tool. The second iteration of ADF in V2 is closing the transformation gap with the introduction of Data Flow.

Despite it not being nearly as complete as SSIS, there are useful data transformations available in Azure Data Factory, as Marlon shows.

Comments closed

The Zen Of Airflow

Bas Harenslak shows how you can think of The Zen of Python as it applies to Apache Airflow:

Apache Airflow is a Python framework for programmatically creating workflows in DAGs, e.g. ETL processes, generating reports, and retraining models on a daily basis. This allows for concise and flexible scripts but can also be the downside of Airflow; since it’s Python code there are infinite ways to define your pipelines. The Zen of Python is a list of 19 Python design principles and in this blog post I point out some of these principles on four Airflow examples. This blog was written with Airflow 1.10.2.

My favorite of the Zen of Python principles is a combination of two: “simple is better than complex; complex is better than complicated.” That’s something I don’t always get right, but it is critical for a stable architecture.

Comments closed

Regression Testing With Pester

Ust Oldfield continues a series on Pester testing:

In a previous post, I gave an overview to regression tests. In this post, I will give a practical example of developing and performing regression tests with the Pester framework for PowerShell. The code for performing regression tests is written in PowerShell using the Pester Framework. The tests are run through Azure DevOps pipelines and are designed to test regression scenarios. The PowerShell scripts, which contain the mechanism for executing tests, rely upon receiving the actual test definitions from a metadata database. The structure of the metadata database will be exactly the same as laid out in the Integration Test post.

There’s a hefty test script here too, so check it out.

Comments closed

Saving An ADF Pipeline As A Template

Rayis Imayev shares with us how you can save an Azure Data Factory pipeline as a template:

Azure Data Factory (ADF) provides you with a framework for creating data transformation solutions in the Microsoft cloud environment. Recently introduced Template Gallery for ADF pipelines can speed up this development process and provide you with helpful information to create additional activity tasks in your pipelines.

We naturally long to seek if something standard can be further adjusted. That custom design is like ordering a regular pizza and then hitting the “customize” button in order to add a few toppings of our choice. It would be very impressive then to save this customized “creation” for future ordering. And Azure Data Factory has a similar option to save your custom data transformation solutions (pipelines) as templates and reuse them later.

Click through to see how you can do just that.

Comments closed

Using The ROWVERSION Type For ETL

Max Vernon shows us how to use the ROWVERSION data type to tell how much work you have to do to ETL data over from one table to another:

The OLTP table implements a rowversion column that is automatically updated whenever a row is updated or inserted. The rowversion number is unique at the database level, and increments monotonically for all transactions that take place within the context of that database. The dbo.OLTP_Updates table is used to store the minimum row version available inside the transaction used to copy data from the OLTP table into the OLAP table. Each time this code runs it captures incremental changes. This is far more efficient than comparing all the rows in both tables using a hashing function since this method doesn’t require reading any data other than the source data that is either new, or has changed.

I think this is the first time I’ve seen someone use ROWVERSION types successfully.

Comments closed

Integrating Azure Data Factory With GitHub

Rayis Imayev shows us how to tie Azure Data Factory pipelines with GitHub, allowing automatic check-in based on ADF pipeline changes:

Working with Azure Data Factory (ADF) enables me to build and monitor my Extract Transform Load (ETL) workflows in Azure. My ADF pipelines is a cloud version of previously used ETL projects in SQL Server SSIS.

And prior to this point, all my sample ADF pipelines were developed in so-called “Live Data Factory Mode” using my personal workspace, i.e. all changes had to be published in order to be saved. This hasn’t been the best practice from my side, and I needed to start using a source control tool to preserve and version my development code.

Click through for a detailed demo.

Comments closed

Azure Data Factory Data Flows

Joost van Rossum takes a look at data flows in Azure Data Factory:

2) Create Databricks Service
Yes you are reading this correctly. Under the hood Data Factory is using Databricks to execute the Data flows, but don’t worry you don’t have to write code.
Create a Databricks Service and choose the right region. This should be the same as your storage region to prevent high data movement costs. As Pricing Tier you can use Standard for this introduction. Creating the service it self doesn’t cost anything.

Joost shows the work you have to do to build out a data flow. This has been a big hole in ADF—yeah, ADF seems more like an ELT tool than an ETL tool but even within that space, there are times when you need to do a bit more than pump-and-dump.

Comments closed

Data Transformation Tools In The Azure Space

James Serra gives us an overview of the major tools you would use for ETL and ELT in Azure:

If you are building a big data solution in the cloud, you will likely be landing most of the source data into a data lake.  And much of this data will need to be transformed (i.e. cleaned and joined together – the “T” in ETL).  Since the data lake is just storage (i.e. Azure Data Lake Storage Gen2 or Azure Blob Storage), you need to pick a product that will be the compute and will do the transformation of the data.  There is good news and bad news when it comes to which product to use.  The good news is there are a lot of products to choose from.  The bad news is there are a lot of products to choose from :-).  I’ll try to help your decision-making by talking briefly about most of the Azure choices and the best use cases for each when it comes to transforming data (although some of these products also do the Extract and Load part

The only surprise is the non-mention of Azure Data Lake Analytics, and there is a good conversation in the comments section explaining why.

Comments closed

Apache Airflow Now A Top-Level Project

Fokko Driesprong announces that Apache Airflow is now a top-level Apache project:

Today is a great day for Apache Airflow as it graduates from incubating status to a Top-Level Apache project. This is the next step of maturity for Airflow. For those unfamiliar, Airflow is an orchestration tool to schedule and orchestrate your data workflows. From ETL to training of models, or any other arbitrary tasks. Unlike other orchestrators, everything is written in Python, which makes it easy to use for both engineers and scientists. Having everything in code means that it is easy to version and maintain.

Airflow has been getting some hype lately, especially in the AWS space.

Comments closed