Cleaning Up After Yourself in Azure Data Factory

Rayis Imayev shows how you can automatically delete old files in Azure Data Factory:

File management may not be at the top of my list of priorities during data integration projects. I assume that once I learn enough about sourcing data systems and target destination platform, I’m ready to design and build a data integration solution between two or more connecting points. Then, a historical file management process becomes a necessity or a need to log and remove some of the incorrectly loaded data files. Basically, a step in my data integration process to remove (or clean) such files would be helpful. 

Click through to see how to do this.

Exporting SQL Server Tables to Excel with Powershell

Aaron Nelson shows how you can export the tables in a SQL Server database to Excel, using a warehouse as an example:

Obviously, you have to have the module installed, and a copy of AdventureWorksDW2017 db restored to a SQL Server.  After that,  all you have to do is loop through the tables, ‘query’ them with the Read-SqlTableData cmdlet, and pipe the results to the Export-Excel cmdlet.

I did some trial and error with this yesterday.  I settled on exporting all of the Dimension tables to separate Worksheets within the same Excel file, and exporting all of the Fact tables to their own individual files (since they tend to be much larger).

Click through for Aaron’s script.

Passing Messages to Azure Service Bus via Data Factory

Kevin Feasel

2019-03-11

Cloud, ETL

Rayis Imayev shows how we can use Logic Apps to let Azure Data Factory send messages to Azure Service Bus:

Summary:
1) Azure Data Factory and Service Bus can find common grounds to communicate with each other, and Azure Logic Apps could serve as a good mediator to establish this type of messaging communication.
2) As soon as messages land in a service bus queue, it’s now a responsibility of recipient side to obtain and process those message, which may be part of another blog post.

Click through for a demo of the process.

Deleting in Azure Data Factory

Kevin Feasel

2019-03-07

Cloud, ETL

Meagan Longoria is happy that Azure Data Factory v2 now has a Delete activity:

It is a common practice to load data to blob storage or data lake storage before loading to a database, especially if your data is coming from outside of Azure. We often create a staging area in our data lakes to hold data until it has been loaded to its next destination. Then we delete the data in the staging area once our subsequent load is successful. But before February 2019, there was no Delete activity. We had to write an Azure Function or use a Logic App called by a Web Activity in order to delete a file. I imagine every person who started working with Data Factory had to go and look this up.

But now Data Factory V2 has a Delete activity.

Meagan shows how it works, what kinds of parameters you can set, and a couple of gotchas, so check it out.

Azure Data Factory Data Flows

Kevin Feasel

2019-03-04

Cloud, ETL

Marlon Ribunal shows how we can perform some amount of data transformation in an Azure Data Factory V2 data flow:

Azure Data Factory (ADF) offers a convenient cloud-based platform for orchestrating data from and to on-premise, on-cloud, and hybrid sources and destinations. But it is not a full Extract, Transform, and Load (ETL) tool. For those who are well-versed with SQL Server Integration Services (SSIS), ADF would be the Control Flow portion.

You can scale out your SSIS implementation in Azure. In fact, there are two (2) options to do this: SSIS On-Premise using the SSIS runtime hosted by SQL Server or On Azure using the Azure-SSIS Integration Runtime.

Azure Data Factory is not quite an ETL tool as SSIS is. There is that transformation gap that needs to be filled for ADF to become a true On-Cloud ETL Tool. The second iteration of ADF in V2 is closing the transformation gap with the introduction of Data Flow.

Despite it not being nearly as complete as SSIS, there are useful data transformations available in Azure Data Factory, as Marlon shows.

The Zen Of Airflow

Kevin Feasel

2019-02-22

ETL, Python

Bas Harenslak shows how you can think of The Zen of Python as it applies to Apache Airflow:

Apache Airflow is a Python framework for programmatically creating workflows in DAGs, e.g. ETL processes, generating reports, and retraining models on a daily basis. This allows for concise and flexible scripts but can also be the downside of Airflow; since it’s Python code there are infinite ways to define your pipelines. The Zen of Python is a list of 19 Python design principles and in this blog post I point out some of these principles on four Airflow examples. This blog was written with Airflow 1.10.2.

My favorite of the Zen of Python principles is a combination of two: “simple is better than complex; complex is better than complicated.” That’s something I don’t always get right, but it is critical for a stable architecture.

Regression Testing With Pester

Ust Oldfield continues a series on Pester testing:

In a previous post, I gave an overview to regression tests. In this post, I will give a practical example of developing and performing regression tests with the Pester framework for PowerShell. The code for performing regression tests is written in PowerShell using the Pester Framework. The tests are run through Azure DevOps pipelines and are designed to test regression scenarios. The PowerShell scripts, which contain the mechanism for executing tests, rely upon receiving the actual test definitions from a metadata database. The structure of the metadata database will be exactly the same as laid out in the Integration Test post.

There’s a hefty test script here too, so check it out.

Saving An ADF Pipeline As A Template

Kevin Feasel

2019-02-14

Cloud, ETL

Rayis Imayev shares with us how you can save an Azure Data Factory pipeline as a template:

Azure Data Factory (ADF) provides you with a framework for creating data transformation solutions in the Microsoft cloud environment. Recently introduced Template Gallery for ADF pipelines can speed up this development process and provide you with helpful information to create additional activity tasks in your pipelines.

We naturally long to seek if something standard can be further adjusted. That custom design is like ordering a regular pizza and then hitting the “customize” button in order to add a few toppings of our choice. It would be very impressive then to save this customized “creation” for future ordering. And Azure Data Factory has a similar option to save your custom data transformation solutions (pipelines) as templates and reuse them later.

Click through to see how you can do just that.

Using The ROWVERSION Type For ETL

Max Vernon shows us how to use the ROWVERSION data type to tell how much work you have to do to ETL data over from one table to another:

The OLTP table implements a rowversion column that is automatically updated whenever a row is updated or inserted. The rowversion number is unique at the database level, and increments monotonically for all transactions that take place within the context of that database. The dbo.OLTP_Updates table is used to store the minimum row version available inside the transaction used to copy data from the OLTP table into the OLAP table. Each time this code runs it captures incremental changes. This is far more efficient than comparing all the rows in both tables using a hashing function since this method doesn’t require reading any data other than the source data that is either new, or has changed.

I think this is the first time I’ve seen someone use ROWVERSION types successfully.

Integrating Azure Data Factory With GitHub

Rayis Imayev shows us how to tie Azure Data Factory pipelines with GitHub, allowing automatic check-in based on ADF pipeline changes:

Working with Azure Data Factory (ADF) enables me to build and monitor my Extract Transform Load (ETL) workflows in Azure. My ADF pipelines is a cloud version of previously used ETL projects in SQL Server SSIS.

And prior to this point, all my sample ADF pipelines were developed in so-called “Live Data Factory Mode” using my personal workspace, i.e. all changes had to be published in order to be saved. This hasn’t been the best practice from my side, and I needed to start using a source control tool to preserve and version my development code.

Click through for a detailed demo.

Categories

August 2019
MTWTFSS
« Jul  
 1234
567891011
12131415161718
19202122232425
262728293031