Press "Enter" to skip to content

Category: ETL / ELT

Logical Separation in Azure Data Factory

Rayis Imayev is at a crossroads:

I was raised listening to and reading fairy tales where the main character would reach a crossroads with a large stone that had some directions written on it – turn right and you will lose your horse, turn left and you will lose your life, walk straight and you will find your happiness.

Also, growing up in a small Ukrainian industrial city situated close to a railroad hub, I was always fascinated to see the many colorful rail traffic lights, trying to imagine where the myriad of rail tracks would lead the trains on them.

Similarly, Azure Data Factory (ADF) provides several ways to control/direct/filter your pipeline workflows; it’s all conditioned and constrained to the boundaries of my “crossroad stone” writings.

As one of my intellectual heroes is purported to have said, if you see a fork in the road, take it.


Azure Data Factory Deployment Methods

Kamil Nowinski contrasts two methods for deploying Azure Data Factory pipelines:

It turned out that two-thirds of people use Microsoft’s deployment method, according to their answers on that poll (including a few people who publish the code manually). In 1/3 of cases, people prefer to deploy directly from code. You may ask: what are the differences? What characterizes both methods? Which one is better?
Before I start answering these questions, let me present both methods of publishing.

Read on to learn more about these approaches.


Little Things in Azure Data Factory

Rayis Imayev has some kind words about small niceties in Azure Data Factory:

Recently the Microsoft team conducted a brief year-end survey about the “one thing” in Azure Data Factory (ADF) that “made your day in 2020” – https://twitter.com/weehyong/status/1343709921104183296. There were different responses, from global parameters support to the increased limit of ADF instances per subscription.

I personally like the little things that are not easily detected on the surface, but with a deeper immersion into data pipeline development, your level of gratefulness increases even more.

Click through for a few examples.


Azure Data Factory and Source Control

Ahmad Yaseen shows how you can save Azure Data Factory pipelines in source control:

To overcome these limitations, Azure Data Factory provides us with the ability to integrate with a Git repository, such as an Azure DevOps or GitHub repository, which helps in tracking and versioning pipeline changes and incrementally saving those changes during the development stage, without the need to validate an incomplete pipeline, preventing these changes from being lost in case of any crash or failure. In this case, you will be able to test the pipeline, revert any change that is detected as a bug, and publish the pipeline to the Data Factory when everything is developed and validated successfully.

Click through for the setup instructions.


Transforming Arrays in Azure Data Factory

Mark Kromer shows off a few functions in Azure Data Factory to modify data in arrays:

The first transformation function is map(), which allows you to apply data flow scalar functions as the 2nd parameter to the map() function. In my case, I use upper() to uppercase every element in my string array: map(columnNames(), upper(#item))

Read on for more iteration and aggregation functions akin to map, reduce, and filter.
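
Mark's snippet uses ADF's data flow expression language. Purely as an analogue (this is not ADF syntax), the same ideas look like this in plain Python, with a hypothetical list standing in for columnNames():

    # Rough Python analogue of the array functions in ADF data flows.
    # The column list is hypothetical; in ADF, columnNames() would supply it.
    from functools import reduce

    column_names = ["OrderId", "CustomerName", "OrderDate", "TotalAmount"]

    # map(): apply a scalar function to every element, like map(columnNames(), upper(#item))
    upper_names = [name.upper() for name in column_names]

    # filter(): keep only the elements that match a condition
    date_columns = [name for name in column_names if name.endswith("Date")]

    # reduce(): collapse the array to a single value (here, a comma-separated list)
    column_csv = reduce(lambda acc, name: acc + "," + name, column_names)

    print(upper_names)   # ['ORDERID', 'CUSTOMERNAME', 'ORDERDATE', 'TOTALAMOUNT']
    print(date_columns)  # ['OrderDate']
    print(column_csv)    # OrderId,CustomerName,OrderDate,TotalAmount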


Configuring a Linked Server to Oracle

Emanuele Meazzo needs to pull data from Oracle into SQL Server:

The most atrocious part of my search for glory was without doubt navigating all the packages to download and install for each component; between broken links and differences between the instructions and the actual content, it’s a mess.

It took a while, based on Emanuele’s tone. With SQL Server 2019, you can avoid some of this pain by using PolyBase. But for prior versions of SQL Server, your options are more limited.


Another Batch of ETL Antipatterns

Tim Mitchell wraps up a series on ETL antipatterns with three posts. The first one is about not testing the ETL process:

Building ETL processes is quite easy. Building ETL processes that deliver accurate results as quickly as possible is substantially more difficult. Modern ETL tools (including my personal favorite, SQL Server Integration Services) make it deceptively easy to create a simple load process. That’s a good thing, because an easy-to-understand front end shortens the timeline of going from zero to first results.

The challenge with such a low bar to entry is that some folks will stop refining the process when the load process is successful.
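
As a minimal sketch of the sort of post-load check Tim is arguing for (the connection strings, table, and columns below are hypothetical, and the sketch assumes source and destination expose the same table shape), a simple row-count and sum reconciliation catches a lot of silent failures:

    import pyodbc  # assumes an ODBC driver for SQL Server is installed

    # Hypothetical connection strings for source and destination.
    SOURCE_CONN = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=src-server;DATABASE=Sales;Trusted_Connection=yes;"
    TARGET_CONN = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=dw-server;DATABASE=SalesDW;Trusted_Connection=yes;"

    RECON_QUERY = "SELECT COUNT(*) AS row_count, SUM(TotalAmount) AS total FROM dbo.Orders"

    def run_check(conn_str: str):
        """Run the reconciliation query and return the single summary row."""
        with pyodbc.connect(conn_str) as conn:
            return conn.cursor().execute(RECON_QUERY).fetchone()

    src, tgt = run_check(SOURCE_CONN), run_check(TARGET_CONN)

    # Fail loudly if the destination does not match the source after the load.
    assert src.row_count == tgt.row_count, f"Row count mismatch: {src.row_count} vs {tgt.row_count}"
    assert src.total == tgt.total, f"Amount mismatch: {src.total} vs {tgt.total}"
    print("Reconciliation passed.")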

The second post looks at processes which don’t scale:

With very few exceptions, data volume will increase over time. Even when using an incremental load pattern, the most common trend is for the net data (new + changed) to increase with time. Even with steady, linear changes, it’s possible to outgrow the ETL design or system resources. With significant data explosion – commonly occurring in corporate acquisitions, data conversions, or rapid company growth – the ETL needs can quickly outrun the capacity.

Refactoring ETL for significant data growth isn’t always as simple as throwing more resources at the problem. Building ETL for proper scaling requires not just hefty hardware or service tiers; it requires good underlying data movement and transformation patterns that allow for larger volumes of data.
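
Tim mentions incremental load patterns; a watermark-driven extract is one of the simplest ways to keep volumes in check as data grows. A rough sketch, with a hypothetical control table and source table:

    import pyodbc

    CONN = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=src-server;DATABASE=Sales;Trusted_Connection=yes;"

    def extract_incremental():
        with pyodbc.connect(CONN) as conn:
            cursor = conn.cursor()

            # A small control table records how far the previous load got.
            watermark = cursor.execute(
                "SELECT LastLoadedAt FROM etl.Watermark WHERE TableName = 'dbo.Orders'"
            ).fetchone().LastLoadedAt

            # Pull only rows changed since the last successful load, never the full table.
            rows = cursor.execute(
                "SELECT OrderId, CustomerName, TotalAmount, ModifiedAt "
                "FROM dbo.Orders WHERE ModifiedAt > ?",
                watermark,
            ).fetchall()

            # ... transform and load `rows` into the destination here ...

            # Advance the watermark only after the load succeeds.
            new_watermark = max((r.ModifiedAt for r in rows), default=watermark)
            cursor.execute(
                "UPDATE etl.Watermark SET LastLoadedAt = ? WHERE TableName = 'dbo.Orders'",
                new_watermark,
            )
            conn.commit()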

The final post implores us to think of the documentation:

Documentation is an asset that is both loathed and loved. Creating technical and business documentation is often looked upon as a tedious chore, something that really ought to be done for every project but is often an easy candidate to push until later (or skip entirely).

On the other hand, good documentation – particularly around data movement and ETL processes – is as valuable as the processes it describes. A clear and up-to-date document describing the what, when, where, and why of an ETL workflow adds transparency and makes the process much easier to understand for those who support it.

This has been an enjoyable series from Tim, so if you haven’t already, do check it out.


Retrieving Azure Log Analytics Data using Azure Data Factory

Meagan Longoria needs to move some log data around:

For this project, we have several Azure SQL Databases configured to send logs and metrics to a Log Analytics workspace. You can execute KQL queries against the workspace in the Log Analytics user interface in the Azure Portal, a notebook in Azure Data Studio, or directly through the API. The resulting format of the data downloaded from the API leaves something to be desired (it’s like someone shoved a CSV inside a JSON document), but it’s usable after a bit of parsing based upon column position. Just be sure your KQL query actually states the columns and their order (this can be done using the Project operator).

Click through for an example of moving this resultant data into Azure Storage.
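
The query API returns a "tables" array in which each table carries a "columns" list plus positional "rows", which is the "CSV shoved inside a JSON document" shape Meagan describes. A hedged sketch of flattening that into dictionaries (the workspace ID, token, and query are placeholders):

    import requests

    # Placeholders: substitute your workspace ID and a bearer token obtained for
    # the https://api.loganalytics.io resource (e.g. via azure-identity).
    WORKSPACE_ID = "<workspace-guid>"
    TOKEN = "<bearer-token>"

    # State the columns and their order with the project operator, as Meagan advises.
    query = """
    AzureDiagnostics
    | where TimeGenerated > ago(1d)
    | project TimeGenerated, ResourceId, Category, OperationName
    """

    response = requests.get(
        f"https://api.loganalytics.io/v1/workspaces/{WORKSPACE_ID}/query",
        params={"query": query},
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    response.raise_for_status()

    # The payload is a list of tables, each with a column list and positional rows.
    table = response.json()["tables"][0]
    column_names = [col["name"] for col in table["columns"]]
    records = [dict(zip(column_names, row)) for row in table["rows"]]

    for record in records[:5]:
        print(record)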


Internal and External Azure Data Factory Pipeline Activities

Paul Andrew differentiates two forms of pipeline activity:

Firstly, you might be thinking, why do I need to know this? Well, in my opinion, there are three main reasons for having an understanding of internal vs external activities:

1. Microsoft cryptically charges you a different rate of execution hours depending on the activity category when the pipeline is triggered. See the Azure Price Calculator.

2. Different resource limitations are enforced per subscription and region (not per Data Factory instance) depending on the activity category. See Azure Data Factory Resource Limitations.

3. I would suggest that understanding what compute is used for a given pipeline is good practice when building out complex control flows. For example, this relates to things like Hosted IR job concurrency, what resources can connect to what infrastructure, and when activities might become queued.

Paul warns that this is a dry topic, but these are important reasons to know the difference.


More ETL Antipatterns

Tim Mitchell continues a series on ETL antipatterns. First up is lazy metadata:

Metadata management in ETL processes can be challenging. When exchanging data between systems and across different mediums (unstructured data, flat files, XML, or relational data), the definition of data types can get blurry. Some sources – including unstructured data and flat files – have little if any metadata configuration. On the other end of the spectrum is relational (RDBMS) data, which strictly defines data types and constraints.

Because of the challenges of this exchange process, many developers opt for an approach that I refer to as lazy metadata. In this design pattern, the ETL pipeline is configured very generically, using little if any metadata definition beyond that which is absolutely necessary for data movement.

Read on to see if you suffer from lazy metadata. If so, ask your doctor if Metatol is right for you.
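
As a small illustration of the difference (the file and columns are hypothetical), compare a lazy flat-file load that lets every column come through as an untyped string with one that declares the expected metadata up front:

    import pandas as pd

    # Lazy metadata: take whatever arrives and treat every column as a string.
    lazy = pd.read_csv("orders.csv", dtype=str)

    # Explicit metadata: declare the columns, their types, and how dates are parsed.
    explicit = pd.read_csv(
        "orders.csv",
        usecols=["OrderId", "CustomerName", "TotalAmount", "OrderDate"],
        dtype={"OrderId": "int64", "CustomerName": "string", "TotalAmount": "float64"},
        parse_dates=["OrderDate"],
    )

    # With declared types, bad data fails loudly at load time instead of leaking downstream.
    print(lazy.dtypes)
    print(explicit.dtypes)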

Next, the lack of error handling logic:

Even the most robust extract-transform-load process will fail at some point. Even if there are no flaws in the ETL code, there are factors beyond the control of that process – network, authentication, and DNS, to name a few – that could break a load. When building ETL load logic, one must think beyond simply what the successful payload should look like, but what will happen if any component of that load fails.

This is infuriating in how common it is, even among people who know better. Err, please don’t look at my ETL processes…
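
A minimal sketch of the kind of defensive wrapper Tim has in mind, where transient faults are retried and anything else is logged and surfaced rather than swallowed (load_batch is a hypothetical stand-in for the real data movement):

    import logging
    import time

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("etl.load")

    class TransientLoadError(Exception):
        """Raised for failures worth retrying, e.g. network or authentication hiccups."""

    def load_batch(batch_id: int) -> None:
        # Hypothetical load step; raise TransientLoadError for retryable faults.
        ...

    def load_with_retries(batch_id: int, max_attempts: int = 3, backoff_seconds: int = 30) -> None:
        for attempt in range(1, max_attempts + 1):
            try:
                load_batch(batch_id)
                logger.info("Batch %s loaded on attempt %s", batch_id, attempt)
                return
            except TransientLoadError as exc:
                logger.warning("Attempt %s for batch %s failed: %s", attempt, batch_id, exc)
                if attempt == max_attempts:
                    # Give up loudly so the orchestrator marks the run as failed.
                    logger.error("Batch %s failed after %s attempts", batch_id, max_attempts)
                    raise
                time.sleep(backoff_seconds)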

Finally, Tim reminds us that ETL logic is source code:

In most data projects, building the extract-transform-load (ETL) logic takes a significant amount of time. Enterprise ETL processes must do several things well: retrieve enough data to satisfy the business needs, apply any needed transformations to that data, and load it to the destination(s) without interruption to any other business processes. The work that goes into building and validating that ETL logic can be significant, making the resulting code a very valuable asset to the enterprise.

However, in my travels I’ve discovered that there’s a lot of ETL code that doesn’t get the kind of care it deserves. Failing to treat ETL logic as source code can be a costly and time-consuming mistake.

I think much of this comes from SSIS building out giant XML files which were essentially binaries. BIML went a long way toward allowing us to treat ETL with SSIS as a proper language and store code in source control, but even if you’re hand-editing SSIS packages, I agree with Tim completely.
