Category: ETL / ELT

Little Things in Azure Data Factory

Published 2021-01-13 by Kevin Feasel

Rayis Imayev has some kind words about small niceties in Azure Data Factory:

Recently Microsoft team conducted a brief year-end survey about a “one thing” that Azure Data Factory (ADF) “made your day in 2020” – https://twitter.com/weehyong/status/1343709921104183296. There were different responses from the global parameters support to the limit increase of ADF instances per subscription.
I personally like the little things that are not easily detected on a surface, but with a deeper immersion into a data pipeline development, your level of gratefulness increases even more.

Click through for a few examples.

Comments closed

Azure Data Factory and Source Control

Published 2021-01-12 by Kevin Feasel

Ahmad Yaseen shows how you can save Azure Data Factory pipelines in source control:

To overcome these limitations, Azure Data Factory provides us with the ability to integrate with a GIT repository, such as Azure DevOps or GitHub repository, that helps in tracking and versioning the pipelines changes, and incrementally save the pipeline changes during the development stage, without the need to validate the incomplete pipeline, preventing these changes from being lost in case of any crash or failure. In this case, you will be able to test the pipeline, revert any change that is detected as a bug, and publish the pipeline to the Data Factory when everything is developed and validated successfully.

Click through for the setup instructions.

Comments closed

Transforming Arrays in Azure Data Factory

Published 2021-01-08 by Kevin Feasel

Mark Kromer shows off a few functions in Azure Data Factory to modify data in arrays:

The first transformation function is map() and allows you to apply data flow scalar functions as the 2nd parameter to the map() function. In my case, I use upper() to uppercase every element in my string array: map(columnNames(),upper(#item))

Read on for more iteration and aggregation functions akin to map, reduce, and filter.

Comments closed

Configuring a Linked Server to Oracle

Published 2021-01-07 by Kevin Feasel

Emanuele Meazzo needs to pull data from Oracle into SQL Server:

The most atrocious part of my search for glory was without doubt navigating all the packages to download and install for each component, between broken links and differences between the instructions and the actual content, it’s a mess.

It took a while, based on Emanuele’s tone. With SQL Server 2019, you can avoid some of this pain by using PolyBase. But for prior versions of SQL Server, your options are more limited.

Comments closed

Another Batch of ETL Antipatterns

Published 2020-12-28 by Kevin Feasel

Tim Mitchell wraps up a series on ETL antipatterns with three posts. The first one is about not testing the ETL process:

Building ETL processes is quite easy. Building ETL processes that deliver accurate results as quickly as possible is substantially more difficult. Modern ETL tools (including my personal favorite, SQL Server Integration Services) make it deceptively easy to create simple load process. That’s a good thing, because an easy-to-understand front end shortens the timeline of going from zero to first results.
The challenge with such a low bar to entry is that some folks will stop refining the process when the load process is successful.

The second post looks at processes which don’t scale:

With very few exceptions, data volume will increase over time. Even when using an incremental load pattern, the most common trend is for the net data (new + changed) to increase with time. Even with steady, linear changes, it’s possible to outgrow the ETL design or system resources. With significant data explosion – commonly occurring in corporate acquisitions, data conversions, or rapid company growth – the ETL needs can quickly outrun the capacity.
Refactoring ETL for significant data growth isn’t always as simple as throwing more resources at the problem. Building ETL for proper scaling requires not just hefty hardware or service tiers; it requires good underlying data movement and transformation patterns that allow for larger volumes of data.

The final post implores us to think of the documentation:

Documentation is an asset that is both loathed and loved. Creating technical and business documentation is often looked upon as a tedious chore, something that really ought to be done for every project but is often an easy candidate to push until later (or skip entirely).
On the other hand, good documentation – particularly around data movement and ETL processes – is as valuable as the processes it describes. A clear and up-to-date document describing the what, when, where, and why of an ETL workflow adds transparency and makes the process much easier to understand for those who support it.

This has been an enjoyable series from Tim, so if you haven’t already, do check it out.

Comments closed

Retrieving Azure Log Analytics Data using Azure Data Factory

Published 2020-12-28 by Kevin Feasel

Meagan Longoria needs to move some log data around:

For this project, we have several Azure SQL Databases configured to send logs and metrics to a Log Analytics workspace. You can execute KQL queries against the workspace in the Log Analytics user interface in the Azure Portal, a notebook in Azure Data Studio, or directly through the API. The resulting format of the data downloaded from the API leaves something to be desired (it’s like someone shoved a CSV inside a JSON document), but it’s usable after a bit of parsing based upon column position. Just be sure your KQL query actually states the columns and their order (this can be done using the Project operator).

Click through for an example of moving this resultant data into Azure Storage.

Comments closed

Internal and External Azure Data Factory Pipeline Activities

Published 2020-12-23 by Kevin Feasel

Paul Andrew differentiates two form of pipeline activity:

Firstly, you might be thinking, why do I need to know this? Well, in my opinion, there are three main reasons for having an understanding of internal vs external activities:
1. Microsoft cryptically charges you a different rate of execution hours depending on the activity category when the pipeline is triggered. See the Azure Price Calculator.
2. Different resource limitations are enforced per subscription and region (not per Data Factory instance) depending on the activity category. See Azure Data Factory Resource Limitations.
3. I would suggest that understanding what compute is used for a given pipeline is good practice when building out complex control flows. For example, this relates to things like Hosted IR job concurrency, what resources can connect to what infrastructure and when activities may might become queued.

Paul warns that this is a dry topic, but these are important reasons to know the difference.

Comments closed

More ETL Antipatterns

Published 2020-12-22 by Kevin Feasel

Tim Mitchell continues a series on ETL anti-patterns. First up is lazy metadata:

Metadata management in ETL processes can be challenging. When exchanging data between systems and across different mediums (unstructured data, flat files, XML, or relational data), the definition of data types can get blurry. Some sources – including unstructured data and flat files – have little if any metadata configuration. On the other end of the spectrum is relational (RDBMS) data, which strictly defines data types and constraints.
Because of the challenges of this exchange process, many developers opt for an approach that I refer to as lazy metadata. In this design pattern, the ETL pipeline is configured very generically, using little if any metadata definition beyond that which is absolutely necessary for data movement.

Read on to see if you suffer from lazy metadata. If so, ask your doctor if Metatol is right for you.

Next, the lack of error handling logic:

Even the most robust extract-transform-load process will fail at some point. Even if there are no flaws in the ETL code, there are factors beyond the control of that process – network, authentication, and DNS, to name a few – that could break a load. When building ETL load logic, one must think beyond simply what the successful payload should look like, but what will happen if any component of that load fails.

This is infuriating in how common it is, even among people who know better. Err, please don’t look at my ETL processes…

Finally, Tim reminds us that ETL logic is source code:

In most data projects, building the extract-transform-load (ETL) logic takes a significant amount of time. Enterprise ETL processes must do several things well: retrieve enough data to satisfy the business needs, apply any needed transformations to that data, and load it to the destination(s) without interruption to any other business processes. The work that goes into building and validating that ETL logic can be significant, making the resulting code a very valuable asset to the enterprise.
However, in my travels I’ve discovered that there’s a lot of ETL code that doesn’t get the kind of care it deserves. Failing to treat ETL logic as source code can be a costly and time-consuming mistake.

I think much of this comes from SSIS building out giant XML files which were essentially binaries. BIML went a long way toward allowing us to treat ETL with SSIS as a proper language and store code in source control, but even if you’re hand-editing SSIS packages, I agree with Tim completely.

Comments closed

ETL Anti-Patterns: a Festivus Miracle

Published 2020-12-16 by Kevin Feasel

Tim Mitchell is ready to air some grievances:

We’re rounding the corner to the second half of December, which means it’s time for my favorite holiday: Festivus! Like many of you, I enjoy gathering around the Festivus pole and sharing the time-honored traditions such as the Feats Of Strength and the Airing Of Grievances.
But my favorite Festivus tradition takes place right here on this blog: the Eleven Days of Festivus. Each year, I write a daily blog post each of the eleven days leading up to Festivus, usually around a central theme.

Tim has three posts up so far. First is around jumping straight into the code-writing phase:

Most data architects and developers are intensely curious folks. When we see a set of data, we want to immediately step into a data whisperer role. Where others may see a jumbled mess, we see an opportunity to discover patterns and answers. The best data architects crave those data discovery finds the same way a baseball player craves a bottom-of-the-9th game-winning home run.
That kind of intellectual curiosity is a necessary trait for data architects, but it can lead to a rush straight into writing ETL code. I’ve seen this a lot, and have done it myself (and admittedly still do it on occasion): skipping past the business-value analysis and diving straight into the haystack looking for needles. Getting raw data into a format that can easily be analyzed and validated is a critical part of the ETL development life cycle, but rarely is it the first step.

Second, processing too much data:

A common design flaw in enterprise ETL processes is that they are processing too much data. Having access to a great breadth and depth of data opens up lots of options for historical reporting and data analytics, but very often it is mistakenly assumed that all of the available data must be processed through ETL.
Although it may sound counterintuitive, there are many cases where purposefully leaving some data out of the ETL process leads to a better outcome.

Third is performing full loads when incremental loads are possible:

Earlier this year, I wrote about the concepts of incremental loads and discussed the benefits of loading data incrementally. To recap: an incremental load moves only the new and changed data from each source – rather than the entire bulk of the source data – through the ETL pipeline.
Using incremental loads can improve both the speed and accuracy of data movement and transformation. The time required to process data increases with the volume of said data, and extracting only the new and changed data from the source can ensure an accurate ‘point-in-time’ representation of the data. For these reasons, loading data incrementally is, for most data load needs, the better way to go.

This is a good series to track.

Comments closed

BULK INSERT and Advent of Code

Published 2020-12-07 by Kevin Feasel

Thomas Rushton performs BULK INSERT I imagine the way most of us do, through trial and lots of error:

Step one in solving these problems in SQL Server is loading the data into a database so you can do something with it.
I chose to use BULK INSERT, because, according to the documentation, it:
Imports a data file into a database table or view in a user-specified format in SQL Server
Ideal, right?

Click through for several tips around BULK INSERT.

Comments closed