Press "Enter" to skip to content

Category: ETL / ELT

Internal and External Azure Data Factory Pipeline Activities

Paul Andrew differentiates two form of pipeline activity:

Firstly, you might be thinking, why do I need to know this? Well, in my opinion, there are three main reasons for having an understanding of internal vs external activities:

1. Microsoft cryptically charges you a different rate of execution hours depending on the activity category when the pipeline is triggered. See the Azure Price Calculator.

2. Different resource limitations are enforced per subscription and region (not per Data Factory instance) depending on the activity category. See Azure Data Factory Resource Limitations.

3. I would suggest that understanding what compute is used for a given pipeline is good practice when building out complex control flows. For example, this relates to things like Hosted IR job concurrency, what resources can connect to what infrastructure and when activities may might become queued.

Paul warns that this is a dry topic, but these are important reasons to know the difference.

Comments closed

More ETL Antipatterns

Tim Mitchell continues a series on ETL anti-patterns. First up is lazy metadata:

Metadata management in ETL processes can be challenging. When exchanging data between systems and across different mediums (unstructured data, flat files, XML, or relational data), the definition of data types can get blurry. Some sources – including unstructured data and flat files – have little if any metadata configuration. On the other end of the spectrum is relational (RDBMS) data, which strictly defines data types and constraints.

Because of the challenges of this exchange process, many developers opt for an approach that I refer to as lazy metadata. In this design pattern, the ETL pipeline is configured very generically, using little if any metadata definition beyond that which is absolutely necessary for data movement.

Read on to see if you suffer from lazy metadata. If so, ask your doctor if Metatol is right for you.

Next, the lack of error handling logic:

Even the most robust extract-transform-load process will fail at some point. Even if there are no flaws in the ETL code, there are factors beyond the control of that process – network, authentication, and DNS, to name a few – that could break a load. When building ETL load logic, one must think beyond simply what the successful payload should look like, but what will happen if any component of that load fails.

This is infuriating in how common it is, even among people who know better. Err, please don’t look at my ETL processes…

Finally, Tim reminds us that ETL logic is source code:

In most data projects, building the extract-transform-load (ETL) logic takes a significant amount of time. Enterprise ETL processes must do several things well: retrieve enough data to satisfy the business needs, apply any needed transformations to that data, and load it to the destination(s) without interruption to any other business processes. The work that goes into building and validating that ETL logic can be significant, making the resulting code a very valuable asset to the enterprise.

However, in my travels I’ve discovered that there’s a lot of ETL code that doesn’t get the kind of care it deserves. Failing to treat ETL logic as source code can be a costly and time-consuming mistake.

I think much of this comes from SSIS building out giant XML files which were essentially binaries. BIML went a long way toward allowing us to treat ETL with SSIS as a proper language and store code in source control, but even if you’re hand-editing SSIS packages, I agree with Tim completely.

Comments closed

ETL Anti-Patterns: a Festivus Miracle

Tim Mitchell is ready to air some grievances:

We’re rounding the corner to the second half of December, which means it’s time for my favorite holiday: Festivus! Like many of you, I enjoy gathering around the Festivus pole and sharing the time-honored traditions such as the Feats Of Strength and the Airing Of Grievances.

But my favorite Festivus tradition takes place right here on this blog: the Eleven Days of Festivus. Each year, I write a daily blog post each of the eleven days leading up to Festivus, usually around a central theme. 

Tim has three posts up so far. First is around jumping straight into the code-writing phase:

Most data architects and developers are intensely curious folks. When we see a set of data, we want to immediately step into a data whisperer role. Where others may see a jumbled mess, we see an opportunity to discover patterns and answers. The best data architects crave those data discovery finds the same way a baseball player craves a bottom-of-the-9th game-winning home run.

That kind of intellectual curiosity is a necessary trait for data architects, but it can lead to a rush straight into writing ETL code. I’ve seen this a lot, and have done it myself (and admittedly still do it on occasion): skipping past the business-value analysis and diving straight into the haystack looking for needles. Getting raw data into a format that can easily be analyzed and validated is a critical part of the ETL development life cycle, but rarely is it the first step.

Second, processing too much data:

A common design flaw in enterprise ETL processes is that they are processing too much data. Having access to a great breadth and depth of data opens up lots of options for historical reporting and data analytics, but very often it is mistakenly assumed that all of the available data must be processed through ETL.

Although it may sound counterintuitive, there are many cases where purposefully leaving some data out of the ETL process leads to a better outcome. 

Third is performing full loads when incremental loads are possible:

Earlier this year, I wrote about the concepts of incremental loads and discussed the benefits of loading data incrementally. To recap: an incremental load moves only the new and changed data from each source – rather than the entire bulk of the source data – through the ETL pipeline.

Using incremental loads can improve both the speed and accuracy of data movement and transformation. The time required to process data increases with the volume of said data, and extracting only the new and changed data from the source can ensure an accurate ‘point-in-time’ representation of the data. For these reasons, loading data incrementally is, for most data load needs, the better way to go.

This is a good series to track.

Comments closed

BULK INSERT and Advent of Code

Thomas Rushton performs BULK INSERT I imagine the way most of us do, through trial and lots of error:

Step one in solving these problems in SQL Server is loading the data into a database so you can do something with it.

I chose to use BULK INSERT, because, according to the documentation, it:

Imports a data file into a database table or view in a user-specified format in SQL Server

Ideal, right?

Click through for several tips around BULK INSERT.

Comments closed

ETL with R in SQL Server

Rajendra Gupta shows one reason for using R inside SQL Server:

Data professionals get requests to import, export data into various formats. These formats can be such as Comma-separated data(.CSV), Excel, HTML, JSON, YAML, Tab-separated data(.TSV). Usually, we use SQL Server integration service ETL packages for data transformations, import or export data.

SQL Machine Learning can be useful in dealing with various file formats. In the article, External packages in R SQL Server, we explored the R services and the various external packages for performing tasks using R scripts.

In this article, we explore the useful Rio package developed by Thomas J. Leeper to simplify the data export and import process.

One thing I’d like to reiterate is that even though you’re using R to move this data, you don’t need to perform any data science activities on the data—R can be the easiest approach for getting and cleaning up certain types of data.

Comments closed

So You’ve Hit the Limits of ADF Concurrency

Paul Andrew shows what happens you you break the ADF concurrency barrier:

Firstly, understanding how these limits apply to your Data Factory pipelines takes a little bit of thinking about considering you need to understand the difference between an internal and external activity. Then you need to think about this with the caveats of being per subscription and importantly per Azure Integration Runtime region.

Assuming you know that, and you’ve hit these limits!

Click through to see what happens. It’s not pretty.

Comments closed

Working with Self-Hosted Integration Runtimes

Craig Porteous walks us through some of the planning necessary for self-hosted integration runtimes:

If your Data Factory contains a self-hosted Integration runtime, you will need to do some planning work before everything will work nicely with CI/CD pipelines. Unlike all other resources in your Data Factory, runtimes won’t deploy cleanly between environments, primarily as you connect the installed runtime directly to a single Data Factory. (We can add more runtime nodes to a single Data Factory but we cannot share a single node between many data factories*). An excerpt from Microsoft’s docs on Continuous integration and delivery in Azure Data Factory mentions this caveat.

Read on for the consequences and two options available to you.

Comments closed

Azure Data Factory Integration Runtimes

Tino Zishiri takes us through the concept of the Integration Runtime:

An Integration Runtime (IR) is the compute infrastructure used by Azure Data Factory to provide data integration capabilities such as Data Flows and Data Movement. It has access to resources in either public networks, or hybrid scenarios (public and private networks).

Read on to learn more about what they do and the variety of Integration Runtimes available to you.

Comments closed

Using Delimiters when Performing Change Detection with Hashbytes

Andy Leonard clues us in on an important character:

A student of my SSIS training asked why I place vertical pipe delimiters between column values when calculating the hash value for a row. I sent back a quick response and then informed him that he’d inspired a blog post.

Hence, this blog post.

To understand what I am talking about, you need to know that I define a 100% efficient data integration operation as a data integration process – such as an SSIS package or ADF pipeline – that loads only new and changed rows from a source. I explain this in the video. To achieve 100% efficiency, I use the T-SQL HashBytes function for change detection.

Check out the full reasoning. A secondary reason for using delimiters is that if you’re using the CONCAT() function to combine rather than calling CAST() or CONVERT() on everything individually, CONCAT() requires at least two inputs. If you always put a delimiter at the end as well, you have at least two inputs, guaranteed. It’s a small thing, but adds to consistency when, say, you only have one non-key, type 2 field on a dimension.

Comments closed

The ETLT Pattern

Abe Dearmer bridges the gap:

Because ETL and ELT present different strengths and weaknesses, many organizations are using a hybrid “ETLT” approach to get the best of both worlds. In this guide, we’ll help you understand the “why, what, and how” of ETLT, so you can determine if it’s right for your use-case. 

The idea can certainly be useful.

Comments closed