Press "Enter" to skip to content

Category: ETL / ELT

ETL Anti-Patterns: a Festivus Miracle

Tim Mitchell is ready to air some grievances:

We’re rounding the corner to the second half of December, which means it’s time for my favorite holiday: Festivus! Like many of you, I enjoy gathering around the Festivus pole and sharing the time-honored traditions such as the Feats Of Strength and the Airing Of Grievances.

But my favorite Festivus tradition takes place right here on this blog: the Eleven Days of Festivus. Each year, I write a daily blog post each of the eleven days leading up to Festivus, usually around a central theme. 

Tim has three posts up so far. First is around jumping straight into the code-writing phase:

Most data architects and developers are intensely curious folks. When we see a set of data, we want to immediately step into a data whisperer role. Where others may see a jumbled mess, we see an opportunity to discover patterns and answers. The best data architects crave those data discovery finds the same way a baseball player craves a bottom-of-the-9th game-winning home run.

That kind of intellectual curiosity is a necessary trait for data architects, but it can lead to a rush straight into writing ETL code. I’ve seen this a lot, and have done it myself (and admittedly still do it on occasion): skipping past the business-value analysis and diving straight into the haystack looking for needles. Getting raw data into a format that can easily be analyzed and validated is a critical part of the ETL development life cycle, but rarely is it the first step.

Second, processing too much data:

A common design flaw in enterprise ETL processes is that they are processing too much data. Having access to a great breadth and depth of data opens up lots of options for historical reporting and data analytics, but very often it is mistakenly assumed that all of the available data must be processed through ETL.

Although it may sound counterintuitive, there are many cases where purposefully leaving some data out of the ETL process leads to a better outcome. 

Third is performing full loads when incremental loads are possible:

Earlier this year, I wrote about the concepts of incremental loads and discussed the benefits of loading data incrementally. To recap: an incremental load moves only the new and changed data from each source – rather than the entire bulk of the source data – through the ETL pipeline.

Using incremental loads can improve both the speed and accuracy of data movement and transformation. The time required to process data increases with the volume of said data, and extracting only the new and changed data from the source can ensure an accurate ‘point-in-time’ representation of the data. For these reasons, loading data incrementally is, for most data load needs, the better way to go.

This is a good series to track.

Comments closed

BULK INSERT and Advent of Code

Thomas Rushton performs BULK INSERT I imagine the way most of us do, through trial and lots of error:

Step one in solving these problems in SQL Server is loading the data into a database so you can do something with it.

I chose to use BULK INSERT, because, according to the documentation, it:

Imports a data file into a database table or view in a user-specified format in SQL Server

Ideal, right?

Click through for several tips around BULK INSERT.

Comments closed

ETL with R in SQL Server

Rajendra Gupta shows one reason for using R inside SQL Server:

Data professionals get requests to import, export data into various formats. These formats can be such as Comma-separated data(.CSV), Excel, HTML, JSON, YAML, Tab-separated data(.TSV). Usually, we use SQL Server integration service ETL packages for data transformations, import or export data.

SQL Machine Learning can be useful in dealing with various file formats. In the article, External packages in R SQL Server, we explored the R services and the various external packages for performing tasks using R scripts.

In this article, we explore the useful Rio package developed by Thomas J. Leeper to simplify the data export and import process.

One thing I’d like to reiterate is that even though you’re using R to move this data, you don’t need to perform any data science activities on the data—R can be the easiest approach for getting and cleaning up certain types of data.

Comments closed

So You’ve Hit the Limits of ADF Concurrency

Paul Andrew shows what happens you you break the ADF concurrency barrier:

Firstly, understanding how these limits apply to your Data Factory pipelines takes a little bit of thinking about considering you need to understand the difference between an internal and external activity. Then you need to think about this with the caveats of being per subscription and importantly per Azure Integration Runtime region.

Assuming you know that, and you’ve hit these limits!

Click through to see what happens. It’s not pretty.

Comments closed

Working with Self-Hosted Integration Runtimes

Craig Porteous walks us through some of the planning necessary for self-hosted integration runtimes:

If your Data Factory contains a self-hosted Integration runtime, you will need to do some planning work before everything will work nicely with CI/CD pipelines. Unlike all other resources in your Data Factory, runtimes won’t deploy cleanly between environments, primarily as you connect the installed runtime directly to a single Data Factory. (We can add more runtime nodes to a single Data Factory but we cannot share a single node between many data factories*). An excerpt from Microsoft’s docs on Continuous integration and delivery in Azure Data Factory mentions this caveat.

Read on for the consequences and two options available to you.

Comments closed

Azure Data Factory Integration Runtimes

Tino Zishiri takes us through the concept of the Integration Runtime:

An Integration Runtime (IR) is the compute infrastructure used by Azure Data Factory to provide data integration capabilities such as Data Flows and Data Movement. It has access to resources in either public networks, or hybrid scenarios (public and private networks).

Read on to learn more about what they do and the variety of Integration Runtimes available to you.

Comments closed

Using Delimiters when Performing Change Detection with Hashbytes

Andy Leonard clues us in on an important character:

A student of my SSIS training asked why I place vertical pipe delimiters between column values when calculating the hash value for a row. I sent back a quick response and then informed him that he’d inspired a blog post.

Hence, this blog post.

To understand what I am talking about, you need to know that I define a 100% efficient data integration operation as a data integration process – such as an SSIS package or ADF pipeline – that loads only new and changed rows from a source. I explain this in the video. To achieve 100% efficiency, I use the T-SQL HashBytes function for change detection.

Check out the full reasoning. A secondary reason for using delimiters is that if you’re using the CONCAT() function to combine rather than calling CAST() or CONVERT() on everything individually, CONCAT() requires at least two inputs. If you always put a delimiter at the end as well, you have at least two inputs, guaranteed. It’s a small thing, but adds to consistency when, say, you only have one non-key, type 2 field on a dimension.

Comments closed

The ETLT Pattern

Abe Dearmer bridges the gap:

Because ETL and ELT present different strengths and weaknesses, many organizations are using a hybrid “ETLT” approach to get the best of both worlds. In this guide, we’ll help you understand the “why, what, and how” of ETLT, so you can determine if it’s right for your use-case. 

The idea can certainly be useful.

Comments closed

Check if an ADF Pipeline is Already Running

Paul Andrew has a scenario for us:

Scenario: I want to trigger a Data Factory pipeline, but when I do I want the pipeline to know if it’s already running. If it is already running, stop the new run.

Sounds simple enough right?

Wrong!

But, now simple for you, because I’ve done it for you, yay! 🙂

I thought it was simple, but it wasn’t simple, but now it’s simple, but is it really simple? Click through to find out.

Comments closed

Auto-Checking Azure Data Factory Setup

Paul Andrew is at it again:

Building on the work done and detailed in my previous blog post (Best Practices for Implementing Azure Data Factory) I was tasked by my delightful boss to turn this content into a simple check list of what/why that others could use…. I slightly reluctantly did so. However, I wanted to do something better than simply transcribe the previous blog post into a check list. I therefore decided to breakout the Shell of Power and attempt to automate said check list.

Sure, a check list could be picked up and used by anyone – with answers manually provided by the person doing the inspection of a given ADF resource. But what if there was a way to have the results given to you a plate and inferring things that aren’t always easy to spot via the Data Factory UI.

Paul uses an ARM template rather than hitting your Data Factory directly, so there’s a little bit more work for you the user, but Paul explains why it’s both necessary and proper.

Comments closed