Press "Enter" to skip to content

A Critique (and Defense) of Generic Programming Languages for ETL/ELT

Teo Lachev doesn’t like general programming languages for ETL and ELT operations:

Someone asked the other day for my opinion about the open-source dbt tool for ETL. I hadn’t heard about it. Next thing I’ve noticed was that Fabric Warehouse added support for it, so I got inspired to take a first look. Seems like an ELT-oriented tool. Good, I’m a big fan of the ELT pattern whose virtues I extolled I discussed many times here. But a Python-based tool that requires writing custom code for orchestration in a dev environment, such as Visual Studio Code? Yuck!

My reasoning is simple: complexity. Bespoke ETL/ELT tools like SQL Server Information Services, Informatica, Azure Data Factory, Airflow, and the like are good when you fit into their primary use cases: moving data from a few data sources into a destination, perhaps with some level of transformation in between.

But here are areas off the top of my head where I’ve seen these tools not work well:

  • Wide scale. In one environment, we had to move contents from a couple thousand databases (with identical schemas) across 50-60 instances of SQL Server into a warehouse, including some facts and dimensions we needed within a minute or two. Even assuming those packages don’t change frequently—not a reasonable assumption—the pains of orchestrating that would be enormous. I don’t think we could have used metadata-driven approach and foreach loops in your ADF workflows, either, as that would not satisfy the time requirements. There are also resource limitation requirements on the other side—you don’t want to overwhelm the warehouse by trying to process a couple thousand clients’ worth of data all at once, so you’ve got to stagger this work using an orchestration engine with enough smarts to limit concurrent processes.
  • Limiting copy-paste efforts and drudgery. Going back to SSIS, it sucks having to maintain dozens of packages, especially common components you need to update in each one. I got to be pretty good at Biml, but a) that has its limits, and b) that’s C# development with SSIS packages as an output, so I’m claiming that for the generic programming languages side of the argument.