ETL / ELT – Page 31 – Curated SQL

Validating Upstream Data Quality with T-SQL

Published 2019-10-30 by Kevin Feasel

We can select those rows which we want into our real table which will either already have constraints enabled or we can enable the constraints after we load the data. There are a few problems with this approach, the first problem is that this isn’t very easy to debug. When you realise that your load process only loaded half of the expected rows, why didn’t it load the rest? Other problems include, what do we do with any failed rows – i.e. which rows failed? To answer this we need another query that is the reverse of the “get good rows”. Finally, this is quite a lot of SQL for a file with four columns and three rules, what if the rules are more complex and our data file has lots of columns? I tell you what happens if the rules are more complex: “a mess of SQL jank”, that’s what.
So what instead?

Read on to learn about the alternative, which is an interesting approach. This is another case where a good SQL-based solution is almost a pivot of a good imperative solution: instead of thinking in rows, focus on columns first and let the rows sort themselves out.

Comments closed

Incremental Data Migration to Blob Storage

Published 2019-10-29 by Kevin Feasel

Ginger Daniel has started a series on data migration into Azure Blob Storage:

Part 1 of this article demonstrates how to upload multiple tables from an on-premise SQL Server to an Azure Blob Storage account as csv files. I covered these basic steps to get data from one place to the other using Azure Data Factory, however there are many other alternative ways to accomplish this, and many details in these steps that were not covered. For a deep-dive into the details you can start here https://docs.microsoft.com/en-us/azure/data-factory/introduction, and https://docs.microsoft.com/en-us/azure/data-factory/quickstart-create-data-factory-portal#create-a-pipeline.

Part 1 was chock full of information, and it looks like Part 2 will be as well.

Comments closed

Event-Driven Architecture with Azure Data Factory

Published 2019-10-28 by Kevin Feasel

Rayis Imayev walks us through two scenarios using file event triggers in Azure Data Factory:

Before we get into more details on how to use this new “Ignore empty blobs” feature, let’s briefly review possible scenarios of using file event triggers in your data processing workflow.

Read on for a quick overview of the two approaches.

Comments closed

Azure Data Factory Switch Activity

Published 2019-10-22 by Kevin Feasel

Rayis Imayev explains what the Switch activity does in Azure Data Factory:

Developing conditional logic of your Azure Data Factory control flow has been simplified with introducing of the Switch activity – https://docs.microsoft.com/en-us/azure/data-factory/control-flow-switch-activity. Official documentation resource states, this new data factory activity “provides the same functionality that a switch statement provides in programming languages“. I would also add a more simplified definition of the Switch activity in Azure Data Factory: it is a container (or wrapper) for multiple IF conditions.

Click through for an example.

Comments closed

Variable Header Counts and Azure Data Factory

Published 2019-09-30 by Kevin Feasel

Mark Kromer shows how you can convince Azure Data Factory to skip a variable number of lines before beginning processing:

A common requirement that I see from customers who are processing text files in data lakes with Azure Data Factory, is to read and process files where there are variable numbers of lines that precede both the data headers and the data that needs to be processed. ADF already has facilities that handle the ability to switch headers off or on as well as the ability to specify parameterized skip line counts. However, in many cases, files that are received for processing have variable numbers of superfluous lines that need to be skipped.
In ADF, between pipeline activities and data flows, there are a number of ways to handle this scenario. In this post, I am going to demonstrate one such technique.

Read on to see which technique Mark chose and how to get it working.

Comments closed

Azure Data Factory Pipeline Hierarchies

Published 2019-09-27 by Kevin Feasel

Paul Andrew explains the idea of pipeline hierarchies with respect to Azure Data Factory:

Next, even if the concept isn’t new, where I’d like to call out two big differences in my approach to orchestration with ADF comes from working within Microsoft Azure. The highly scalable cloud platform presents some new challenges that SSIS simply didn’t. For me these are:
– Needing to consider our wider solution and what things now cost. I’m fairly sure I’ve said it before. When working with ‘Pay-as-you-go’ services we need to think about designing for cost/consumption as well as all our other data transformation and output requirements. In Azure it is so easy to just leave resources running night and day, when only a short window of compute is needed.
– We need to consider the scale out capabilities of the other services that ADF is going to invoke. Or, to put it another way, how much parallel activity execution do we want ADF to achieve? As you may know the ADF ForEach activity by default allows us to execution inner activities in parallel, but is that enough?

It’s a very interesting idea; read the whole thing.

Comments closed

Azure Data Factory and Schema Drift

Published 2019-09-17 by Kevin Feasel

Mark Kromer walks us through two techniques we can use in Azure Data Factory to deal with schema drift:

Azure Data Factory’s Mapping Data Flows have built-in capabilities to handle complex ETL scenarios that include the ability to handle flexible schemas and changing source data. We call this capability “schema drift“.
When you build transformations that need to handle changing source schemas, your logic becomes tricky. In ADF, you can either build data flows that always look for patterns in the source and utilize generic transformation functions, or you can add a Derived Column that defines your flow’s canonical model.

Click through for the discussion and comparison. Schema drift has been the bane of Integration Services’s existence, so it’s good to see them tackling the idea in Azure Data Factory.

Comments closed

Troubleshooting AWS Database Migration Service Errors

Published 2019-09-11 by Kevin Feasel

Samir Behara takes us through troubleshooting AWS Database Migration Service issues:

For troubleshooting any issues with AWS DMS, it is necessary to have logs enabled. The DMS logs would typically give a better picture and helps find errors or warnings that would indicate the root cause of the failure. If the logs are not available there is nothing much you can do from a detailed troubleshooting analysis perspective. So basically next step is to turn on DMS logs and kick the job again and validate if the errors are captured in the logs.
If logs are not enabled, you need to set up a new task with logging enabled so if and when it errors out, you can take a look and troubleshoot the same.

I’ll save my full rant for another day, but I’m not that impressed with DMS. It could be a failing on my part, though.

Comments closed

Proving ETL Correctness

Published 2019-08-13 by Kevin Feasel

Ed Elliott shares a few techniques for testing ETL processes:

Reconciliation is the process of going to your source system, getting a number and validating that number on the target. This ranges from being easy to impossible, so you need to decide what to reconcile on a case by case basis.
In its simplest form, we can go to a source system and find out things like how many records are to be copied, sum up totals and run other aggregations that we can then validate as correct (or not!) on the target system.

Ed has put together a thoughtful approach to validating data loads regardless of the source.

Comments closed

SSIS Design Preferences

Published 2019-08-09 by Kevin Feasel

Meagan Longoria systematizes a set of preferences regarding Integration Services package and ETL process design:

– Every table should have InsertDateTime and UpdateDateTime columns. The UpdateDateTime column should be populated with the same value as the InsertDateTime column upon creation of the row, rather than being left null.
– Whatever you use to create tables, include primary keys, foreign keys, and indexes with your table definitions. Provide explicit constraint names to simplify database comparisons. You can disable your foreign keys, but they need to be there to provide that metadata.
– Separate your final dimensional/reporting tables from audit tables and staging tables. This can be done with separate schemas or even separate databases.

People have added some more thoughts in the comments as well.

Comments closed

Category: ETL / ELT