Press "Enter" to skip to content

Category: ETL

Notes on Wrangling Data Flows

Rayis Imayev calculates distance between two geographical points in an Azure Data Factory Wrangling data flow:

Brian Donovan and Dan Work from the University of Illinois has pointed out that this dataset “contains a large number of errors. For example, there are several trips where the reported meter distances are significantly shorter than the straight-line distance, violating Euclidean geometry“. So, that triggered my interest to add an additional column to this dataset with a straight line distance between two geo-points of pickup and dropoff locations, and that’s where I wanted Wrangling Data Flows to help me.

Read on for Rayis’s demonstration, as well as a long list of observations (positive and negative) about the current state of Wrangling data flows.

Leave a Comment

Debugging Azure Data Factory Data Flows

Mark Kromer takes us through debugging Azure Data Factory Data Flows:

When you are designing your mapping data flows in ADF, you are working against a live Azure Databricks Spark cluster. The size of that cluster is configurable via the Azure Integration Runtime. If you do not configure a custom Azure IR, then you will use the default Azure IR. That sets a very small cluster size by default of 4 cores for a single worker node and 4 cores for a single driver node. In most cases, while debugging and using data preview, that should be fine. But when you start exploring your data with column statistics or increase the sampling size in debug settings, you may find that you’ve exceeded the capacity on that small default cluster. Below are the steps you need to take to increase the size of your debug cluster.

Click through for step-by-step instructions.

Leave a Comment

Embedding SSIS Packages in Azure Data Factory Pipelines

Andy Leonard shows us how to embed an SSIS package inside Azure Data Factory pipelines:

The Azure-SSIS Team has done it again; they’ve added more cool SSIS execution functionality to Azure Data Factory!

Click through to see what has Andy excited. I think this is a big thing for ADF as well, especially in shops which dedicated a lot of time and energy into building SSIS packages for ETL work over the years.

Leave a Comment

Validating Upstream Data Quality with T-SQL

Ed Elliott has a pattern to try:

We can select those rows which we want into our real table which will either already have constraints enabled or we can enable the constraints after we load the data. There are a few problems with this approach, the first problem is that this isn’t very easy to debug. When you realise that your load process only loaded half of the expected rows, why didn’t it load the rest? Other problems include, what do we do with any failed rows – i.e. which rows failed? To answer this we need another query that is the reverse of the “get good rows”. Finally, this is quite a lot of SQL for a file with four columns and three rules, what if the rules are more complex and our data file has lots of columns? I tell you what happens if the rules are more complex: “a mess of SQL jank”, that’s what.

So what instead? 

Read on to learn about the alternative, which is an interesting approach. This is another case where a good SQL-based solution is almost a pivot of a good imperative solution: instead of thinking in rows, focus on columns first and let the rows sort themselves out.

Leave a Comment

Incremental Data Migration to Blob Storage

Ginger Daniel has started a series on data migration into Azure Blob Storage:

Part 1 of this article demonstrates how to upload multiple tables from an on-premise SQL Server to an Azure Blob Storage account as csv files.  I covered these basic steps to get data from one place to the other using Azure Data Factory, however there are many other alternative ways to accomplish this, and many details in these steps that were not covered.  For a deep-dive into the details you can start here https://docs.microsoft.com/en-us/azure/data-factory/introduction, and https://docs.microsoft.com/en-us/azure/data-factory/quickstart-create-data-factory-portal#create-a-pipeline

Part 1 was chock full of information, and it looks like Part 2 will be as well.

Leave a Comment

Azure Data Factory Switch Activity

Rayis Imayev explains what the Switch activity does in Azure Data Factory:

Developing conditional logic of your Azure Data Factory control flow has been simplified with introducing of the Switch activity – https://docs.microsoft.com/en-us/azure/data-factory/control-flow-switch-activity. Official documentation resource states, this new data factory activity “provides the same functionality that a switch statement provides in programming languages“. I would also add a more simplified definition of the Switch activity in Azure Data Factory: it is a container (or wrapper) for multiple IF conditions.

Click through for an example.

Comments closed

Variable Header Counts and Azure Data Factory

Mark Kromer shows how you can convince Azure Data Factory to skip a variable number of lines before beginning processing:

A common requirement that I see from customers who are processing text files in data lakes with Azure Data Factory, is to read and process files where there are variable numbers of lines that precede both the data headers and the data that needs to be processed. ADF already has facilities that handle the ability to switch headers off or on as well as the ability to specify parameterized skip line counts. However, in many cases, files that are received for processing have variable numbers of superfluous lines that need to be skipped.

In ADF, between pipeline activities and data flows, there are a number of ways to handle this scenario. In this post, I am going to demonstrate one such technique. 

Read on to see which technique Mark chose and how to get it working.

Comments closed

Azure Data Factory Pipeline Hierarchies

Paul Andrew explains the idea of pipeline hierarchies with respect to Azure Data Factory:

Next, even if the concept isn’t new, where I’d like to call out two big differences in my approach to orchestration with ADF comes from working within Microsoft Azure. The highly scalable cloud platform presents some new challenges that SSIS simply didn’t. For me these are:

– Needing to consider our wider solution and what things now cost. I’m fairly sure I’ve said it before. When working with ‘Pay-as-you-go’ services we need to think about designing for cost/consumption as well as all our other data transformation and output requirements. In Azure it is so easy to just leave resources running night and day, when only a short window of compute is needed.
– We need to consider the scale out capabilities of the other services that ADF is going to invoke. Or, to put it another way, how much parallel activity execution do we want ADF to achieve? As you may know the ADF ForEach activity by default allows us to execution inner activities in parallel, but is that enough?

It’s a very interesting idea; read the whole thing.

Comments closed

Azure Data Factory and Schema Drift

Mark Kromer walks us through two techniques we can use in Azure Data Factory to deal with schema drift:

Azure Data Factory’s Mapping Data Flows have built-in capabilities to handle complex ETL scenarios that include the ability to handle flexible schemas and changing source data. We call this capability “schema drift“.

When you build transformations that need to handle changing source schemas, your logic becomes tricky. In ADF, you can either build data flows that always look for patterns in the source and utilize generic transformation functions, or you can add a Derived Column that defines your flow’s canonical model.

Click through for the discussion and comparison. Schema drift has been the bane of Integration Services’s existence, so it’s good to see them tackling the idea in Azure Data Factory.

Comments closed