Category: ETL / ELT

When you include literal values in your join conditions, Spark may see that as a requirement to perform a full cartesian product first, then filter out the joined values. But if you ensure that you (1) have column values from both sides of your join condition, you can avoid this Spark-induced cartesian product and improve the performance of your joins and data flows. (2) Avoid use of literal conditions to represent the results of one side of your join condition.
In other words, avoid this for your join condition:source1@movieId == '1'Instead, implement that with a dummy derived column.

There are several good tips in this post.

Comments closed

Loading Event Hubs from Cosmos DB

Published 2020-01-14 by Kevin Feasel

Annie Xu shows us how we can use Azure Functions to take data from Cosmos DB and populate Event Hubs:

One way to load data from Cosmos DB to Event hub is to use Azure Function. But although there is many coding samples out there to create such Azure Function. If you are like me do not have much application development experience, reading those code samples is bit channenging. Luckly, Azure Portal made is so easy.

Annie has a step-by-step walkthrough which makes it easy.

Comments closed

Azure Data Factory Notifications

Published 2019-12-31 by Kevin Feasel

Rayis Imayev walks us through three different techniques for sending notifications in Azure Data Factory:

While working on data integration projects and using Azure Data Factory as your main orchestration tool will help you to develop strategic forward thinking about your development tasks: to ponder on what your data sources are, point of destinations to land this information into a new data model and transformation steps to shape data from the source to its destination. Just like when you play chess and have to plan ahead several of your next moves.

Along with this structural thinking to develop and execute your data flows, timely notifications of when something goes left or right would give you additional peace of mind.

Something I appreciate in this post is that Rayis contrasts the Azure Data Factory techniques with SSIS methods, giving you a solid base for comparison.

Comments closed

Parsing ADF ARM Templates with T-SQL

Published 2019-12-26 by Kevin Feasel

Paul Andrew shows how you can use T-SQL to read an Azure Data Factory ARM template:

While documenting a customers data platform solution I decided it would be far easier if we could summarise the contents of a fairly complex Data Factory using its ARM Template. So, this is what I’ve done using T-SQL to parse the ARM Template JSON and output of series of tables containing details about the factory components.

That is quite the clever solution.

Comments closed

Wrapping Up Azure Data Factory

Published 2019-12-26 by Kevin Feasel

Cathrine Wilhelmsen wraps up a long series on Azure Data Factory with three final posts. First is lookups:

Lookups are similar to copy data activities, except that you only get data from lookups. They have a source dataset, but they do not have a sink dataset. (So, like… half a copy data activity? :D) Instead of copying data into a destination, you use lookups to get configuration values that you use in later activities.
And how you use the configuration values in later activities depends on whether you choose to get the first row only or all rows.

From there, it’s the bottom line question:

Congratulations! You’ve made it through my entire Beginner’s Guide to Azure Data Factory 🤓 We’ve gone through the fundamentals in the first 23 posts, and now we just have one more thing to talk about: Pricing.
And today, I’m actually going to talk! You see, in November 2019, I presented a 20-minute session at Microsoft Ignite about understanding Azure Data Factory pricing. And since it was recorded and the recording is available for free for everyone… Well, let’s just say that after 23 posts, I think we could both appreciate a short break from reading and writing

In case you missed anything, Cathrine has a summary and shows where you can learn a lot more:

After this, I will be taking a break from creating new content. However, I will continue to edit, update, tweak, rewrite, and improve all 25 posts already published. I originally published one post per day as an Azure Data Factory Advent Calendar, and even while writing I noticed things that I didn’t have time to cover or things that I wanted to go back and improve. But! I needed to get all the posts published first. I consider this the first edition of the series. Now, the editing begins. Then, I will do my best to keep the content updated as Azure Data Factory keeps evolving

This was a huge series; kudos to Cathrine for putting it all together.

Comments closed

Parameters, Variables, and ForEach Loops in ADF

Published 2019-12-23 by Kevin Feasel

Cathrine Wilhelmsen has a few more posts in the Azure Data Factory series for us. First up is on parameters:

We can build dynamic solutions!
Creating hardcoded datasets and pipelines is not a bad thing in itself. It’s only when you start creating many similar hardcoded resources that things get tedious and time-consuming. Not to mention, the risk of manual errors goes drastically up when you feel like you create the same resource over and over and over again.

After that is variables:

Parameters are external values passed into pipelines. They can’t be changed inside a pipeline. Variables, on the other hand, are internal values that live inside a pipeline. They can be changed inside that pipeline.
Parameters and variables can be completely separate, or they can work together. For example, you can pass a parameter into a pipeline, and then use that parameter value in a set variable or append variable activity.

And the latest post in the series is all about ForEach loops:

By default, the foreach loop tries to run as many iterations as possible in parallel. You can choose to run them sequentially instead, for example if you need to copy data into a single table and want to ensure that each copy finishes before the next one starts.
If you choose to run iterations in parallel, you can limit the number of parallel executions by setting the batch count. The default number is 20 and the max number is 50.

This has been a very nice series, and it looks like there is a little bit more to go.

Comments closed

Azure Data Factory Templates and Source Control

Published 2019-12-20 by Kevin Feasel

Cathrine Wilhelmsen continues a series on Azure Data Factory. First up is source control:

And yeah, I usually recommend that you set up source control early in your project, and not on day 18… However, it does require some external configuration, and in this series I wanted to get through the Azure Data Factory basics first. But by now, you should know enough to decide whether or not to commit to Azure Data Factory as your data integration tool of choice.

Next up is using the template gallery:

You can also create custom templates and share them with your team – or share them externally with others. Custom templates are saved in your code repository and will show up in the template gallery for you and your team. If you want to share them externally, you can easily export them, so others can import them in their Azure Data Factory.
Let’s take a look!

Read on to learn more.

Comments closed

DBLog: CDC for MySQL and Postgres

Published 2019-12-19 by Kevin Feasel

Andreas Andreakis and Ioannis Papapanagiotou announce a new change data capture tool for open source databases:

In databases like MySQL and PostgreSQL, transaction logs are the source of CDC events. As transaction logs typically have limited retention, they aren’t guaranteed to contain the full history of changes. Therefore, dumps are needed to capture the full state of a source. There are several open source CDC projects, often using the same underlying libraries, database APIs, and protocols. Nonetheless, we found a number of limitations that could not satisfy our requirements e.g. stalling the processing of log events until a dump is complete, missing ability to trigger dumps on demand, or implementations that block write traffic by using table locks.
This motivated the development of DBLog, which offers log and dump processing under a generic framework. In order to be supported, a database is required to fulfill a set of features that are commonly available in systems like MySQL, PostgreSQL, MariaDB, and others.

It looks like DBLog is not open source just yet, but that’s forthcoming.

Comments closed

Recommendations for Implementing Azure Data Factory

Published 2019-12-19 by Kevin Feasel

Paul Andrew has a nice set of recommendations you should follow when configuring Azure Data Factory:

Building on our understanding of generic datasets, a good Data Factory should include (where possible) generic pipelines, these are driven from metadata to simplify (as a minimum) data ingestion operations. Typically I use an Azure SQLDB to house my metadata with stored procedures that get called via Lookup activities to return everything a pipeline needs to know.
This metadata driven approach means deployments to Data Factory for new data sources are greatly reduced and only adding new values to a database table is required. The pipeline itself doesn’t need to be complicated. Copying CSV files from a local file server to Data Lake Storage could be done with just three activities, shown below.

There are several good recommendations here; read the whole thing.

Comments closed

Data Copy & Package Execution in ADF

Published 2019-12-18 by Kevin Feasel

Cathrine Wilhelmsen continues a series on Azure Data Factory. First, we get to see how to copy data from on-prem SQL Servers:

In the previous post, we looked at the three different types of integration runtimes. In this post, we will first create a self-hosted integration runtime. Then, we will create a new linked service and dataset using the self-hosted integration runtime. Finally, we will look at some common techniques and design patterns for copying data from and into an on-premises SQL Server.
And when I say “on-premises”, I really mean “in a private network”. It can either be a SQL Server on-premises on a physical server, or “on-premises” in a virtual machine.

Then, we learn how to run SSIS packages in Azure Data Factory:

Two posts ago, we looked at the three types of integration runtimes and created an Azure integration runtime. In the previous post, we created a self-hosted integration runtime for copying SQL Server data. In this post, we will complete the integration runtime part of the series. We will look at what SSIS Lift and Shift is, how to create an Azure-SSIS integration runtime, and how you can start executing SSIS packages in Azure Data Factory.

I’m going to guess that the next post will be all about the third integration runtime.

Comments closed

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31