Press "Enter" to skip to content

Category: ETL / ELT

Exporting a Hive Table to CSV

The Hadoop in Real World team shows how you can export data from a Hive table into a comma-separated values file:

It is a pretty common use case to export the contents of a Hive table into a CSV file. It’s pretty simple if you are using a recent version of Hive. In this post, we will see how to achieve this with both newer and older versions of Hive.

Read on to see both versions of the answer.
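In broad strokes, the two approaches look something like the sketch below (table and path names are made up for illustration): newer versions of Hive can write a delimited result set straight to a directory, while older versions typically pipe the CLI output through sed.

```python
import subprocess

# Newer Hive (0.11+): write the query result to a directory as CSV.
# Table and path names are placeholders, not from the original post.
new_style = """
INSERT OVERWRITE DIRECTORY '/tmp/sales_export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM sales;
"""
subprocess.run(["hive", "-e", new_style], check=True)

# Older Hive: run the query from the CLI and convert the tab-delimited
# output into a CSV file with sed.
old_style = "hive -e 'SELECT * FROM sales' | sed 's/\\t/,/g' > /tmp/sales_export.csv"
subprocess.run(old_style, shell=True, check=True)
```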


Scaling ADF and Synapse Analytics Pipelines

Paul Andrew has a process for us:

Back in May 2020 I wrote a blog post about ‘When You Should Use Multiple Azure Data Factory’s’. Following on from this post, with a full year+ now passed and having implemented many more data platform solutions for some crazy massive (technical term) enterprise customers, I’ve been reflecting on these scenarios. Specifically considering:

– The use of having multiple regional Data Factory instances and integration runtime services.

– The decoupling of wider orchestration processes from workers.

Furthermore, to supplement this understanding and for added context, in December 2020 I wrote about Data Factory Activity Concurrency Limits – What Happens Next? and Pipelines – Understanding Internal vs External Activities. Both of which now add to a much clearer picture regarding the ability to scale pipelines for the purposes of large-scale extraction and transformation processes.

Read on for details about the scenario, as well as a design pattern to explain the process. This is a large solution for a large-scale problem.
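As a rough illustration of that orchestrator/worker split (all names below are placeholders, not Paul’s design), a parent process can hand work to pipelines in separate regional factories through the ADF management SDK:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Placeholder identifiers for illustration only.
subscription_id = "<subscription-id>"
worker_factories = [
    ("rg-data-weu", "adf-worker-westeurope"),
    ("rg-data-eus", "adf-worker-eastus"),
]

credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, subscription_id)

# The orchestrator triggers the same worker pipeline in each regional
# factory rather than doing the copy work itself.
for resource_group, factory_name in worker_factories:
    run = adf_client.pipelines.create_run(
        resource_group,
        factory_name,
        "pl_worker_copy",
        parameters={"Region": factory_name},
    )
    print(factory_name, run.run_id)
```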


Combining Change Data Capture with Azure Data Factory

Reitse Eskens continues a series on learning Azure Data Factory:

In my last blog, I pulled all the data from my table to my datalake storage. But, when data changes, I don’t want to perform a full load every time. Because it’s a lot of data, it takes time and somewhere down the line I’ll have to separate the changed rows from the identical ones. Instead of doing full loads every night or day or hour, I want to use a delta load. My pipeline should transfer only the new and changed rows. Very recently, Azure SQL DB finally added the option to enable Change Data Capture. This means after a full load, I can get the changed records only. And with changed records, it means the new ones, the updated ones and the deleted ones.

Let’s find out how that works.

Read on for the article and demonstration.
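The T-SQL side of that is fairly compact; here is a hedged sketch (connection details and table names are made up) of enabling CDC and reading back only the changed rows:

```python
import pyodbc

# Placeholder connection string and table names.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};Server=tcp:myserver.database.windows.net;"
    "Database=mydb;Uid=loader;Pwd=<password>;Encrypt=yes"
)
cursor = conn.cursor()

# Enable CDC on the database, then on the table to track.
cursor.execute("EXEC sys.sp_cdc_enable_db;")
cursor.execute("""
    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'Sales',
        @role_name     = NULL;
""")
conn.commit()

# Later, pull only the rows that changed between two LSNs; inserts,
# updates and deletes each come back with an operation code.
changes = cursor.execute("""
    DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Sales');
    DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();
    SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Sales(@from_lsn, @to_lsn, N'all');
""").fetchall()
```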


How Dynamic Data Masking Interacts with Bulk Copy (BCP)

Kenneth Fisher puts on a lab coat:

Hypothesis: If I have Dynamic Data Masking enabled on a column then when I use something like BCP to pull the data out it should still be masked.

I’m almost completely certain this will be the case but I had someone tell me they thought it would go differently, and since neither of us had actually tried this out it seemed like time for a simple experiment.

Click through for the experiment and its results.
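If you want to sketch the setup yourself (object names here are hypothetical, not Kenneth’s), it amounts to masking a column and then exporting the table with bcp under a login that does not have UNMASK:

```python
import subprocess
import pyodbc

# Placeholder connection details and object names.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};Server=localhost;"
    "Database=TestDb;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# Mask the email column; logins without UNMASK see values like aXXX@XXXX.com.
cursor.execute("""
    ALTER TABLE dbo.People
    ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');
""")
conn.commit()

# Export with bcp as a low-privilege login; the hypothesis is that the
# output file should contain the masked values.
subprocess.run(
    ["bcp", "TestDb.dbo.People", "out", "people.csv",
     "-S", "localhost", "-U", "low_priv_user", "-P", "<password>",
     "-c", "-t,"],
    check=True,
)
```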


Parameterizing ADF Pipelines

Reitse Eskens continues a series on learning Azure Data Factory:

In my previous blog I created the integration runtimes, and the linked services. However, we need to create new datasets. If you remember, and I don’t blame you if you don’t, the dataset I created contained a reference to a table. That’s nice, but this time we don’t want just one table, we want a number of tables.

Click through to check it out.
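For context, a parameterized dataset looks roughly like the JSON below (shown as a Python dict; all names are invented): the schema and table come from dataset parameters instead of being hard-coded, so one dataset can serve every table in a loop.

```python
import json

# A generic Azure SQL dataset whose schema and table are supplied at
# runtime, for example by a ForEach loop. Names are placeholders.
dataset = {
    "name": "ds_sql_generic",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": {
            "referenceName": "ls_azure_sql",
            "type": "LinkedServiceReference",
        },
        "parameters": {
            "SchemaName": {"type": "String"},
            "TableName": {"type": "String"},
        },
        "typeProperties": {
            "schema": {"value": "@dataset().SchemaName", "type": "Expression"},
            "table": {"value": "@dataset().TableName", "type": "Expression"},
        },
    },
}

print(json.dumps(dataset, indent=2))
```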


Executing Azure Data Factory Pipelines with Power App

Rayis Imayev has a plan:

One of my university professors liked to tell us a quote, “The Sleep of Reason Produces Monsters”, in a way to help us, his students, to stay active in our thinking process. I’m not sure if Francisco Goya, had a similar aspiration when he was creating his artwork with the same name.

So, let me explain my reasons to create a solution to trigger Azure Data Factory (ADF) pipelines from a Power App and why it shouldn’t be considered as a monster 🙂

If that’s not an introduction enticing enough to get you to read the whole thing, I don’t know what is.
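However the Power App side is wired up, the call it ultimately needs to make is ADF’s createRun REST endpoint. Here is a minimal sketch of that call (subscription, factory, and pipeline names are placeholders, and Rayis’s actual plumbing may differ):

```python
import requests
from azure.identity import DefaultAzureCredential

# Placeholder identifiers for illustration only.
subscription_id = "<subscription-id>"
resource_group = "rg-data"
factory_name = "adf-demo"
pipeline_name = "pl_load_sales"

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token

url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    f"/resourceGroups/{resource_group}/providers/Microsoft.DataFactory"
    f"/factories/{factory_name}/pipelines/{pipeline_name}/createRun"
    "?api-version=2018-06-01"
)

# The request body holds the pipeline parameters; the response contains the run id.
resp = requests.post(url, headers={"Authorization": f"Bearer {token}"},
                     json={"LoadDate": "2021-07-01"})
resp.raise_for_status()
print(resp.json()["runId"])
```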


From SQL Server to Excel via R

Kevin Wilkie wraps up a series on data movement between Excel and SQL Server:

In today’s post, we’ll go over how to export the data you have in SQL Server to Excel via one of my favorite computer languages – R. (Since we did have a post on how to Import data, it would seem rather rude not to have one on how to Export data.)

As always, you’ll need to open your R tool of choice. I tend to use RStudio but there are several out there that will accomplish this same goal.

Click through to see how.
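Kevin’s post does it in R; purely as an analogous sketch in Python rather than his method (connection string, query, and file names are placeholders), the same round trip would look like this:

```python
import pandas as pd
import pyodbc

# Placeholder connection string and query.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};Server=localhost;"
    "Database=AdventureWorks;Trusted_Connection=yes;"
)

# Pull the result set into a data frame, then write it to an .xlsx file
# (the openpyxl package handles the Excel output).
df = pd.read_sql("SELECT TOP (100) * FROM Person.Person;", conn)
df.to_excel("person_export.xlsx", sheet_name="Person", index=False)
```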


Speeding Up Azure Data Factory Pipelines

Hiram Fleitas doesn’t have all day to wait for that pipeline to finish:

His issue was pretty much as mentioned in the title. Our bank’s Azure Data Factory pipeline is running slow moving data from on-prem. We’re copying all tables in a SQL Server database, files from FTP sites, and network share drives to Azure SQL DB Managed Instance and to blob storage (our data lake). Do you have some recommendations on how to make it go faster? It’s around 300 GB and takes over 8 hours.

So I replied with the following and figured to post it here as it may help others.

Hiram has a video, as well as specific advice to offer.
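For context, most of the copy activity’s tuning knobs live in its typeProperties; the sketch below is illustrative only, with example values rather than recommendations:

```python
# A trimmed Copy activity definition (as a Python dict) showing the
# performance-related settings; names and values are examples only.
copy_activity = {
    "name": "CopyTableToLake",
    "type": "Copy",
    "typeProperties": {
        "source": {"type": "SqlServerSource"},
        "sink": {"type": "ParquetSink"},
        "dataIntegrationUnits": 32,   # scale-out compute units for the copy
        "parallelCopies": 8,          # concurrent reads/writes within the copy
        "enableStaging": True,        # stage through blob storage when it helps
        "stagingSettings": {
            "linkedServiceName": {
                "referenceName": "ls_staging_blob",
                "type": "LinkedServiceReference",
            }
        },
    },
}
```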


Staging Your Data with ETL

Martin Schoombee provides some advice on creating ETL processes:

The concept of staging is not a complicated one, but you shouldn’t be deceived by the apparent simplicity of it. There’s a lot you can do in this phase of your ETL process, and it is as much a skill as it is an art to get it all right and still appear simplistic.

I have a few primary objectives when designing a staging process: Efficiency, Modularity, Recoverability & Traceability. Let’s take a closer look at each one of these, and some ideas & good practices that will help you achieve it…

Click through for several tips around each of those points.
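As a bare-bones illustration of those objectives (table and column names are made up), a staging load might truncate and reload the staging table, tag every row with a batch id, and log the row count for traceability:

```python
import uuid
import pyodbc

# Placeholder connection details and table names.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};Server=localhost;"
    "Database=DW;Trusted_Connection=yes;"
)
cursor = conn.cursor()
batch_id = str(uuid.uuid4())

# Staging tables are disposable: reload them from scratch each run.
cursor.execute("TRUNCATE TABLE stg.Customer;")

# Tag every staged row with the batch id so downstream steps and reruns
# can be traced back to a single load.
cursor.execute("""
    INSERT INTO stg.Customer (CustomerId, CustomerName, LoadBatchId)
    SELECT CustomerId, CustomerName, ?
    FROM src.Customer;
""", batch_id)
rows_loaded = cursor.rowcount

# Record what happened, for recoverability and traceability.
cursor.execute("""
    INSERT INTO audit.LoadLog (BatchId, TableName, RowsLoaded, LoadedAt)
    VALUES (?, 'stg.Customer', ?, SYSUTCDATETIME());
""", batch_id, rows_loaded)

conn.commit()
```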


Limitations with Control Flows in Azure Data Factory

Meagan Longoria has a list:

If you’ve been using Azure Data Factory for a while, you might have hit some limitations that don’t exist in tools like SSIS or Databricks. Knowing these limitations up front can help you design better pipelines, so I’m listing a few here of which you’ll want to be aware.

1. You cannot nest For Each activities.
Within a pipeline, you cannot place a For Each activity inside of another For Each activity. If you need to iterate through two datasets you have two main options. You can combine the two datasets before you iterate over them. Or you can use a parent/child pipeline design where you move the inner For Each activity into the child pipeline. Fun fact: currently the Data Factory UI won’t stop you from nesting For Each activities. You won’t find out until you try to execute the pipeline.

Click through for several other limitations and workarounds.
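A sketch of the parent/child workaround from that first item (pipeline and parameter names are invented): the outer For Each stays in the parent and calls a child pipeline per item, and the child holds the inner For Each.

```python
# Parent pipeline definition (as a Python dict): the outer loop only
# calls the child pipeline, so no For Each is nested inside another.
parent_pipeline = {
    "name": "pl_parent",
    "properties": {
        "parameters": {"SchemaList": {"type": "Array"}},
        "activities": [
            {
                "name": "ForEachSchema",
                "type": "ForEach",
                "typeProperties": {
                    "items": {"value": "@pipeline().parameters.SchemaList",
                              "type": "Expression"},
                    "activities": [
                        {
                            "name": "RunChild",
                            "type": "ExecutePipeline",
                            "typeProperties": {
                                "pipeline": {"referenceName": "pl_child",
                                             "type": "PipelineReference"},
                                "parameters": {"SchemaName": "@item()"},
                                "waitOnCompletion": True,
                            },
                        }
                    ],
                },
            }
        ],
    },
}
# The child pipeline (pl_child) contains the inner For Each over tables.
```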
