Press "Enter" to skip to content

Category: ETL / ELT

From Azure Data Factory to Synapse Pipelines

Kevin Chant copies and pastes:

In this post I want to share an alternative way to copy an Azure Data Factory pipeline to Synapse Studio. Because I think it can be useful.

For those who are not aware, Synapse Studio is the frontend that comes with Azure Synapse Analytics. You can find out more about it in another post I did, which was a five minute crash course about Synapse Studio.

By the end of this post, you will know one way to copy objects used for an Azure Data Factory pipeline to Synapse Studio, which works as long as both are configured to use Git.

Click through to see how.
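The specifics are in Kevin's post, but to illustrate the general Git-based idea (and only that): once both the Data Factory and the Synapse workspace are source-controlled, the pipeline definitions are just JSON files sitting in folders of the two repos, so moving them is essentially a copy-and-commit job. The folder names and paths below are assumptions for illustration, not Kevin's exact steps.

```python
# Rough illustration only: copy pipeline JSON definitions from a local clone
# of an ADF Git repo into a local clone of a Synapse workspace Git repo.
# Folder names ("pipeline", "dataset", "linkedService") are assumptions about
# how the repos are typically laid out; adjust to match your own branches.
from pathlib import Path
import shutil

adf_repo = Path("~/repos/adf-collaboration").expanduser()          # hypothetical clone
synapse_repo = Path("~/repos/synapse-collaboration").expanduser()  # hypothetical clone

for folder in ("pipeline", "dataset", "linkedService"):
    source = adf_repo / folder
    target = synapse_repo / folder
    target.mkdir(parents=True, exist_ok=True)
    for json_file in source.glob("*.json"):
        # Copy the definition; commit and push from the Synapse repo afterwards
        # so Synapse Studio picks the objects up from its collaboration branch.
        shutil.copy2(json_file, target / json_file.name)
        print(f"Copied {json_file.name} -> {folder}/")
```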


ETL via PowerShell

Greg Moore builds a simple ETL process using PowerShell:

Recently a customer asked me to work on a pretty typical project to build a process to import several CSV files into new tables in SQL Server. Setting up a PowerShell script to import the tables is a fairly simple process. However, it can be tedious, especially if the files have different formats. In this article, I will show you how building an ETL with PowerShell can save some time.

It’s a simple process, but that’s a good reminder that simple processes can be good processes.
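Greg's article does this in PowerShell; purely as a sketch of the same shape of job in Python (not his script), with placeholder paths and connection details, the core loop is about this much code:

```python
# Minimal sketch of the same idea in Python rather than PowerShell:
# load every CSV in a folder into its own table in SQL Server.
# The connection string and folder path are placeholders.
from pathlib import Path
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "mssql+pyodbc://user:password@myserver/StagingDB"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

for csv_file in Path("./incoming").glob("*.csv"):
    # Let pandas infer the columns; one table per file, named after the file.
    df = pd.read_csv(csv_file)
    df.to_sql(csv_file.stem, engine, if_exists="replace", index=False)
    print(f"Loaded {len(df)} rows into {csv_file.stem}")
```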


Monitoring Azure Data Factory, Integration Runtimes, and Pipelines

Sandeep Arora monitors all the things:

For effective monitoring of ADF pipelines, we are going to use Log Analytics, Azure Monitor and Azure Data Factory Analytics. The above illustration shows the architectural representation of the monitoring setup.

The details of setting up log analytics, alerts and Azure Data Factory Analytics are further discussed in this section.

If you manage Azure Data Factory in your environment, give this a read.
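As a taster of what the Log Analytics side buys you, here's a hedged sketch that assumes the factory's diagnostic settings already ship pipeline run logs to a workspace; the workspace ID and the KQL are placeholders rather than anything from Sandeep's setup:

```python
# Assumes the Data Factory's diagnostic settings already send pipeline run logs
# to a Log Analytics workspace; the workspace ID below is a placeholder.
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# KQL against the ADFPipelineRun table: failed pipeline runs in the last day.
query = """
ADFPipelineRun
| where Status == 'Failed'
| project TimeGenerated, PipelineName, RunId, Status, FailureType
| order by TimeGenerated desc
"""

response = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",
    query=query,
    timespan=timedelta(days=1),
)

for table in response.tables:
    for row in table.rows:
        print(dict(zip(table.columns, row)))
```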


Interchangeability between ADF and Synapse Integration Pipelines

Paul Andrew makes a discovery:

Inspired by an earlier blog where we looked at ‘How Interchangeable Delta Tables Are Between Databricks and Synapse’, I decided to do a similar exercise, but this time with the integration pipeline components taking centre stage.

As I said in my previous blog post, the question in the heading of this blog should be incredibly pertinent to all solution/technical leads delivering an Azure-based data platform solution, so to answer it directly:

Read on to learn the answer.


Moving Data from Confluent Cloud to Cosmos DB

Nathan Ham announces the Azure Cosmos DB sink connector in Confluent Cloud:

Today, Confluent is announcing the general availability (GA) of the fully managed Azure Cosmos DB Sink Connector within Confluent Cloud. Now, with just a few simple clicks, you can link the power of Apache Kafka® together with Azure Cosmos DB to set your data in motion.

Click through for a marketing-heavy look at how this works.


Exporting a Hive Table to CSV

The Hadoop in Real World team shows how you can export data from a Hive table into a comma-separated values (CSV) file:

It is a pretty common use case to export the contents of a Hive table into a CSV file. It’s pretty simple if you are using a recent version of Hive. In this post, we will see how to achieve this with both newer and older versions of Hive.

Read on to see both versions of the answer.
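For newer Hive versions, the approach generally boils down to an INSERT OVERWRITE DIRECTORY with a comma delimiter. As a rough sketch (table name and output path are placeholders, and the post also covers the older-version route):

```python
# Sketch of the newer-Hive approach (INSERT OVERWRITE DIRECTORY with a comma
# delimiter), driven here through the hive CLI from Python. Table name and
# output path are placeholders; older Hive versions need a different approach.
import subprocess

export_query = """
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/orders_export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM sales.orders;
"""

# hive -e runs the quoted statement and writes the delimited files
# under /tmp/orders_export on the local node.
subprocess.run(["hive", "-e", export_query], check=True)
```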


Scaling ADF and Synapse Analytics Pipelines

Paul Andrew has a process for us:

Back in May 2020 I wrote a blog post about ‘When You Should Use Multiple Azure Data Factory’s’. Following on from this post, with a full year+ now passed and having implemented many more data platform solutions for some crazy massive (technical term) enterprise customers, I’ve been reflecting on these scenarios. Specifically considering:

– The use of having multiple regional Data Factory instances and integration runtime services.

– The decoupling of wider orchestration processes from workers.

Furthermore, to supplement this understanding and for added context, in December 2020 I wrote about Data Factory Activity Concurrency Limits – What Happens Next? and Pipelines – Understanding Internal vs External Activities. Both of which now add to a much clearer picture regarding the ability to scale pipelines for the purposes of large-scale extraction and transformation processes.

Read on for details about the scenario, as well as a design pattern to explain the process. This is a large solution for a large-scale problem.
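One building block of the decoupling idea is being able to kick off a "worker" pipeline that lives in a different Data Factory from the orchestrator. As a hedged sketch, and not Paul's specific pattern, the management SDK can start and poll such a run; every name, ID and parameter below is a placeholder:

```python
# Hedged sketch, not Paul's specific design: start a "worker" pipeline that
# lives in a separate Data Factory from the orchestration process, then poll
# its status. All names, IDs and parameters below are placeholders.
import time
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

run = client.pipelines.create_run(
    resource_group_name="rg-workers-westeurope",
    factory_name="adf-worker-weu",
    pipeline_name="pl_load_sales",
    parameters={"LoadDate": "2021-07-01"},
)

# Poll until the worker run finishes.
while True:
    status = client.pipeline_runs.get(
        resource_group_name="rg-workers-westeurope",
        factory_name="adf-worker-weu",
        run_id=run.run_id,
    ).status
    if status not in ("Queued", "InProgress"):
        break
    time.sleep(30)

print(f"Worker pipeline finished with status: {status}")
```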


Combining Change Data Capture with Azure Data Factory

Reitse Eskens continues a series on learning Azure Data Factory:

In my last blog, I pulled all the data from my table to my datalake storage. But, when data changes, I don’t want to perform a full load every time. Because it’s a lot of data, it takes time and somewhere down the line I’ll have to separate the changed rows from the identical ones. Instead of doing full loads every night or day or hour, I want to use a delta load. My pipeline should transfer only the new and changed rows. Very recently, Azure SQL DB finally added the option to enable Change Data Capture. This means after a full load, I can get the changed records only. And with changed records, it means the new ones, the updated ones and the deleted ones.

Let’s find out how that works.

Read on for the article and demonstration.
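For context on the database side, enabling CDC and then pulling only the changed rows looks roughly like this; the table, capture instance and connection details are placeholders, not Reitse's setup:

```python
# Rough sketch of the SQL side of a delta load: enable CDC on the source table,
# then fetch only the rows changed between two LSNs. Executed here via pyodbc;
# schema/table names and the connection string are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=SourceDb;"
    "UID=etl_user;PWD=<password>",
    autocommit=True,
)
cursor = conn.cursor()

# One-time setup: enable CDC for the database and the source table.
cursor.execute("EXEC sys.sp_cdc_enable_db;")
cursor.execute("""
EXEC sys.sp_cdc_enable_table
     @source_schema = N'dbo',
     @source_name   = N'SalesOrders',
     @role_name     = NULL;
""")

# Per load: fetch inserts, updates and deletes within the captured LSN range.
cursor.execute("""
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_SalesOrders');
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_SalesOrders(@from_lsn, @to_lsn, N'all');
""")
for row in cursor.fetchall():
    print(row)
```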


How Dynamic Data Masking Interacts with Bulk Copy (BCP)

Kenneth Fisher puts on a lab coat:

Hypothesis: If I have Dynamic Data Masking enabled on a column then when I use something like BCP to pull the data out it should still be masked.

I’m almost completely certain this will be the case but I had someone tell me they thought it would go differently, and since neither of us had actually tried this out it seemed like time for a simple experiment.

Click through for the experiment and its results.
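If you want to run the same experiment yourself, the setup is roughly: mask a column, then export the table with bcp under a login that doesn't hold UNMASK. A sketch with placeholder names follows; the actual outcome is in Kenneth's post:

```python
# Sketch of how one might reproduce the experiment: mask a column, then export
# the table with bcp as a login that does not hold UNMASK. Object names,
# server and credentials are placeholders; the result itself is in the post.
import subprocess
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=MaskTest;Trusted_Connection=yes",
    autocommit=True,
)
cursor = conn.cursor()

# Apply a dynamic data mask to the email column.
cursor.execute("""
ALTER TABLE dbo.Customers
ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');
""")

# bcp runs under a low-privilege SQL login (no UNMASK permission), so the
# question is whether the exported file holds masked or clear-text values.
subprocess.run(
    [
        "bcp", "MaskTest.dbo.Customers", "out", r"C:\temp\customers.dat",
        "-S", "myserver", "-U", "limited_user", "-P", "<password>", "-c",
    ],
    check=True,
)
```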


Parameterizing ADF Pipelines

Reitse Eskens continues a series on learning Azure Data Factory:

In my previous blog I created the integration runtimes, and the linked services. However, we need to create new datasets. If you remember, and I don’t blame you if you don’t, the dataset I created contained a reference to a table. That’s nice, but this time we don’t want just one table, we want a number of tables.

Click through to check it out.
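For reference, a parameterised dataset is the usual dataset JSON plus a parameters block and expressions where the schema and table name used to be hard-coded. Roughly (object names are placeholders, not Reitse's):

```python
# Rough shape of a parameterised Azure SQL dataset in ADF: a parameters block
# plus expressions instead of a hard-coded schema and table. Names here are
# placeholders, not the objects from the series.
import json

dataset = {
    "name": "ds_sql_parameterised",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": {
            "referenceName": "ls_azure_sql",
            "type": "LinkedServiceReference",
        },
        "parameters": {
            "SchemaName": {"type": "string"},
            "TableName": {"type": "string"},
        },
        "typeProperties": {
            "schema": {"value": "@dataset().SchemaName", "type": "Expression"},
            "table": {"value": "@dataset().TableName", "type": "Expression"},
        },
    },
}

print(json.dumps(dataset, indent=2))
```

A ForEach in the pipeline can then supply SchemaName and TableName for each table in the list, which is what turns one dataset into a loader for many tables.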
