ETL / ELT – Page 25 – Curated SQL

In my last blog, I pulled all the data from my table to my datalake storage. But, when data changes, I don’t want to perform a full load every time. Because it’s a lot of data, it takes time and somewhere down the line I’ll have to separate the changed rows from the identical ones. Instead of doing full loads every night or day or hour, I want to use a delta load. My pipeline should transfer only the new and changed rows. Very recently, Azure SQL DB finally added the option to enable Change Data Capture. This means after a full load, I can get the changed records only. And with changed records, it means the new ones, the updated ones and the deleted ones.
Let’s find out how that works.

Read on for the article and demonstration.

Comments closed

How Dynamic Data Masking Interacts with Bulk Copy (BCP)

Published 2021-07-21 by Kevin Feasel

Kenneth Fisher puts on a lab coat:

Hypothesis: If I have Dynamic Data Masking enabled on a column then when I use something like BCP to pull the data out it should still be masked.
I’m almost completely certain this will be the case but I had someone tell me they thought it would go differently, and since neither of us had actually tried this out it seemed like time for a simple experiment.

Click through for the experiment and its results.

Comments closed

Parameterizing ADF Pipelines

Published 2021-07-19 by Kevin Feasel

Reitse Eskens continues a series on learning Azure Data Factory:

In my previous blog I created the integration runtimes, and the linked services. However, we need to create new datasets. If you remember, and I don’t blame you if you don’t, the dataset I created contained a reference to a table. That’s nice, but this time we don’t want just one table, we want a number of tables.

Click thorugh to check it out.

Comments closed

Executing Azure Data Factory Pipelines with Power App

Published 2021-06-30 by Kevin Feasel

Rayis Imayev has a plan:

One of my university professors liked to tell us a quote, “The Sleep of Reason Produces Monsters”, in a way to help us, his students, to stay active in our thinking process. I’m not sure if Francisco Goya, had a similar aspiration when he was creating his artwork with the same name.
So, let me explain my reasons to create a solution to trigger Azure Data Factory (ADF) pipelines from a Power App and why it shouldn’t be considered as a monster 🙂

If that’s not an introduction enticing enough to get you to read the whole thing, I don’t know what is.

Comments closed

From SQL Server to Excel via R

Published 2021-06-23 by Kevin Feasel

Kevin Wilkie wraps up a series on data movement between Excel and SQL Server:

In today’s post, we’ll go over how to export the data you have in SQL Server to Excel via one of my favorite computer languages – R. (Since we did have a post on how to Import data, it would seem rather rude not to have one on how to Export data.)
As always, you’ll need to open your R tool of choice. I tend to use RStudio but there are several out there that will accomplish this same goal.

Click through to see how.

Comments closed

Speeding Up Azure Data Factory Pipelines

Published 2021-06-01 by Kevin Feasel

Hiram Fleitas doesn’t have all day to wait for that pipeline to finish:

His issue was pretty much as mentioned on the tile. Our bank’s Azure Data Factory pipeline is running slow moving data from on-prem, we’re copying all tables in a SQL Server database, files from ftp sites and network share drives to Azure SQL DB Managed Instance and to blob storage (our datalake) , do you have some recommendations how to make it go faster? Its around 300GBs and takes over 8 hrs.
So I replied with the following and figured to post it here as it may help others.

Hiram has a video, as well as specific advice to offer.

1 Comment

Staging Your Data with ETL

Published 2021-03-31 by Kevin Feasel

Martin Schoombee provides some advice on creating ETL processes:

The concept of staging is not a complicated one, but you shouldn’t be deceived by the apparent simplicity of it. There’s a lot you can do in this phase of your ETL process, and it is as much a skill as it is an art to get it all right and still appear simplistic.
I have a few primary objectives when designing a staging process: Efficiency, Modularity, Recoverability & Traceability. Let’s take a closer look at each one of these, and some ideas & good practices that will help you achieve it…

Click through for several tips around each of those points.

Comments closed

Limitations with Control Flows in Azure Data Factory

Published 2021-03-29 by Kevin Feasel

Meagan Longoria has a list:

If you’ve been using Azure Data Factory for a while, you might have hit some limitations that don’t exist in tools like SSIS or Databricks. Knowing these limitations up front can help you design better pipelines, so I’m listing a few here of which you’ll want to be aware.
1. You cannot nest For Each activities.
Within a pipeline, you cannot place a For Each activity inside of another For Each activity. If you need to iterate through two datasets you have two main options. You can combine the two datasets before you iterate over them. Or you can use a parent/child pipeline design where you move the inner For Each activity into the child pipeline. Fun fact: currently the Data Factory UI won’t stop you from nesting For Each activities. You won’t find out until you try to execute the pipeline.

Click through for several other limitations and workarounds.

Comments closed

Loading Azure Synapse Analytics using PolyBase

Published 2021-03-19 by Kevin Feasel

Gauri Mahajan needs to load some data:

Azure Synapse Analytics is Microsoft’s data warehousing offering on Azure Cloud. It supports three types of runtimes – SQL Serverless Pool, SQL Dedicated Pool, and Spark Pools. As there are a variety of data sources on Azure, it’s very obvious that there can be varying types and volumes of data that would have to be loaded into Azure Synapse pools. There are three major types of data ingestion approaches that can be used to load data into Synapse. The COPY command is the most flexible and elaborate mechanism, where someone can execute this command from a SQL pool to load data from supported data repositories. This command is convenient to load ad-hoc and small to medium-sized data loads into Synapse. The second method of loading data is the Bulk Insert, where the method name is self-relevant regarding the approach functionality. To ingest the data from supported repositories into dedicated SQL pools, PolyBase is as efficient and at times it’s even more efficient than the COPY command. This article will help you understand the process to ingest data into Azure Synapse Analytics using PolyBase to load the data.

Click through for the process.

Comments closed

Creating Parquet Files from SQL Server Data

Published 2021-03-16 by Kevin Feasel

Andy Leonard answers a challenge:

I searched and found some promising Parquet SSIS components available from CData Software and passed that information along. I shared my inexperience in exporting to parquet format and asked a few friends how they’d done it.
I thought: How many times have I demonstrated Azure Data Factory and clicked right past file format selection without giving Parquet a second thought? Too many times. It was time to change that.

Another route is to use PolyBase. If you’re okay with writing the results to Azure Blob Storage, you can insert directly into Parquet files the results of a SQL query. If that sounds interesting, here are posts on connecting to Azure Blob Storage via PolyBase and inserting into Azure Blob Storage. I insert in CSV format to make it easier for people to follow, but swap the file format with Parquet and it works all the same.

Comments closed

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30

Category: ETL / ELT

Combining Change Data Capture with Azure Data Factory

How Dynamic Data Masking Interacts with Bulk Copy (BCP)

Parameterizing ADF Pipelines

Executing Azure Data Factory Pipelines with Power App

From SQL Server to Excel via R

Speeding Up Azure Data Factory Pipelines

Staging Your Data with ETL

Limitations with Control Flows in Azure Data Factory

Loading Azure Synapse Analytics using PolyBase

Creating Parquet Files from SQL Server Data