Category: ETL / ELT

Building an ARM Template for Azure Data Factory

Andy Leonard takes the first steps toward building an Azure Data Factory pipeline using Azure Resource Manager templates:

Azure Resource Manager, or ARM, “allows you to provision your applications using a declarative template.” So says the Azure Quickstart Templates page. ARM templates are JSON and allow administrators to import and export Azure resources using varying management patterns. I really like ARM templates for implementing infrastructure as code in Azure. In this post I show a very simple example of how to use ARM templates to export and then import a basic ADF (Azure Data Factory) pipeline.
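To give a sense of what "declarative JSON" means here, the following is a minimal sketch of an ARM template skeleton, built as a Python dictionary and serialized to JSON. It is not Andy's exported template, which will also contain the pipeline definition. The factory name `example-adf` and the region `eastus` are hypothetical placeholders, and the sketch declares only an empty data factory resource.

```python
import json

# Hypothetical values, used only for illustration.
FACTORY_NAME = "example-adf"
LOCATION = "eastus"


def build_arm_template(factory_name: str, location: str) -> dict:
    """Return a minimal ARM template declaring a single, empty data factory."""
    return {
        "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        "resources": [
            {
                # Resource type and API version for an Azure Data Factory (V2) instance.
                "type": "Microsoft.DataFactory/factories",
                "apiVersion": "2018-06-01",
                "name": factory_name,
                "location": location,
                "properties": {},
            }
        ],
    }


template = build_arm_template(FACTORY_NAME, LOCATION)
template_json = json.dumps(template, indent=2)
print(template_json)
```

Because the template is just data, it can live in source control and be generated, compared, or parameterized like any other file, which is where the automation angle comes from.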

The sample code doesn’t do that much by itself, but it does open up a new world of automation.

Minimal Logging with FastLoadContext

Paul White takes us through another way to perform minimally logged bulk loads with SQL Server:

This post provides new information about the preconditions for minimally logged bulk load when using INSERT...SELECT into indexed tables.

The internal facility that enables these cases is called FastLoadContext. It can be activated from SQL Server 2008 to 2014 inclusive using documented trace flag 610. From SQL Server 2016 onward, FastLoadContext is enabled by default; the trace flag is not required.

Without FastLoadContext, the only index inserts that can be minimally logged are those into an empty clustered index without secondary indexes, as covered in part two of this series. The minimal logging conditions for unindexed heap tables were covered in part one.
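The version rules in the excerpt can be restated as a small lookup. This is only a sketch of the logic as quoted above, not anything from Paul's article; the function name is hypothetical, and treating releases before 2008 as "not available" is an assumption read from the phrase "from SQL Server 2008".

```python
def fast_load_context_status(release_year: int) -> str:
    """Summarize how FastLoadContext is activated for a SQL Server release year.

    Based only on the quoted excerpt: trace flag 610 is needed from 2008 to 2014
    inclusive, and the feature is on by default from 2016 onward.
    """
    if release_year >= 2016:
        return "enabled by default"
    if release_year >= 2008:
        return "requires trace flag 610"
    # Assumption: the excerpt implies the facility is not available before 2008.
    return "not available"


for year in (2005, 2008, 2014, 2016, 2019):
    print(year, fast_load_context_status(year))
```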

Click through for a highly informative article.

Populating a Data Vault Model with Azure Data Factory

Rayis Imayev gives us an example of ELT into a Data Vault model using Azure Data Factory:

To make a full transition from the existing DW model to an alternative Data Vault I removed all Surrogate Keys and other attributes that are only necessary to support Kimball data warehouse methodology. Also, I needed to add necessary Hash keys to all my Hub, Link and Satellite tables. The target environment for my Data Vault would be SQL Azure database and I decided to use a built-in crc32 function of the Mapping Data Flow to calculate hash keys (HK) of my business data sourcing keys and composite hash keys of satellite tables attributes (HDIFF).
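The hash-key idea can be sketched outside of ADF with Python's standard `zlib.crc32`. This is an illustration of the general technique, not Rayis's data flow: the `|` delimiter, the trim-and-uppercase normalization, and the function names are assumptions, and the Mapping Data Flow `crc32` function may format or combine its inputs differently.

```python
import zlib

DELIMITER = "|"  # Assumed separator between key parts; not taken from the article.


def hash_key(*business_keys: str) -> int:
    """Compute an HK-style hash from one or more business key values."""
    # Normalizing first means ' c-1001 ' and 'C-1001' produce the same key.
    normalized = DELIMITER.join(key.strip().upper() for key in business_keys)
    return zlib.crc32(normalized.encode("utf-8"))


def hash_diff(attributes: dict) -> int:
    """Compute an HDIFF-style hash over a satellite row's descriptive attributes."""
    # Sorting by attribute name keeps the hash independent of dictionary order.
    ordered_values = (str(attributes[name]) for name in sorted(attributes))
    return zlib.crc32(DELIMITER.join(ordered_values).encode("utf-8"))


customer_hk = hash_key("C-1001")               # hub key from a single business key
order_link_hk = hash_key("C-1001", "ORD-42")   # link key from a composite of keys
customer_hdiff = hash_diff({"name": "Contoso", "city": "Oslo"})
print(customer_hk, order_link_hk, customer_hdiff)
```

Comparing a freshly computed HDIFF against the stored one is what lets a load detect changed satellite rows without comparing every column. Keep in mind that CRC32 yields only a 32-bit value, so collisions become a practical concern as the number of keys grows.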

Data Vault is somewhere on my list of things to learn. It’s not at the top of the list, but that’s not a slight against it.

Amazon Redshift ETL Tips

The Blendo team shares a few tips around ETL’ing data to Amazon Redshift:

2. The WLM Method
Use Amazon Redshift’s WLM (workload management) for defining a dedicated queue for the ETL process. Configuring the ETL queue with a small number of slots will help in avoiding excessive COMMITs. Also, avoid COMMITing separately for each transaction since commits are expensive.
Instead, surround multiple steps of the ETL process by a BEGIN…END statement. You can perform COMMIT only after all transformation logic is executed.
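The single-commit pattern looks roughly like this. To keep the sketch runnable without a cluster, it uses Python's built-in `sqlite3` as a stand-in for Redshift; the table names and the transformation are hypothetical, and a real job would use a Redshift connection and the dedicated WLM queue described above.

```python
import sqlite3

# isolation_level=None puts sqlite3 in autocommit mode, so the BEGIN and COMMIT
# below are the only transaction boundaries. SQLite stands in for Redshift here.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE staging (id INTEGER, amount REAL)")
conn.execute("CREATE TABLE fact_sales (id INTEGER, amount REAL)")


def run_etl_batch(connection, rows):
    """Run every ETL step inside one transaction and commit once at the end."""
    connection.execute("BEGIN")
    try:
        # Step 1: land the raw rows.
        connection.executemany("INSERT INTO staging VALUES (?, ?)", rows)
        # Step 2: transform and load, keeping only positive amounts.
        connection.execute(
            "INSERT INTO fact_sales SELECT id, amount FROM staging WHERE amount > 0"
        )
        # Step 3: clear the staging table for the next batch.
        connection.execute("DELETE FROM staging")
        # One COMMIT for all three steps instead of one per statement.
        connection.execute("COMMIT")
    except Exception:
        connection.execute("ROLLBACK")
        raise


run_etl_batch(conn, [(1, 10.0), (2, -5.0), (3, 7.5)])
loaded = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
remaining = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
print(loaded, remaining)
```

A side benefit of the single transaction is that a failure in any step rolls back the whole batch, so the target table never sees a half-applied load.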

Click through for the set of tips.

Loading Data Into SnowflakeDB

Dan Bilsborough shows a couple of ways to load data into SnowflakeDB from Azure:

Before being loaded into a Snowflake table, the data can be optionally staged, which is essentially just a pointer to a location where the files are stored. There are different types of stages including:
– User stages, which each user will have by default
– Table stages, which each table will have by default
– Internal named stages, meaning staged within Snowflake

Internal named stages are the best option for regular data loads, if you are thinking along the lines of your standard daily ETL process. One benefit of these is the flexibility in that they are database objects, so you can grant privileges to roles to access these objects as you would expect. Alternatively, there are external stages, such as Azure Blob storage.
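Each stage type has its own reference syntax in Snowflake SQL: `@~` for the user stage, `@%table_name` for a table stage, and `@stage_name` for a named stage. The helper below is a small sketch that builds those references and a basic `COPY INTO` statement as strings. The function names, the `sales` table, and the `daily_etl_stage` stage are hypothetical, and the sketch only formats SQL text; it does not connect to Snowflake.

```python
def stage_reference(kind: str, name: str = "") -> str:
    """Return the Snowflake reference for a user, table, or named stage."""
    if kind == "user":
        return "@~"            # every user has one by default
    if kind == "table":
        return f"@%{name}"     # every table has one by default
    if kind == "named":
        return f"@{name}"      # a database object created with CREATE STAGE
    raise ValueError(f"unknown stage kind: {kind!r}")


def copy_statement(table: str, stage_ref: str) -> str:
    """Build a basic COPY INTO statement that loads a table from a stage."""
    return f"COPY INTO {table} FROM {stage_ref}"


# Hypothetical object names, for illustration only.
print(copy_statement("sales", stage_reference("user")))
print(copy_statement("sales", stage_reference("table", "sales")))
print(copy_statement("sales", stage_reference("named", "daily_etl_stage")))
```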

Read on to see what comes next.

Azure Data Factory: Mapping and Wrangling Data Flows

Cathrine Wilhelmsen explains the difference between Mapping Data Flows and Wrangling Data Flows in Azure Data Factory:

Now, we all know that the consultant answer to “which should I use?” is It Depends ™ 🙂 But what does it depend on?

To me, it boils down to a few key questions you need to ask:
– What is the task or problem you are trying to solve?
– Where and how will you use the output?
– Which tool are you most comfortable using?

Read on to see how they both work.

Syncing Google Calendar and Outlook with Microsoft Flow

Eugene Meidinger shows how to keep a couple of calendars in sync:

Back when I worked a normal job, I had two calendars: Office 365 for work and Google for home. Now that I work for myself, that’s a lot more complicated. Sometimes a customer will create an account for me in their network. Sometimes I’ll partner with other consultants and work as part of their team. And of course, I’ve got my own work email at eugene@sqlgene.com.

I need all of these calendars to consolidate to one place. My natural inclination and personal preference is to put it all into Google. Now, there are sync apps available, but this sort of problem is a perfect use case. A calendar event is created in Outlook, a flow is triggered, and that information is transferred to Google.

I was just complaining about this yesterday, and then I saw the post this morning. I’m pleased, though simultaneously concerned that Eugene has bugged my hotel room.

Data Cleansing Options with Azure

James Serra explains when you should use different Azure services for data cleansing:

Clean the data and optionally aggregate it as it sits in the source system. The tool used for this would depend on the source system that stores the data (e.g. if SQL Server, you would use stored procedures). The only benefit of this option is that, if you aggregate the data, you will move less data from the source system to Azure, which can be helpful if you have a small pipe to Azure and don’t need the row-level details. The disadvantages are: the raw source data is not available in the data lake, so you would always need to go back to the source system if you needed to get it again, and it may not even still exist in the source system; you would put extra stress on the source system when doing the cleaning, which could affect end users using the system; it could take a long time to clean the data as the source system may not have fast performance; and you would not be able to use other tools (e.g. Hadoop, Databricks) to clean it. Strongly advise against this option.

Read on for additional options and James’s recommendations.

Extracting the First Element from an Array in ADF

Rayis Imayev shows how you can find the first element in an array using Azure Data Factory:

A user recently asked me a question on my previous blog post (Setting Variables in Azure Data Factory Pipelines) about the possibility of extracting the first element of a variable if this variable is a set of elements (array).

So as a spoiler alert, before writing a blog post and adding a bit more clarity to the existing Microsoft ADF documentation, here is a quick answer to this question.

You’ll have to click through even for the quick answer.

ETL with Spark and Hive

Emrah Mete gives us an example of using Apache Spark for ETL into Apache Hive:

Now let’s build the sample application. In the example, we will first send the data from our Linux file system to HDFS, the data storage unit of the Hadoop ecosystem (Extraction). Then we will read the data we have written there with Spark, apply a simple Transformation, and write the result to Hive (Load). Hive is a layer that allows us to query the data stored in the Hadoop ecosystem. With this infrastructure, we can easily query the data in our big data environment using SQL.

Most of the things relational database professionals do are pretty much the same things that you do with Spark and Hive. There are differences in implementation and level of programming familiarity, but they’re pretty similar.
