Press "Enter" to skip to content

Category: ETL / ELT

A Brief Overview of 21 ETL Tools in Python

Adron Hall makes a list:

Here are summaries of each of these tools, along with examples of how to implement the ETL (Extract, Transform, Load) process using each one within a Python workflow:

  1. Apache Spark: Apache Spark is a powerful open-source cluster-computing framework that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It’s commonly used for processing large-scale data and running complex ETL pipelines. Example Implementation:

Read on for summaries and samples for each of the 21 options.
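
The excerpt cuts off before Adron's sample code, so purely as an illustration (not the snippet from the original post), a minimal PySpark ETL job might look something like this; the file paths and column names are hypothetical:

```python
# Minimal PySpark ETL sketch (illustrative only; paths and columns are hypothetical)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-example").getOrCreate()

# Extract: read raw CSV data
raw = spark.read.option("header", True).csv("/data/raw/orders.csv")

# Transform: drop rows without an ID and derive an order total
cleaned = (
    raw.filter(F.col("order_id").isNotNull())
       .withColumn(
           "order_total",
           F.col("quantity").cast("int") * F.col("unit_price").cast("double"),
       )
)

# Load: write the curated result as Parquet
cleaned.write.mode("overwrite").parquet("/data/curated/orders")

spark.stop()
```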

Storing Log Analytics Queries in Azure Blob Storage

Gilbert Quevauvilliers wants some long-term storage:

Following on in my series, in this blog post I am going to demonstrate how to store Log Analytics Queries in Blob Storage.

This allows me to store the Power BI queries externally from Log Analytics and gives me an easy way to get the data into my Fabric Lakehouse in later steps. To do this I am going to use a Logic App in Azure.

In this series I am going to show you all the steps I took to reach the successful outcome I had with my client.

Read on to see what Gilbert used for the task.
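
Gilbert's solution uses a Logic App; just to show the moving parts, here is a hedged Python sketch of the same idea — run a KQL query against Log Analytics and land the result in Blob Storage — using the azure-monitor-query and azure-storage-blob SDKs. The workspace ID, connection string, container name, and query are placeholders:

```python
# Illustrative sketch only: query Log Analytics and write the result to Blob Storage.
# Workspace ID, connection string, container, and KQL query are all placeholders.
import csv
import io
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient
from azure.storage.blob import BlobServiceClient

credential = DefaultAzureCredential()
logs_client = LogsQueryClient(credential)

# Query the Log Analytics workspace for the last day of Power BI activity
response = logs_client.query_workspace(
    workspace_id="<workspace-guid>",
    query="PowerBIDatasetsWorkspace | take 100",
    timespan=timedelta(days=1),
)

# Flatten the first result table into CSV
table = response.tables[0]
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(table.columns)
writer.writerows(table.rows)

# Upload the CSV to Blob Storage
blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob_client = blob_service.get_blob_client(container="loganalytics", blob="powerbi-queries.csv")
blob_client.upload_blob(buffer.getvalue(), overwrite=True)
```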

Lessons Learned from Azure Data Factory Integrating with DB/2 on Mainframe

Teo Lachev shares some thoughts:

I’ve done a few BI integration projects extracting data from ERPs running on IBM Db2. Most of the implementations would use a hybrid architecture where the ERP would be running on an on-prem mainframe while the data was loaded into Microsoft Azure. Here are a few tips if you’re facing this challenge:

Click through for five major points. Surprisingly, one of them isn’t “Avoid DB/2 like the plague.”

Diving into the Microsoft Fabric Copy Activity

Reza Rad does more than copy:

Copy Activity is one of the most commonly used activities in Microsoft Fabric’s Data Factory pipelines. The Copy Activity copies the data from a source to a destination. However, there is more to it than just a simple copy. In this article, you will learn what Copy Activity is, its rationale, how it works, and its configuration options.

Reza has a video, as well as a demo-heavy full-length article on the topic.

Batch File Importation in SQL Server

Paul White loads things quickly:

All this can be achieved with client-side tools and programming. It can also be done server-side by importing the raw data into a staging table before processing using T-SQL procedures.

Other times, the need arises to ingest data without using client-side tools and without making a complete copy of the raw data on the server. This article describes one possible approach in that situation.

Read on for the process.
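
Paul's article describes a specific server-side approach, so the snippet below is only a generic sketch of the staging-table pattern the excerpt mentions: a Python script using pyodbc to run a BULK INSERT into a staging table and then hand off to a T-SQL procedure. The server, database, file path, and object names are hypothetical, and the file must be readable by the SQL Server service account since BULK INSERT runs server-side:

```python
# Illustrative sketch of the staging-table pattern (not the article's specific technique).
# Server, database, file path, table, and procedure names are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=Staging;Trusted_Connection=yes;",
    autocommit=True,
)
cursor = conn.cursor()

# Load the raw file into a staging table; the path is resolved on the server
cursor.execute(r"""
    BULK INSERT dbo.RawImport
    FROM 'C:\ImportFiles\batch_001.csv'
    WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);
""")

# Process the staged rows with a T-SQL procedure
cursor.execute("EXEC dbo.ProcessRawImport;")
conn.close()
```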

Using the Microsoft Fabric Data Gateway

Reitse Eskens uploads some data:

In a blog from a few weeks ago, I wrote about getting data from your on-prem SQL Server into Fabric. At the time, the only option for a copy dataflow was a direct connection over the internet. That option still exists, but now you can also use the Power BI Data Gateway to get data from your SQL Server into Fabric.

In this blog, I’ll take you through the steps needed and an issue I ran into.

Read on for Reitse’s instructions and how to avoid the issue he ran into.

Migrating Cosmos DB Tables API

Eitan Blumin handles a migration:

A few months ago, I was involved in an interesting project where a large customer (not to be named due to NDA) needed to migrate their entire Azure cloud subscription to another subscription. This was a difficult and arduous process that involved several PaaS technologies, besides SQL Server, that I didn’t have experience with before.

But it presented very interesting challenges and opportunities to learn new things.

One of these was the need to migrate an entire Azure Cosmos DB account using the Table API from one subscription to another.

Read on for the challenge, the intermediate solution using the Cosmos DB Data Migration Tool, and Eitan’s PowerShell script to automate the process. I know and work with most of the people working on the DMT and they’re good folks.
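
For a rough idea of what moving Table API data between accounts looks like in code (this is not the Data Migration Tool or Eitan's PowerShell script), here is a hedged Python sketch using the azure-data-tables SDK; the connection strings and table name are placeholders:

```python
# Illustrative sketch: copy all entities of one table between two Cosmos DB Table API accounts.
# Connection strings and the table name are placeholders.
from azure.data.tables import TableServiceClient

source = TableServiceClient.from_connection_string("<source-cosmos-table-connection-string>")
target = TableServiceClient.from_connection_string("<target-cosmos-table-connection-string>")

table_name = "Orders"
source_table = source.get_table_client(table_name)
target_table = target.create_table_if_not_exists(table_name)

copied = 0
for entity in source_table.list_entities():
    # Upsert keeps the copy idempotent if the script is re-run
    target_table.upsert_entity(entity)
    copied += 1

print(f"Copied {copied} entities from {table_name}")
```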

Updates to Change Data Capture in ADF

Chen Hirsh looks at some updates:

A few months ago I wrote a post about the new change data capture (CDC) feature in Azure Data Factory (ADF) – https://www.madeiradata.com/post/the-wind-of-change-change-data-capture-in-data-factory

Change data capture, as the name suggests, captures the data changes on one system and replicates them to another. Since this is a task that data engineers do a lot, this was a very welcome addition to ADF.

In this post, we’ll explore what is new on this front.

Click through for what’s new, though do be cognizant of which items are in GA and which are still in preview.

Using the Azure Data Factory Self-Hosted Integration Runtime

Chen Hirsh hosts a runtime:

In Azure Data Factory (ADF), an integration runtime is the compute resource your pipelines run on. When you run an application on your computer, it uses the computer’s resources, such as CPU and memory, to run its tasks. When you run activities in a pipeline in ADF, they also need resources to do their job, like copying data or writing a file, and these are provided by the integration runtime.

When you create an instance of ADF, you get a default integration runtime, hosted in the same region that you created ADF in. If you need to, you can add your own integration runtimes, either in Azure or by downloading and installing a self-hosted integration runtime (SHIR) on your own server.

Read on to understand when you would want to use a self-hosted integration runtime and the process to do so. The SHIR also applies to Synapse pipelines and is one of the few ways to move data out of a Synapse workspace with data exfiltration protection enabled.
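
In practice you usually register a SHIR through the portal UI, but as a hedged sketch of the same step in code, here is how the azure-mgmt-datafactory SDK could create a SHIR definition and fetch the authentication key you paste into the on-prem installer. The subscription, resource group, factory, and runtime names are placeholders, and you still install the SHIR software on your own server:

```python
# Illustrative sketch: register a self-hosted integration runtime definition in ADF
# and fetch its authentication key. Names and IDs are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import IntegrationRuntimeResource, SelfHostedIntegrationRuntime

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

resource_group = "my-rg"
factory_name = "my-adf"
runtime_name = "my-shir"

# Create the SHIR definition in the data factory
client.integration_runtimes.create_or_update(
    resource_group,
    factory_name,
    runtime_name,
    IntegrationRuntimeResource(properties=SelfHostedIntegrationRuntime(description="On-prem SHIR")),
)

# Retrieve the key used to register the on-prem SHIR installation
keys = client.integration_runtimes.list_auth_keys(resource_group, factory_name, runtime_name)
print(keys.auth_key1)
```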
