
Category: ETL / ELT

Building Metadata for an ADF Pipeline

Paul Andrew continues a series on Azure Data Factory and metadata-driven pipelines:

Welcome back friends to part 2 of this 4 part blog series. In this post we are going to deliver on some of the design points we covered in part 1 by building the database to house our processing framework metadata.

Let’s start with a nice new shiny Azure SQLDB database and schema. This can easily be scaled up as our calls from Data Factory increase and ultimately the solution we are using the framework for grows.

Soon we will get to see the power of Azure Data Factory in action.


Incremental Imports with Sqoop

Jon Morisi continues a series on Sqoop:

In my last two blog posts I walked through how to use Sqoop to perform full imports. Nightly full imports with overwrite have their place for small tables like dimension tables. However, in real-world scenarios you’re also going to want a way to import only the delta values since the last time an import was run. Sqoop offers two ways to perform incremental imports: append and lastmodified.

Both incremental imports can be run manually or created as a job using the “sqoop job” command. When running incremental imports manually from the command line, the “--last-value” arg is used to specify the reference value for the check-column. Alternately, Sqoop jobs track the “check-column” in the job, and the value of the check-column is used for subsequent job runs as the where predicate in the SQL statement, i.e. select columns from table where check-column > (last-max-check-column-value).

This is where Sqoop starts to break down for me, and Jon lists some of the issues in the post.
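
To make the two modes a little more concrete, here is a minimal sketch of an append-mode incremental import. It is wrapped in Python purely so the arguments can be commented; the connection string, table, column, and last value are hypothetical placeholders.

    import subprocess

    # Append-mode incremental import: only rows whose check-column value is
    # greater than --last-value get pulled. Swap "append" for "lastmodified"
    # when the check column is a timestamp. Assumes sqoop and the SQL Server
    # JDBC driver are available on this machine.
    sqoop_import = [
        "sqoop", "import",
        "--connect", "jdbc:sqlserver://sqlhost:1433;databaseName=SalesDB",
        "--username", "sqoop_user",
        "--password-file", "/user/sqoop_user/.password",
        "--table", "FactSales",
        "--target-dir", "/data/raw/FactSales",
        "--incremental", "append",
        "--check-column", "SalesID",
        "--last-value", "1000000",
    ]

    subprocess.run(sqoop_import, check=True)

Creating the same import as a saved job with “sqoop job --create” is what lets Sqoop track the last value for you between runs, as the quoted passage describes.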


A Metadata-Driven Framework for ADF Pipelines

Paul Andrew has started a series on metadata-driven Azure Data Factory pipelines:

The concept of having a processing framework to manage our Data Platform solutions isn’t a new one. However, over time, changes in the technology we use mean the way we now deliver this orchestration has to change as well, especially in Azure. On that basis, and using my favourite Azure orchestration service, Azure Data Factory (ADF), I’ve created an alpha metadata-driven framework that could be used to execute all our platform processes. Furthermore, at various community events I’ve talked about bootstrapping solutions with Azure Data Factory, so now, as a technical exercise, I’ve rolled my own simple processing framework. Mainly to understand how easily we can make it with the latest cloud tools and to fully exploit just how dynamic you can get a set of generational pipelines.

This first post lays out some of the architectural decisions and prep work needed for the series.


Using a Spark Listener

Bipin Patwardhan shares with us an event ingestion engine for Apache Spark:

In the last quarter of 2019, I developed a metadata-driven ingestion engine using Spark. The framework/library has multiple patterns to cater to multiple source and destination combinations. For example, two patterns are available for loading flat files to cloud storage (one to load data to AWS S3 and another to load data to Azure Blob).

As data loading philosophies have changed from Extract-Transform-Load (ETL) to Extract-Load-Transform (ELT), such a framework is very useful, as it reduces the time needed to set up ingestion jobs.

Is anyone else getting Integration Services or Informatica flashbacks? Because I sure am.
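
For a sense of what “multiple patterns to cater to multiple source and destination combinations” can look like, here is a rough PySpark sketch of metadata-driven dispatch. This is not Bipin’s framework, just the general shape; the pattern names, paths, and job metadata are invented.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("metadata-ingestion-sketch").getOrCreate()

    def load_flat_file_to_s3(meta):
        # Flat file -> AWS S3 (an s3a:// destination path is assumed).
        df = spark.read.option("header", "true").csv(meta["source_path"])
        df.write.mode("overwrite").parquet(meta["destination_path"])

    def load_flat_file_to_blob(meta):
        # Flat file -> Azure Blob (a wasbs:// or abfss:// destination path is assumed).
        df = spark.read.option("header", "true").csv(meta["source_path"])
        df.write.mode("overwrite").parquet(meta["destination_path"])

    # One entry per supported source/destination combination.
    PATTERNS = {
        "flatfile_to_s3": load_flat_file_to_s3,
        "flatfile_to_blob": load_flat_file_to_blob,
    }

    # In a real framework this metadata would come from a config store, not a literal.
    job_metadata = {
        "pattern": "flatfile_to_s3",
        "source_path": "/landing/sales/2019-12-01.csv",
        "destination_path": "s3a://my-bucket/raw/sales/",
    }

    PATTERNS[job_metadata["pattern"]](job_metadata)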


Using Sqoop to Move Data into Hive

Jon Morisi continues a series on Sqoop:

Sqoop completes the import task by running MapReduce jobs importing the data to HDFS, and then running Hive commands (CREATE TABLE / LOAD DATA INPATH) to move the data to Hive.  The default HDFS location is: /user/[login]/[TABLENAME].  If you have any issues during the import you may need to remove the HDFS directory prior to re-running, or else you will get an Error:

Read on for sample calls and additional notes.
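
Jon’s post has the real calls; purely as a shape reference, a Hive import with the pre-emptive cleanup of the default /user/[login]/[TABLENAME] staging directory might look something like this (connection details, login, and table names are hypothetical):

    import subprocess

    # Clear out a leftover staging directory from a failed run; Sqoop's
    # default import location is /user/<login>/<TABLENAME>.
    subprocess.run(
        ["hdfs", "dfs", "-rm", "-r", "-f", "/user/etl_user/DimCustomer"],
        check=True,
    )

    # Import from SQL Server and let Sqoop run the Hive CREATE TABLE /
    # LOAD DATA INPATH steps afterwards.
    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:sqlserver://sqlhost:1433;databaseName=SalesDB",
        "--username", "etl_user",
        "--password-file", "/user/etl_user/.password",
        "--table", "DimCustomer",
        "--hive-import",
        "--hive-table", "staging.dimcustomer",
    ], check=True)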


Streaming Pipelines in AWS with Flink and Kinesis Data Analytics

Steffen Hausmann shows us how to put together a streaming ETL pipeline in AWS using Apache Flink and Amazon Kinesis Data Analytics:

The remainder of this post discusses how to implement streaming ETL architectures with Apache Flink and Kinesis Data Analytics. The architecture persists streaming data from one or multiple sources to different destinations and is extensible to your needs. This post does not cover additional filtering, enrichment, and aggregation transformations, although that is a natural extension for practical applications.

This post shows how to build, deploy, and operate the Flink application with Kinesis Data Analytics, without further focusing on these operational aspects. It is only relevant to know that you can create a Kinesis Data Analytics application by uploading the compiled Flink application jar file to Amazon S3 and specifying some additional configuration options with the service. You can then execute the Kinesis Data Analytics application in a fully managed environment. For more information, see Build and run streaming applications with Apache Flink and Amazon Kinesis Data Analytics for Java Applications and the Amazon Kinesis Data Analytics developer guide.

Click through for the walkthrough.
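
The “upload the jar to S3 and specify some additional configuration” step boils down to a couple of API calls. Here is a hedged boto3 sketch; the region, role ARN, bucket, file key, and application name are placeholders, and the runtime string has to match the Flink version the jar was built against.

    import boto3

    kda = boto3.client("kinesisanalyticsv2", region_name="eu-west-1")

    # Register the compiled Flink application jar that was uploaded to S3.
    kda.create_application(
        ApplicationName="streaming-etl-demo",
        RuntimeEnvironment="FLINK-1_8",
        ServiceExecutionRole="arn:aws:iam::123456789012:role/kda-service-role",
        ApplicationConfiguration={
            "ApplicationCodeConfiguration": {
                "CodeContentType": "ZIPFILE",
                "CodeContent": {
                    "S3ContentLocation": {
                        "BucketARN": "arn:aws:s3:::my-artifact-bucket",
                        "FileKey": "flink-streaming-etl-1.0.jar",
                    }
                },
            }
        },
    )

    # Start the application in the fully managed environment.
    kda.start_application(
        ApplicationName="streaming-etl-demo",
        RunConfiguration={
            "ApplicationRestoreConfiguration": {
                "ApplicationRestoreType": "SKIP_RESTORE_FROM_SNAPSHOT"
            }
        },
    )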


Executing Azure Data Factory Pipelines with Azure Functions

Paul Andrew wants to execute an Azure Data Factory pipeline via an Azure Function call:

For the function itself, hopefully this is fairly intuitive once you’ve created your DataFactoryManagementClient and authenticated.

The only thing to be careful of is not using the CreateOrUpdateWithHttpMessagesAsync method by mistake. Make sure it’s Create Run. Sounds really obvious, but when you get code drunk, names blur together and the very different method overloads will have you confused for hours!…. According to a friend 🙂

Read the whole thing.
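
Paul’s function is written against the .NET SDK, but the same “create run” call exists in the other SDKs too. A minimal Python sketch using the azure-identity and azure-mgmt-datafactory packages, where the subscription, resource group, factory, and pipeline names are placeholders:

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    # Authenticate and build the management client.
    adf_client = DataFactoryManagementClient(
        credential=DefaultAzureCredential(),
        subscription_id="00000000-0000-0000-0000-000000000000",
    )

    # Trigger the pipeline: the "create run" call, not "create or update".
    run_response = adf_client.pipelines.create_run(
        resource_group_name="rg-data-platform",
        factory_name="adf-framework-dev",
        pipeline_name="MyPipeline",
        parameters={},
    )

    print(run_response.run_id)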


Using Sqoop to Import Data into HDFS

Jon Morisi has a primer on Sqoop:

In this article, I’ll walk through using Sqoop to import data to Hadoop (HDFS).

“Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.”

With respect to SQL Server, Sqoop has two good use cases: pulling data from SQL Server into HDFS, and pulling data from HDFS into a staging table in SQL Server.
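
As a rough illustration of those two use cases (server, database, login, and table names are hypothetical, and the SQL Server JDBC driver is assumed to be on the Sqoop classpath):

    import subprocess

    # Use case 1: pull a SQL Server table into HDFS.
    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:sqlserver://sqlhost:1433;databaseName=SalesDB",
        "--username", "etl_user",
        "--password-file", "/user/etl_user/.password",
        "--table", "Orders",
        "--target-dir", "/data/raw/Orders",
    ], check=True)

    # Use case 2: push HDFS data back into a SQL Server staging table.
    subprocess.run([
        "sqoop", "export",
        "--connect", "jdbc:sqlserver://sqlhost:1433;databaseName=SalesDB",
        "--username", "etl_user",
        "--password-file", "/user/etl_user/.password",
        "--table", "stg_Orders",
        "--export-dir", "/data/raw/Orders",
    ], check=True)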


Resource Limitations with Azure Data Factory

Paul Andrew has a public service announcement for us:

As far as I can tell Microsoft do an excellent job at managing data centre capacity so I completely understand the reason for having limitations on resources in place. There is no such thing as a limitless cloud platform.

Note: in a lot of cases (as you’ll see in the below table for Data Factory) the MAX limitations are only soft restrictions that can easily be lifted via a support ticket. Please check before raising alerts and project risks.

Click through for the limits, and “contact support” definitely is good advice if you’re expecting to push past those limits.


Parameterizing a Data Factory Linked Service to a REST API

Meagan Longoria had to parameterize a linked service connecting to a REST API recently:

In order to pass dynamic values to a linked service, we need to parameterize the linked service, the dataset, and the activity.

I have a pipeline where I log the pipeline start to a database with a stored procedure, look up a username in Key Vault, copy data from a REST API to data lake storage, and log the end of the pipeline with a stored procedure. My username and password are stored in separate secrets in Key Vault, so I had to do a lookup with a web activity to get the username. The password is retrieved using Key Vault inside the linked service. Data Factory doesn’t currently support retrieving the username from Key Vault, so I had to roll my own Key Vault lookup there.

Click through for the instructions.
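
To give a flavour of what parameterizing the linked service involves, below is roughly the JSON body of a parameterized REST linked service, written out here as a Python dict. The parameter name, URL, and Key Vault references are illustrative rather than taken from Meagan’s post.

    # A REST linked service that takes userName as a runtime parameter
    # (e.g. supplied from the web-activity Key Vault lookup), while the
    # password stays as a Key Vault secret reference.
    rest_linked_service = {
        "name": "LS_RestApi",
        "properties": {
            "type": "RestService",
            "parameters": {
                "userName": {"type": "String"},
            },
            "typeProperties": {
                "url": "https://api.example.com/",
                "authenticationType": "Basic",
                "userName": "@{linkedService().userName}",
                "password": {
                    "type": "AzureKeyVaultSecret",
                    "store": {
                        "referenceName": "LS_KeyVault",
                        "type": "LinkedServiceReference",
                    },
                    "secretName": "rest-api-password",
                },
            },
        },
    }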
