Azure Data Factory (ADF) provides you with a framework for creating data transformation solutions in the Microsoft cloud environment. Recently introduced Template Gallery for ADF pipelines can speed up this development process and provide you with helpful information to create additional activity tasks in your pipelines.
We naturally long to seek if something standard can be further adjusted. That custom design is like ordering a regular pizza and then hitting the “customize” button in order to add a few toppings of our choice. It would be very impressive then to save this customized “creation” for future ordering. And Azure Data Factory has a similar option to save your custom data transformation solutions (pipelines) as templates and reuse them later.
Click through to see how you can do just that.
The OLTP table implements a
rowversioncolumn that is automatically updated whenever a row is updated or inserted. The rowversion number is unique at the database level, and increments monotonically for all transactions that take place within the context of that database. The
dbo.OLTP_Updatestable is used to store the minimum row version available inside the transaction used to copy data from the OLTP table into the OLAP table. Each time this code runs it captures incremental changes. This is far more efficient than comparing all the rows in both tables using a hashing function since this method doesn’t require reading any data other than the source data that is either new, or has changed.
I think this is the first time I’ve seen someone use
ROWVERSION types successfully.
Working with Azure Data Factory (ADF) enables me to build and monitor my Extract Transform Load (ETL) workflows in Azure. My ADF pipelines is a cloud version of previously used ETL projects in SQL Server SSIS.
And prior to this point, all my sample ADF pipelines were developed in so-called “Live Data Factory Mode” using my personal workspace, i.e. all changes had to be published in order to be saved. This hasn’t been the best practice from my side, and I needed to start using a source control tool to preserve and version my development code.
Click through for a detailed demo.
2) Create Databricks Service
Yes you are reading this correctly. Under the hood Data Factory is using Databricks to execute the Data flows, but don’t worry you don’t have to write code.
Create a Databricks Service and choose the right region. This should be the same as your storage region to prevent high data movement costs. As Pricing Tier you can use Standard for this introduction. Creating the service it self doesn’t cost anything.
Joost shows the work you have to do to build out a data flow. This has been a big hole in ADF—yeah, ADF seems more like an ELT tool than an ETL tool but even within that space, there are times when you need to do a bit more than pump-and-dump.
If you are building a big data solution in the cloud, you will likely be landing most of the source data into a data lake. And much of this data will need to be transformed (i.e. cleaned and joined together – the “T” in ETL). Since the data lake is just storage (i.e. Azure Data Lake Storage Gen2 or Azure Blob Storage), you need to pick a product that will be the compute and will do the transformation of the data. There is good news and bad news when it comes to which product to use. The good news is there are a lot of products to choose from. The bad news is there are a lot of products to choose from :-). I’ll try to help your decision-making by talking briefly about most of the Azure choices and the best use cases for each when it comes to transforming data (although some of these products also do the Extract and Load part
The only surprise is the non-mention of Azure Data Lake Analytics, and there is a good conversation in the comments section explaining why.
Today is a great day for Apache Airflow as it graduates from incubating status to a Top-Level Apache project. This is the next step of maturity for Airflow. For those unfamiliar, Airflow is an orchestration tool to schedule and orchestrate your data workflows. From ETL to training of models, or any other arbitrary tasks. Unlike other orchestrators, everything is written in Python, which makes it easy to use for both engineers and scientists. Having everything in code means that it is easy to version and maintain.
Airflow has been getting some hype lately, especially in the AWS space.
This actually happened to me in a previous job. We had a fairly complex ETL solution for the most critical part of our Data Warehouse. Many SSIS packages, views, and stored procedures queried the tables that were replicas of the source system tables. One day, we found out that the source system would be deploying a new version of their database the following day. In every single table, some columns were removed, others added, and many changed data types.
There was no way that we could manually update all our SSIS packages, views, and stored procedures in less than a day. Thousands of users depended on our solution. It was too late to pause the source system changes.
That story ends up with a happy ending.
I enjoy pulling the data out of AdventureWorks. It is a great test harness. What is the quickest way of doing it? Well, everyone knows it is native BCP, but how much faster is that format than tab-delimited or comma-delimited BCP? Can we quickly output data in XML? Is there a way of outputting array-in-array JSON reasonably quickly? Of course, the answer is going to vary from system to system, and across versions, but any data about this is usually welcome.
In addition to these questions, I wanted to know more about how much space these files take up, either raw or zipped. We’re about to find out. We’ll test all that, using good ol’ BCP and SQLCMD.
My motivation for doing this was to explore ways of quickly transferring data to MongoDB. to test out a way of producing array-in-array JSON at a respectable turn of speed. It turned out to be tricky. The easy and obvious ways were slow.
As is usual for Phil, this article is done quite well.
This allows you to detect changes in a lightweight manner via the Transaction Log in SQL Server in combination with T-SQL. Change Data Capture is more about auditing or creating a historical view and Temporal Tables are the next step up from there which became available in 2016 versions of SQL Server. Change Tracking is primarily used for finding only things that have changed. Not necessarily why, how, or who changed it, but what has changed and what it is now.
So why would you want this technology implemented? I find this technology is best suited for tasks where I want as light of a footprint as possible and I want to bring over incremental changes.
Click through for a long and complete walkthrough. If you’re thinking to implement change tracking, this is a good link to check out.
To build a robust BI system, you need to cater for errors and handle errors carefully. If you build a reporting solution that the refresh of that fails everytime an error occurs, it is not a robust system. Errors can happen by many reasons, In this post, I’ll show you a way to catch potential errors in Power Query and how to build an exception report page to visualize the error rows for further investigation. The method that you learn here, will save your model from failing at the time of refresh. Means you get the dataset updated, and you can catch any rows caused the error in an exception report page. To learn more about Power BI, read Power BI book from Rookie to Rock Star.
There’s a lot of work, but also a lot of value in doing that work.