If you are building a big data solution in the cloud, you will likely be landing most of the source data into a data lake. And much of this data will need to be transformed (i.e. cleaned and joined together – the “T” in ETL). Since the data lake is just storage (i.e. Azure Data Lake Storage Gen2 or Azure Blob Storage), you need to pick a product that will be the compute and will do the transformation of the data. There is good news and bad news when it comes to which product to use. The good news is there are a lot of products to choose from. The bad news is there are a lot of products to choose from :-). I’ll try to help your decision-making by talking briefly about most of the Azure choices and the best use cases for each when it comes to transforming data (although some of these products also do the Extract and Load part
The only surprise is the non-mention of Azure Data Lake Analytics, and there is a good conversation in the comments section explaining why.
Today is a great day for Apache Airflow as it graduates from incubating status to a Top-Level Apache project. This is the next step of maturity for Airflow. For those unfamiliar, Airflow is an orchestration tool to schedule and orchestrate your data workflows. From ETL to training of models, or any other arbitrary tasks. Unlike other orchestrators, everything is written in Python, which makes it easy to use for both engineers and scientists. Having everything in code means that it is easy to version and maintain.
Airflow has been getting some hype lately, especially in the AWS space.
This actually happened to me in a previous job. We had a fairly complex ETL solution for the most critical part of our Data Warehouse. Many SSIS packages, views, and stored procedures queried the tables that were replicas of the source system tables. One day, we found out that the source system would be deploying a new version of their database the following day. In every single table, some columns were removed, others added, and many changed data types.
There was no way that we could manually update all our SSIS packages, views, and stored procedures in less than a day. Thousands of users depended on our solution. It was too late to pause the source system changes.
That story ends up with a happy ending.
I enjoy pulling the data out of AdventureWorks. It is a great test harness. What is the quickest way of doing it? Well, everyone knows it is native BCP, but how much faster is that format than tab-delimited or comma-delimited BCP? Can we quickly output data in XML? Is there a way of outputting array-in-array JSON reasonably quickly? Of course, the answer is going to vary from system to system, and across versions, but any data about this is usually welcome.
In addition to these questions, I wanted to know more about how much space these files take up, either raw or zipped. We’re about to find out. We’ll test all that, using good ol’ BCP and SQLCMD.
My motivation for doing this was to explore ways of quickly transferring data to MongoDB. to test out a way of producing array-in-array JSON at a respectable turn of speed. It turned out to be tricky. The easy and obvious ways were slow.
As is usual for Phil, this article is done quite well.
This allows you to detect changes in a lightweight manner via the Transaction Log in SQL Server in combination with T-SQL. Change Data Capture is more about auditing or creating a historical view and Temporal Tables are the next step up from there which became available in 2016 versions of SQL Server. Change Tracking is primarily used for finding only things that have changed. Not necessarily why, how, or who changed it, but what has changed and what it is now.
So why would you want this technology implemented? I find this technology is best suited for tasks where I want as light of a footprint as possible and I want to bring over incremental changes.
Click through for a long and complete walkthrough. If you’re thinking to implement change tracking, this is a good link to check out.
To build a robust BI system, you need to cater for errors and handle errors carefully. If you build a reporting solution that the refresh of that fails everytime an error occurs, it is not a robust system. Errors can happen by many reasons, In this post, I’ll show you a way to catch potential errors in Power Query and how to build an exception report page to visualize the error rows for further investigation. The method that you learn here, will save your model from failing at the time of refresh. Means you get the dataset updated, and you can catch any rows caused the error in an exception report page. To learn more about Power BI, read Power BI book from Rookie to Rock Star.
There’s a lot of work, but also a lot of value in doing that work.
Kafka Connect is modular in nature, providing a very powerful way of handling integration requirements. Some key components include:
- Connectors – the JAR files that define how to integrate with the data store itself
- Converters – handling serialization and deserialization of data
- Transforms – optional in-flight manipulation of messages
One of the more frequent sources of mistakes and misunderstanding around Kafka Connect involves the serialization of data, which Kafka Connect handles using converters. Let’s take a good look at how these work, and illustrate some of the common issues encountered.
Read on for a good overview of the topic.
Chris Webb has started a series on calling Power BI’s REST API from Microsoft Flow. In Part 1, he creates a custom connector:
Playing around with Microsoft Flow recently, I was reminded of the following blog post from a few months ago by Konstantinos Ioannou about using Flow to call the Power BI REST API to refresh a dataset:
I was impressed by this post when I read it, but don’t think I understood quite how many exciting possibilities this technique opens up for Power BI users until I started to use it myself. The Power BI dev team are making a big investment in the API yet most Power BI users, myself included, are not developers and can’t easily write code (or PowerShell scripts) to call the API. With Flow, however, you can use the API without writing any code at all and solve a whole series of common problems easily. In this series of blog posts I’m going to show a few examples of this.
For a while now I’ve had an idea stuck in my head: wouldn’t it be cool to build a Power BI solution where a user could enter data into an Excel workbook and then, as soon as they had done so, they could see their new data in a Power BI report? It would be really useful for planning/budgeting applications and what-if analysis. I had hoped that a DirectQuery model using the CData Excel custom connector (mentioned here) might work but the performance wasn’t good enough; using Flow with the Power BI REST API (see Part 1 of this series for details on how to get this set up) gets me closer to my goal, even if there’s still one major problem with the approach. Here’s how…
Read on for the approach as well as the major problem.
For generating SSIS packages, you need to rely on Biml (much about that can be found on this blog or on the net), or older frameworks such as ezApi. Or you need 3rd party tools such as BimlStudio or TimeXtender. Using Biml means writing XML and .NET. Don’t get me wrong, I love Biml and I use it a lot in my SSIS projects.
But generating transformations in PDI is so much easier. First, you create a template (you create a transformation, but you leave certain fields empty, such as the source SQL statement and the destination table). Then you have another transformation reading metadata. This metadata is pushed to the template using the Metadata Injection Transformation. In this transformation, you point to the template and you map those empty fields to your metadata fields.
It’s interesting to see where each product stands out or falls flat compared to the other, and Koen’s comparison is definitely not a one-sided bout.
So, ADF was incorrectly positioned as “SSIS for the Cloud” and unfortunately once that message made it out there was a messaging problem that Microsoft has been fighting ever since. Like Azure ML, on the glory road to the cloud things that were difficult with SSIS (installation, projects, deployment) became simple, and things that were simple became difficult. Naturally, Microsoft took a lot of criticism from the customers and community, including from your humble correspondent. ADF, or course, has nothing to do with SSIS, thus leaving many data integration practitioners with a difficult choice: should you take the risk and take the road less traveled with ADF, or continue with the tried-and-true SSIS for data integration on Azure?
To Microsoft’s credits, ADF v2 has made significant enhancements in features, usability, and maintainability. There is an also a “lift and shift” option to run SSIS inside ADF but since this architecture requires a VM, I consider it a narrow case scenario, such as when you need to extend ADF with SSIS features that it doesn’t have. Otherwise, why would you start new development with SSIS hosted under ADF, if you could provision and license the VM yourself and have full control over it?
All in all, Teo is not the biggest fan of ADF at this point and leans heavily toward SSIS; read on for the reasoning.