Kafka Connect is modular in nature, providing a very powerful way of handling integration requirements. Some key components include:
- Connectors – the JAR files that define how to integrate with the data store itself
- Converters – handling serialization and deserialization of data
- Transforms – optional in-flight manipulation of messages
One of the more frequent sources of mistakes and misunderstanding around Kafka Connect involves the serialization of data, which Kafka Connect handles using converters. Let’s take a good look at how these work, and illustrate some of the common issues encountered.
Read on for a good overview of the topic.
Chris Webb has started a series on calling Power BI’s REST API from Microsoft Flow. In Part 1, he creates a custom connector:
Playing around with Microsoft Flow recently, I was reminded of the following blog post from a few months ago by Konstantinos Ioannou about using Flow to call the Power BI REST API to refresh a dataset:
I was impressed by this post when I read it, but don’t think I understood quite how many exciting possibilities this technique opens up for Power BI users until I started to use it myself. The Power BI dev team are making a big investment in the API yet most Power BI users, myself included, are not developers and can’t easily write code (or PowerShell scripts) to call the API. With Flow, however, you can use the API without writing any code at all and solve a whole series of common problems easily. In this series of blog posts I’m going to show a few examples of this.
For a while now I’ve had an idea stuck in my head: wouldn’t it be cool to build a Power BI solution where a user could enter data into an Excel workbook and then, as soon as they had done so, they could see their new data in a Power BI report? It would be really useful for planning/budgeting applications and what-if analysis. I had hoped that a DirectQuery model using the CData Excel custom connector (mentioned here) might work but the performance wasn’t good enough; using Flow with the Power BI REST API (see Part 1 of this series for details on how to get this set up) gets me closer to my goal, even if there’s still one major problem with the approach. Here’s how…
Read on for the approach as well as the major problem.
For generating SSIS packages, you need to rely on Biml (much about that can be found on this blog or on the net), or older frameworks such as ezApi. Or you need 3rd party tools such as BimlStudio or TimeXtender. Using Biml means writing XML and .NET. Don’t get me wrong, I love Biml and I use it a lot in my SSIS projects.
But generating transformations in PDI is so much easier. First, you create a template (you create a transformation, but you leave certain fields empty, such as the source SQL statement and the destination table). Then you have another transformation reading metadata. This metadata is pushed to the template using the Metadata Injection Transformation. In this transformation, you point to the template and you map those empty fields to your metadata fields.
It’s interesting to see where each product stands out or falls flat compared to the other, and Koen’s comparison is definitely not a one-sided bout.
So, ADF was incorrectly positioned as “SSIS for the Cloud” and unfortunately once that message made it out there was a messaging problem that Microsoft has been fighting ever since. Like Azure ML, on the glory road to the cloud things that were difficult with SSIS (installation, projects, deployment) became simple, and things that were simple became difficult. Naturally, Microsoft took a lot of criticism from the customers and community, including from your humble correspondent. ADF, or course, has nothing to do with SSIS, thus leaving many data integration practitioners with a difficult choice: should you take the risk and take the road less traveled with ADF, or continue with the tried-and-true SSIS for data integration on Azure?
To Microsoft’s credits, ADF v2 has made significant enhancements in features, usability, and maintainability. There is an also a “lift and shift” option to run SSIS inside ADF but since this architecture requires a VM, I consider it a narrow case scenario, such as when you need to extend ADF with SSIS features that it doesn’t have. Otherwise, why would you start new development with SSIS hosted under ADF, if you could provision and license the VM yourself and have full control over it?
All in all, Teo is not the biggest fan of ADF at this point and leans heavily toward SSIS; read on for the reasoning.
This sounds similar to SSIS precedence constraints, but there are a couple of big differences.
- SSIS allows us to define expressions to be evaluated to determine if the next task should be executed.
- SSIS allows us to choose whether we handle multiple constraints as a logical AND or a logical OR. In other words, do we need all constraints to be true or just one.
ADF V2 activity dependencies are always a logical AND. While we can design control flows in ADF similar to how we might design control flows in SSIS, this is one of several differences. Let’s look at an example.
Meagan gives us three methods of replicating SSIS functionality using ADF V2, so check it out.
In the past we used ETL techniques purely within the data-warehousing and analytic space. But, if one considers why and what ETL is doing, it is actually a lot more applicable as a broader concept.
- Extract: Data is available from a source system
- Transform: We want to filter, cleanse or otherwise enrich this source data
- Load: Make the data available to another application
There are two key concepts here:
- Data is created by an application, and we want it to be available to other applications
- We often want to process the data (for example, cleanse and apply business logic to it) before it is used
Thinking about many applications being built nowadays, particularly in the microservices and event-driven space, we recognize that what they do is take data from one or more systems, manipulate it and then pass it on to another application or system. For example, a fraud detection service will take data from merchant transactions, apply a fraud detection model and write the results to a store such as Elasticsearch for review by an expert. Can you spot the similarity to the above outline? Is this a microservice or ETL process?
Things like this are reason #1 why I expect data platform jobs (administrator and developer) to be around decades from now. The set of tools expand, but the nature of the job remains similar.
The next consideration is a bit more involved if you are new to data integration. Both of these tools excel at transporting data from place to place, but they have important differences in terms of what you can do to modify the data in transit. As a matter of emphasis, ADF has more features geared toward moving the data than performing any complex transformation along the way. SSIS, on the other hand, was built with a large library of transformations that you can chain together to make elaborate data flows including lookups, matching, splitting data, and more.
The tools also overlap quite a lot. In projects this seems to lead to the question of whether you’ll transform the data “in flight” using Extract Transform Load (ETL), or instead move the data to a destination where it’ll be transformed using Extract Load Transform (ELT).
These are not “pretty much the same thing” and Merrill does a good job of explaining what those differences in design mean for the products.
Our main requirement for this new project was to build infrastructure that would be 100 percent self-service for our Data Scientists. In other words, my teammates and I would never be directly involved in the discovery, creation, configuration and management of the event data. Self-service would fix the primary shortcoming of our legacy event delivery system: manual administration that was performed by my team whenever a new dataset was born. This manual process hindered the productivity and access to event data for our Data Scientists. Meanwhile, fulfilling the requests of the Data Scientists hindered our own ability to improve the infrastructure. This scenario is exactly what the Data Platform Team strives to avoid. Building self-service tooling is the number one tenet of the Data Platform Team at Stitch Fix, so whatever we built to replace the old event infrastructure needed to be self-service for our Data Scientists. You can learn more about our philosophy in Jeff Magnusson’s post Engineers Shouldn’t Write ETL.
This is an architectural overview and a good read.
In short, Dataflows integrates data lake and ETL technology directly into Power BI, so anyone with Power Query skills (yes – Power Query is now part of Power BI service and not just Power BI Desktop and is called Power Query online) can create, customize and manage data within their Power BI experience (think of it as self-service data prep). Dataflows include a standard schema, called the Common Data Model (CDM), that contains the most common business entities across the major functions such as marketing, sales, service, finance, along with connectors that ingest data from the most common sources into these schemas. This greatly simplifies modeling and integration challenges (it prevents multiple metadata/definition on the same data). You can also extend the CDM by creating custom entities. Lastly – Microsoft and their partners will be shipping out-of-the-box applications that run on Power BI that populate data in the Common Data Model and deliver insights through Power BI.
A dataflow is not just the data itself, but also logic on how the data is manipulated. Dataflows belong to the Data Warehouse/Mart/Lake family. Its main job is to aggregate, cleanse, transform, integrate and harmonize data from a large and growing set of supported on-premises and cloud-based data sources including Dynamics 365, Salesforce, Azure SQL Database, Excel, SharePoint. Dataflows hold a collection of data-lake stored entities (i.e. tables) which are stored in internal Power BI Common Data Model compliant folders in Azure Data Lake Storage Gen2.
Also check out the comments for some clarification on why you’d want to use Dataflows rather than doing the work directly in the data lake.
With ADF V2 you get a browser-based interface using drag and drop technology; V1 was primarily done in the Visual Studio IDE. It also added triggers for scheduling, so you can schedule your jobs when required and in additional ways (which I’ll discuss further in a bit).
Some other features of ADF V2 that came out as it became generally available:
Lift and Shift operations for your SSIS packages, so if you have SSIS packages local, you can now Lift and Shift those into compute with the integration runtime service in Data Factory.
This also allows for cloud to cloud, cloud to prem, prem to prem and some third-party tools are supported within that as well.
Control flow activities like link branching, looping, conditional execution and parameterization.
Integration with HD Spark and Databricks for big data workloads and data science.
There’s more where that came from, too.