The solution picks up the SQL data changes from the CDC Change Tracking system tables, creates JSON messages from the change rows, and then posts the message to an Azure Event Hub. Once landed in the Event Hub an Azure Stream Analytics (ASA) Job distributes the changes into the multiple outputs.
What I found pretty cool was that I could transmit SQL delta changes from source to target in as little as 5 seconds end to end!
There are a bunch of steps, but the end result is worth it.
Let’s consider an application that does some real-time stateful stream processing with the Kafka Streams API. We’ll run through a specific example of the end-to-end reference architecture and show you how to:
Run a Kafka source connector to read data from another system (a SQLite3 database), then modify the data in-flight using Single Message Transforms (SMTs) before writing it to the Kafka cluster
Process and enrich the data from a Java application using the Kafka Streams API (e.g. count and sum)
Run a Kafka sink connector to write data from the Kafka cluster to another system (AWS S3)
Read the whole thing.
Big data comes in a variety of shapes. The Extract-Transform-Load (ETL) workflows are more or less stripe-shaped (left panel in the figure above) and produce an output of a similar size to the input. Reporting workflows are funnel-shaped (middle panel in the figure above) and progressively reduce the data size by filtering and aggregating.
However, a wide class of problems in analytics, relevance, and graph processing have a rather curious shape of widening in the middle before slimming down (right panel in the figure above). It gets worse before it gets better.
In this article, we take a deeper dive into this exploding middle shape: understanding why it happens, why it’s a problem, and what can we do about it. We share our experiences of real-life workflows from a spectrum of fields, including Analytics (A/B experimentation), Relevance (user-item feature scoring), and Graph (second degree network/friends-of-friends).
The examples relate directly to Hadoop, but are applicable in other data platforms as well.
So this all sounds very promising as a way of tracking changes to our Data Warehouse data, for purposes such as extracting deltas, inserts and updates to Type I and II dimensions and so forth. It doesn’t have any show-stopping overhead for the hashing operations for the sizes of data typically encountered and storage isn’t going to be an issue. It is native to T-SQL so we can rerun our hash value generation in the engine where our data resides rather than having to push through SSIS or some other tool to generate this for us. Algorithms are universal and as such will give us the same values wherever used for the same bytes of input. Let’s go back to the basic idea for a minute and consider how we implement this.
This is particularly useful in cases where you have metadata columns you don’t much care about (e.g., last modified time). I do recommend using CONCAT or CONCAT_WS (if you’re on SQL Server 2017) to do string concatenation, though; it’d remove the need for util.CastAsNVarchar and possibly more.
This post talks about the issue I ran into with SSIS Mapping Files.
We currently run DB2 on an IBM iSeries AS400 for our ERP system. I was tasked with copying data from the AS400 to a SQL Server database for some consultants to use. The C-Suite didn’t want to give the consultants access to our AS400, so this was the work around that was put forth and accepted (and no, no one asked me before I was “voluntold” for the task). Since this would essentially be a “one-time” thing, I chose to use the Import Export Wizard, but I would save the package just in case they wanted this process repeated.
Sounds like it was a painful experience, but it does have a happy ending.
Once the Script Component has been defined as a source, the output columns can be defined. For this post, the same USGS Earthquake data that was used in the “Download JSON data with PowerShell” post will serve as an example. Be careful to choose the correct data types here. This can be tedious because you have to choose the correct data types in the C# code as well, and ensure that they correspond with the SSIS types. It’s helpful to bookmark a SSIS data type translation table for reference.
It does involve creating a script component, but aside from the tedium that Chris mentions, it’s not too bad.
I prefer to passively gather performance metrics – even if it’s a little convoluted and more work (for me). I don’t often need these metrics immediately available, so I execute queries similar to the query below and store the performance metrics in a table. I can (and do) build dashboards to track SSIS performance (perhaps I should blog about them…) using this passively-collected performance metrics.
As with Lookup Transformation messages, OLE DB Destinations in Data Flow Tasks record specific messages using the same format every time. Because of this (hard-coded) consistency, you and I can passively collect the number of rows written while executing packages in the SSIS Catalog using the (default) Basic logging level. We can use the following Transact-SQL query to collect this execution metadata post-execution:
Click through for the script.
As Jen points out earlier in her Analytics Market Commoditization and Consolidation post (you should read it all – it’s awesome – like all of Jen’s posts!) many analytics solution providers share the “Same look, same marketing story, same saves time and allows users [to] avoid evil IT.”
I can hear some of you thinking, “Are you telling us analytics doesn’t work, Andy?” Goodness no. I’m telling you hype and sales strategy work in the analytics market as well as anywhere. When asked why a solution may not perform to expectations, the #1 response is “your data is not clean.”
Data engineering (think ETL specifically designed for analytics and “big data”) is the backbone behind data science. To Andy’s point, the data engineer’s job is to get clean, context-heavy data in front of a data scientist, the same way a “classical” Business Intelligence specialist works with analysts.
Customers coordinates: a flat file containing x,y coordinates for every customer.
Municipalities in Austria: a shape file with multi-polygons defining the municipalities areas in Austria: source
The goal was to “look-up” the coordinates in the shape file in order to get the municipality code GKZ which in german stand for “Gemeindekennzahl”.
Check out the demo.
Sometimes, you need to backup your current running flow, let that flow run at a later date, or make a backup of what is in-process now. You want this in a permanent storage and want to reconstitute it later like orange juice and add it back into the flow or restart it.
This could be due to failures, for integration testing, for testing new versions of components, as a checkpoint, or for many other purposes. You don’t always want to reprocess the original source or files (they may be gone).
Read on for an explanation of how FlowFile streams can do this.