Press "Enter" to skip to content

Category: ETL / ELT

Change Detection With Hashes

Nigel Meakins shows how to use HashBytes to roll your own change detection:

So this all sounds very promising as a way of tracking changes to our Data Warehouse data, for purposes such as extracting deltas, inserts and updates to Type I and II dimensions and so forth. It doesn’t have any show-stopping overhead for the hashing operations for the sizes of data typically encountered and storage isn’t going to be an issue. It is native to T-SQL so we can rerun our hash value generation in the engine where our data resides rather than having to push through SSIS or some other tool to generate this for us. Algorithms are universal and as such will give us the same values wherever used for the same bytes of input. Let’s go back to the basic idea for a minute and consider how we implement this.

This is particularly useful in cases where you have metadata columns you don’t much care about (e.g., last modified time). I do recommend using CONCAT or CONCAT_WS (if you’re on SQL Server 2017 or later) for the string concatenation, though; it’d remove the need for util.CastAsNVarchar and possibly other helper code.
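
To make that concrete, here’s a minimal sketch of the idea: hash only the business columns and leave the metadata columns out of the comparison. Table and column names are made up for illustration and aren’t from Nigel’s post.

```sql
/* Hash only the columns whose changes we care about.  CONCAT_WS
   (SQL Server 2017+) implicitly converts each argument to a string,
   so no explicit casting helper is needed. */
SELECT
    c.CustomerKey,
    HASHBYTES(
        'SHA2_256',
        CONCAT_WS(N'||', c.FirstName, c.LastName, c.Email, c.City)
    ) AS RowHash   -- compare against the hash stored on the dimension row
FROM dbo.Customer AS c;
```

One caveat: CONCAT_WS skips NULL arguments entirely (no separator is emitted for them), so wrap nullable columns in ISNULL if the difference between NULL and an empty string matters to your change detection.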

Mapping File Shenanigans With The Import/Export Wizard

Angela Henry ran into problems copying a boatload of data from a DB2 server hosted on an IBM iSeries and has lived to tell the tale:

This post talks about the issue I ran into with SSIS Mapping Files.

We currently run DB2 on an IBM iSeries AS400 for our ERP system. I was tasked with copying data from the AS400 to a SQL Server database for some consultants to use. The C-Suite didn’t want to give the consultants access to our AS400, so this was the workaround that was put forth and accepted (and no, no one asked me before I was “voluntold” for the task). Since this would essentially be a “one-time” thing, I chose to use the Import Export Wizard, but I would save the package just in case they wanted this process repeated.

Sounds like it was a painful experience, but it does have a happy ending.

JSON Data Sources In SSIS

Chris Koester shows how to read JSON data sources in SQL Server Integration Services:

Once the Script Component has been defined as a source, the output columns can be defined. For this post, the same USGS Earthquake data that was used in the “Download JSON data with PowerShell” post will serve as an example. Be careful to choose the correct data types here. This can be tedious because you have to choose the correct data types in the C# code as well, and ensure that they correspond with the SSIS types. It’s helpful to bookmark a SSIS data type translation table for reference.

It does involve creating a script component, but aside from the tedium that Chris mentions, it’s not too bad.
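
If you’d rather land the raw JSON somewhere the database engine can see it and shred it in T-SQL instead of in a script component, OPENJSON (SQL Server 2016+) is the alternative worth knowing. Here’s a minimal sketch with an explicit schema; the JSON below is a simplified, made-up slice of the USGS feed, not Chris’s exact columns, and the same “pick the right data types” concern applies to the WITH clause.

```sql
/* Minimal sketch: shred a JSON document with OPENJSON and an explicit
   schema.  The document is a simplified stand-in for the USGS feed. */
DECLARE @json NVARCHAR(MAX) = N'{
  "features": [
    { "properties": { "mag": 4.7, "place": "Alaska Peninsula", "time": 1508200000000 } }
  ]
}';

SELECT f.mag, f.place, f.[time]
FROM OPENJSON(@json, '$.features')
     WITH (
         mag    FLOAT          '$.properties.mag',
         place  NVARCHAR(200)  '$.properties.place',
         [time] BIGINT         '$.properties.time'   -- epoch milliseconds in the USGS feed
     ) AS f;
```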

Collecting Rows Written From SSISDB

Andy Leonard has a script showing how to get the number of records written to an OLEDB destination, as reflected in the SSIS Catalog:

I prefer to passively gather performance metrics – even if it’s a little convoluted and more work (for me). I don’t often need these metrics immediately available, so I execute queries similar to the query below and store the performance metrics in a table. I can (and do) build dashboards to track SSIS performance (perhaps I should blog about them…) using these passively collected performance metrics.

As with Lookup Transformation messages, OLE DB Destinations in Data Flow Tasks record specific messages using the same format every time. Because of this (hard-coded) consistency, you and I can passively collect the number of rows written while executing packages in the SSIS Catalog using the (default) Basic logging level. We can use the following Transact-SQL query to collect this execution metadata post-execution:

Click through for the script.
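
If you just want a feel for the shape of such a query before clicking through, here is a minimal sketch against the SSISDB catalog views. It is not Andy’s script; the package name and LIKE pattern are placeholders you would adjust to your own environment.

```sql
/* Minimal sketch: find the "wrote N rows" informational messages that
   OLE DB Destinations log under the Basic logging level. */
SELECT
    e.execution_id,
    e.package_name,
    em.message_source_name,   -- the component / task that logged the message
    em.message_time,
    em.message                -- e.g. '... wrote 12345 rows.'
FROM SSISDB.catalog.executions AS e
JOIN SSISDB.catalog.event_messages AS em
    ON em.operation_id = e.execution_id
WHERE e.package_name = N'LoadStaging.dtsx'   -- placeholder package name
  AND em.message_type = 70                   -- informational messages
  AND em.message LIKE N'%wrote%rows%'
ORDER BY em.message_time DESC;
```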

The Continued Importance Of ETL

Andy Leonard explains that good old ETL remains vital to an organization:

A Problem

As Jen points out earlier in her Analytics Market Commoditization and Consolidation post (you should read it all – it’s awesome – like all of Jen’s posts!) many analytics solution providers share the “Same look, same marketing story, same saves time and allows users [to] avoid evil IT.”

I can hear some of you thinking, “Are you telling us analytics doesn’t work, Andy?” Goodness no. I’m telling you hype and sales strategy work in the analytics market as well as anywhere. When asked why a solution may not perform to expectations, the #1 response is “your data is not clean.”

Data engineering (think ETL specifically designed for analytics and “big data”) is the backbone behind data science.  To Andy’s point, the data engineer’s job is to get clean, context-heavy data in front of a data scientist, the same way a “classical” Business Intelligence specialist works with analysts.

Using Talend To Build Shape Files

Paul Hernandez has a demo where he uses Talend to match latitude and longitude pairs against a shape file:

Input data

Customer coordinates: a flat file containing x,y coordinates for every customer.

Municipalities in Austria: a shape file with multi-polygons defining the municipality areas in Austria (source).

Goal

The goal was to “look up” the coordinates in the shape file in order to get the municipality code GKZ, which is short for the German “Gemeindekennzahl”.

Check out the demo.
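
If Talend isn’t in your toolbox, the same point-in-polygon lookup can be sketched with SQL Server’s spatial types, assuming the municipality polygons have already been imported into a table. Table and column names below are made up.

```sql
/* Minimal sketch: assign each customer the municipality code (GKZ) of the
   polygon containing its coordinates.  Build the point with the same SRID
   as the polygon data, or STContains will not return matches. */
SELECT
    c.CustomerId,
    m.GKZ                      -- Gemeindekennzahl (municipality code)
FROM dbo.CustomerCoordinates AS c
JOIN dbo.AustrianMunicipalities AS m
    ON m.Boundary.STContains(geometry::Point(c.X, c.Y, 4326)) = 1;
```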

FlowFile Continuation In NiFi

Tim Spann describes one of the more powerful features of Apache NiFi:

Sometimes, you need to back up your current running flow, let that flow run at a later date, or make a backup of what is in-process now. You want this in permanent storage and want to reconstitute it later, like orange juice, and add it back into the flow or restart it.

This could be due to failures, for integration testing, for testing new versions of components, as a checkpoint, or for many other purposes. You don’t always want to reprocess the original source or files (they may be gone).

Read on for an explanation of how FlowFile streams can do this.

Moving Files In Azure Data Factory

Meagan Longoria has a workaround for the fact that you cannot move a file using Azure Data Factory:

But at this time ADF doesn’t support that. You can copy a file with a copy activity, but you cannot actually move (i.e., copy and delete).

Luckily, we had a workaround for our situation. If you tell ADF to copy data to a file that already exists in the specified location in the data lake, it will overwrite the existing file. We made sure the file name is always the same for each table in the staging area so there is always only one file per table.

Read on for the full details on this workaround.  Also, vote on this feedback item if you want the ability to move files instead of just copying them.

Shredding Excel With R

John MacKintosh shows how to use R for wrangling + ETL:

I had over 140 files to process. That’s not usually a big deal – I normally use SQL Server Integration Services to loop through network folders, connect to hundreds of spreadsheets and extract the source data.

But this relies on the data being in a tabular format (like a dataframe or database table).

A quick glance at the first few sheets confirmed I could not use this approach – the data was not in tabular format. Instead it was laid out in a format suited to viewing the data on screen – with the required data scattered in different ranges throughout each sheet (over 100 rows and many columns). It wasn’t going to be feasible to point SSIS at different locations within each sheet. (It can be done, but it’s pretty complex and I didn’t have time to experiment).

The other challenge was that over time, changes to design meant that data moved location e.g. dates that were originally in cell C2 moved to D7, then moved again as requirements evolved. There were 14 different templates in all, each with subtle changes. Each template was going to need a custom solution to extract the data.

This is a good look at how R can be about more than “just” statistical analysis.
