Change Detection With Hashes

Kevin Feasel

2017-06-14

ETL

Nigel Meakins shows how to use HashBytes to roll your own change detection:

So this all sounds very promising as a way of tracking changes to our Data Warehouse data, for purposes such as extracting deltas, inserts and updates to Type I and II dimensions and so forth. It doesn’t have any show-stopping overhead for the hashing operations for the sizes of data typically encountered and storage isn’t going to be an issue. It is native to T-SQL so we can rerun our hash value generation in the engine where our data resides rather than having to push through SSIS or some other tool to generate this for us. Algorithms are universal and as such will give us the same values wherever used for the same bytes of input. Let’s go back to the basic idea for a minute and consider how we implement this.

This is particularly useful in cases where you have metadata columns you don’t much care about (e.g., last modified time).  I do recommend using CONCAT or CONCAT_WS (if you’re on SQL Server 2017) to do string concatenation, though; it’d remove the need for util.CastAsNVarchar and possibly more.

Related Posts

An Apache Sqoop Tutorial

Subham Sinha has an introductory-level tutorial on Apache Sqoop: For Hadoop developer, the actual game starts after the data is being loaded in HDFS. They play around this data in order to gain various insights hidden in the data stored in HDFS. So, for this analysis the data residing in the relational database management systems […]

Read More

Handling Late Arrivals In SSIS

Peter Schott shows us a pattern for dealing with late-arriving dimension members in SQL Server Integration Services ETL packages: The general steps are Set up your source query. Pass the data through a Lookup for your Dimension with the missing results routed to a “No Match” output. Insert those “No Match” rows into your Dimension […]

Read More

Categories

June 2017
MTWTFSS
« May Jul »
 1234
567891011
12131415161718
19202122232425
2627282930