Diagnosing Duplicate Records

Kevin Feasel

2016-08-22

ETL

Jesse Seymour walks through his process of finding and fixing unexpected duplicate key violations:

In this case, the error message is quite clear.  There is more than one row in the source (staging) that matches a single row in the target (data warehouse).  When we are warehousing data, we setup key fields that allow us to match up a record in staging to a record in the data warehouse.  In most systems, you can use the source system’s primary key to accomplish this.  After all, most systems use a RDBMS of some sort to store data.  However, in this case the source data is from a SharePoint list, and the only source key available is a list item ID.

So why are we not using that?  There is a very simple answer and that is because end users delete old data from the list, which can lead to a recycling of ID values from SharePoint.  If an ID gets recycled, then the data warehouse will improperly overwrite data in the fact table or discard the new row as a duplicate depending on how we configure the extract routine.

Figuring out the cause of the problem is a multi-step process, as Jesse shows.

Related Posts

Tips For Managing Business Logic

Tim Mitchell has a few tips for managing business logic, particularly when building ETL processes: First things first: let’s get on the same page with what is meant by business logic. When I refer to business logic (also commonly referred to as business rules), I’m talking about the processing rules that are used to transform an […]

Read More

Azure Data Factory v2 And Decompression

Kevin Feasel

2018-04-23

Bugs, Cloud, ETL

Ben Jarvis notes a file naming bug with Azure Data Factory v2 when decompressing files: ADF V2 natively supports decompression of files as documented at https://docs.microsoft.com/en-us/azure/data-factory/supported-file-formats-and-compression-codecs#compression-support. With this functionality ADF should change the extension of the file when it is decompressed so 1234_567.csv.gz would become 1234_567.csv however, I’ve noticed that this doesn’t happen in all cases. […]

Read More

Categories

August 2016
MTWTFSS
« Jul Sep »
1234567
891011121314
15161718192021
22232425262728
293031