Diagnosing Duplicate Records

Jesse Seymour walks through his process of finding and fixing unexpected duplicate key violations:

In this case, the error message is quite clear. There is more than one row in the source (staging) that matches a single row in the target (data warehouse). When we are warehousing data, we set up key fields that allow us to match up a record in staging to a record in the data warehouse. In most systems, you can use the source system’s primary key to accomplish this. After all, most systems use an RDBMS of some sort to store data. However, in this case the source data is from a SharePoint list, and the only source key available is a list item ID.

So why are we not using that? The answer is simple: end users delete old data from the list, which can lead to a recycling of ID values from SharePoint. If an ID gets recycled, then the data warehouse will either improperly overwrite data in the fact table or discard the new row as a duplicate, depending on how we configure the extract routine.

Figuring out the cause of the problem is a multi-step process, as Jesse shows.
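
A natural first check in this kind of diagnosis is to see which key values occur more than once in staging. The following is only a minimal sketch of that idea in Python, not Jesse's method: the function name `find_duplicate_keys`, the field names `site` and `reported_on`, and the sample rows are all hypothetical, and staging is assumed to be already loaded as a list of dictionaries.

```python
from collections import Counter


def find_duplicate_keys(staging_rows, key_fields):
    """Return {key_tuple: count} for every key that occurs more than once.

    staging_rows: iterable of dict-like rows (assumed already loaded from staging).
    key_fields:   the fields used to match a staging row to a warehouse row.
    """
    counts = Counter(tuple(row[field] for field in key_fields) for row in staging_rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical staging rows; the field names are illustrative only.
staging = [
    {"site": "A", "reported_on": "2020-01-01", "value": 10},
    {"site": "A", "reported_on": "2020-01-01", "value": 12},
    {"site": "B", "reported_on": "2020-01-01", "value": 7},
]

# Keys that would match a single target row more than once.
duplicates = find_duplicate_keys(staging, ["site", "reported_on"])
for key, n in duplicates.items():
    print(f"key {key} appears {n} times in staging")
```

Any key this reports would match the same warehouse row more than once, which is the condition the error message describes. Whether the fix is to clean the source data or to widen the key depends on why the duplicates exist.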
