Diagnosing Duplicate Records

Kevin Feasel

2016-08-22

ETL

Jesse Seymour walks through his process of finding and fixing unexpected duplicate key violations:

In this case, the error message is quite clear.  There is more than one row in the source (staging) that matches a single row in the target (data warehouse).  When we are warehousing data, we setup key fields that allow us to match up a record in staging to a record in the data warehouse.  In most systems, you can use the source system’s primary key to accomplish this.  After all, most systems use a RDBMS of some sort to store data.  However, in this case the source data is from a SharePoint list, and the only source key available is a list item ID.

So why are we not using that?  There is a very simple answer and that is because end users delete old data from the list, which can lead to a recycling of ID values from SharePoint.  If an ID gets recycled, then the data warehouse will improperly overwrite data in the fact table or discard the new row as a duplicate depending on how we configure the extract routine.

Figuring out the cause of the problem is a multi-step process, as Jesse shows.

Related Posts

When Data Factory Flows Don’t

Kevin Feasel

2017-12-11

Cloud, ETL

Emma Stewart points out an issue that might vex newcomers to Azure Data Factory: The data within the Data Lake store was organised into a Year and Month hierarchy for the folders, and each days transactions were stored in a file which was named after the day within the relevant month folder. The task then […]

Read More

An Apache Sqoop Tutorial

Kevin Feasel

2017-11-22

ETL, Hadoop

Subham Sinha has an introductory-level tutorial on Apache Sqoop: For Hadoop developer, the actual game starts after the data is being loaded in HDFS. They play around this data in order to gain various insights hidden in the data stored in HDFS. So, for this analysis the data residing in the relational database management systems […]

Read More

Categories

August 2016
MTWTFSS
« Jul Sep »
1234567
891011121314
15161718192021
22232425262728
293031