Stacia Varga has a post covering some of the yeoman’s work of data cleansing:
For now, Power BI continues to be my tool of choice for my project. My goals for today’s post are two-fold: 1) finish my work to address missing venues in the games table and 2) investigate the remaining anomalies in the games and scores tables as I noted in my last post.
To recap, I noted the following data values that warranted further investigation:
- Total Goals minimum of 0 seems odd – because hockey games do not end in ties. I would expect a minimum of 1 so I need to determine why this number is appearing.
- Total Goals maximum of 29 seems high – it implies that either one team really smoked the opposing team or that both teams scored highly. I’d like to see what those games look like and validate the accuracy.
- Record Losses minimum of 0 seems odd also – that means at least one team has never had a losing season?
- Similarly, Record Wins minimum of 0 means one team has never won?
- Record OT minimum of 0 – I’m not sure how to interpret. I need to look.
- Score minimum of 0 seems to imply the same thing as Total Goals minimum of 0, which I have already noted seems odd.
This is the kind of stuff that we talk about as taking 80-95% of a data science team’s time. It’s all about finding “weird” looking values, investigating those values, and determining whether the input data really was correct or if there was an issue.
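To make that loop concrete, here is a minimal sketch in Python rather than Power BI. The sample rows, the column names (`game_id`, `home_score`, `away_score`, `total_goals`), and the expected range of 1 to 20 goals are all assumptions made for illustration; none of them come from Stacia's data set. The idea is simply to profile a column's minimum and maximum, then pull the rows that fall outside what you expect so you can inspect them by hand.

```python
# Hypothetical sample rows; the schema is an assumption, not the post's actual tables.
games = [
    {"game_id": 1, "home_score": 3, "away_score": 2},
    {"game_id": 2, "home_score": 0, "away_score": 0},    # no goals at all
    {"game_id": 3, "home_score": 16, "away_score": 13},  # 29 total goals
    {"game_id": 4, "home_score": 1, "away_score": 0},
]


def profile(rows, column):
    """Return the (minimum, maximum) of one numeric column."""
    values = [row[column] for row in rows]
    return min(values), max(values)


def flag_outside(rows, column, low, high):
    """Return the rows whose value falls outside the inclusive range [low, high]."""
    return [row for row in rows if not low <= row[column] <= high]


# Derive the total goals per game from the two score columns.
for row in games:
    row["total_goals"] = row["home_score"] + row["away_score"]

lowest, highest = profile(games, "total_goals")

# The 1..20 range is an assumed sanity bound, chosen only for this example.
suspects = flag_outside(games, "total_goals", low=1, high=20)
suspect_ids = [row["game_id"] for row in suspects]

print(f"total_goals ranges from {lowest} to {highest}")
print(f"games to investigate: {suspect_ids}")
```

The flagged rows are only candidates. Whether a 0 or a 29 is a data entry error or a legitimate result still has to be checked against the source, which is where most of the time goes.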