Data Quality

Kevin Feasel



Milind Paradkar discusses clean data:

We decided to do a quick check and took a sample of 143 stocks listed on the National Stock Exchange of India Ltd (NSE). For these stocks, we downloaded the 1-minute intraday data for the period 1/08/2016 – 19/08/2016. The aim was to check whether Google finance captured every 1-minute bar during this period for each of the 143 stocks.

NSE’s trading session starts at 9:15 am and ends at 15:30 IST, thus comprising 375 minutes. For 14 trading sessions, we should have 5250 data points for each of these stocks. We wrote a simple piece of code in R to perform the check.

I like this post because it exposes a data quality issue people don’t tend to think about very often: the case where all of the data you have is legitimate and correctly structured, but there are gaps in the available data set. This is one of many data quality problems you’ll run into, so it’s worth having a plan in place for when you hit this scenario. A sketch of this kind of completeness check appears below.
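To make the idea concrete, here is a minimal sketch of that sort of bar-count check in R. It is not the authors' original code; it assumes you have already downloaded the 1-minute bars into a hypothetical data frame named bars with columns symbol and timestamp.

```r
# Sketch of a completeness check on downloaded 1-minute bars.
# Assumes a data frame `bars` with columns `symbol` and `timestamp`
# (both hypothetical names; adjust to match your download).

expected_bars <- 375 * 14   # 375 minutes per session x 14 sessions = 5250

# Count the bars actually captured for each stock
counts <- aggregate(timestamp ~ symbol, data = bars, FUN = length)
names(counts)[2] <- "bars_captured"

# Flag stocks that are missing one or more 1-minute bars
counts$missing <- expected_bars - counts$bars_captured
incomplete <- counts[counts$missing > 0, ]

# Stocks sorted by how many bars are absent
incomplete[order(-incomplete$missing), ]
```

Any stock appearing in the final result has gaps in its intraday series, which is exactly the kind of silent data quality problem the post describes.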
