Milind Paradkar discusses clean data:
We decided to do a quick check and took a sample of 143 stocks listed on the National Stock Exchange of India Ltd (NSE). For these stocks, we downloaded the 1-minute intraday data for the period 1/08/2016 – 19/08/2016. The aim was to check whether Google Finance captured every 1-minute bar during this period for each of the 143 stocks.
NSE’s trading session starts at 9:15 am and ends at 3:30 pm IST, thus comprising 375 minutes. For 14 trading sessions, we should have 5250 data points for each of these stocks. We wrote some simple code in R to perform the check.
I like this post because it exposes a data quality issue people don’t tend to think about very often: every record is legitimate and correctly structured, but there are gaps in the available data set. This is one of many data quality problems you’ll run into, so it’s worth having a plan in place for when you hit this scenario.
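To make the completeness check concrete, here is a minimal sketch of the idea in Python. It is not the R code from the original post, which isn't reproduced here. It assumes each 1-minute bar is stamped with the minute it starts (09:15 through 15:29, giving 375 bars per session); the names expected_bars and missing_bars are hypothetical, and the observed data is a toy example rather than real NSE data.

```python
from datetime import date, datetime, time, timedelta

# Assumption: bars are stamped at the start of each minute, so a session
# running 09:15-15:30 IST yields 375 bars (09:15 through 15:29).
SESSION_OPEN = time(9, 15)
BARS_PER_SESSION = 375


def expected_bars(session_days):
    """Return the set of every 1-minute bar timestamp we expect to see."""
    expected = set()
    for day in session_days:
        start = datetime.combine(day, SESSION_OPEN)
        for i in range(BARS_PER_SESSION):
            expected.add(start + timedelta(minutes=i))
    return expected


def missing_bars(observed, session_days):
    """Return the sorted expected timestamps that are absent from `observed`."""
    return sorted(expected_bars(session_days) - set(observed))


# Toy example: one session with two bars deliberately dropped.
day = date(2016, 8, 1)
dropped = {datetime(2016, 8, 1, 10, 0), datetime(2016, 8, 1, 14, 30)}
observed = expected_bars([day]) - dropped

gaps = missing_bars(observed, [day])
print(f"{len(gaps)} of {BARS_PER_SESSION} bars missing:", gaps)
```

Building the full expected grid and taking a set difference reports which minutes are missing, not just how many, which helps when deciding whether to fill, interpolate, or discard a gap. In practice you would pass the exchange's actual trading calendar as session_days so that holidays aren't counted as gaps.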