Data Quality

Kevin Feasel

2016-09-02

R

Milind Paradkar discusses clean data:

We decided to do a quick check and took a sample of 143 stocks listed on the National Stock Exchange of India Ltd (NSE). For these stocks, we downloaded the 1-minute intraday data for the period 1/08/2016 – 19/08/2016. The aim was to check whether Google finance captured every 1-minute bar during this period for each of the 143 stocks.

NSE’s trading session starts at 9:15 am and ends at 15:30 pm IST, thus comprising of 375 minutes. For 14 trading sessions, we should have 5250 data points for each of these stocks. We wrote a simple code in R to perform the check.

I like this post because it exposes a data quality issue people don’t tend to think about very often:  when all of the data is legitimate and correctly-structured, but there are gaps in the available data set.  This is one of many data quality problems you’ll run into, so it may be important to have a plan in place in case you hit this scenario.

Related Posts

Biases in Tree-Based Models

Nina Zumel looks at tree-based ensembling models like random forest and gradient boost and shows that they can be biased: In our previous article , we showed that generalized linear models are unbiased, or calibrated: they preserve the conditional expectations and rollups of the training data. A calibrated model is important in many applications, particularly when financial data […]

Read More

R 3.6.1 Available

David Smith notes a new version of R is available: On July 5, the R Core Group released the source code for the latest update to R, R 3.6.1, and binaries are now available to download for Windows, Linux and Mac from your local CRAN mirror. R 3.6.1 is a minor update to R that fixes a few bugs. […]

Read More

Categories

September 2016
MTWTFSS
« Aug Oct »
 1234
567891011
12131415161718
19202122232425
2627282930