Sampling and Estimating Rare Events

Yi Liu takes us through a process to estimate rare events:

Naturally, we get an unbiased estimate of the overall prevalence of violation if we sample the videos uniformly from the population and have them reviewed by human raters to estimate the proportion of violating videos. We also get an unbiased estimate of the violation rate in each policy vertical. But given the low probability of violation and wanting to use our rater capacity wisely, this is not an adequate solution — we typically have too few positive labels in uniform samples to achieve an accurate estimate of the prevalence, especially for those sensitive policy verticals. To obtain a relative error of no more than 20%, we need roughly 100 positive labels, and more often than not, we have zero violation videos in the uniform samples for rarer policies.

This is similar in nature to testing for rare diseases, where a random sample of N people from the population is likely to turn up zero cases.
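To see where a figure like "roughly 100 positive labels" can come from, here is a minimal sketch. It assumes a 95% confidence level (z = 1.96) and the normal approximation to the binomial; the excerpt does not state either, so treat both as my assumptions. The function names and the 0.1% prevalence in the usage note are hypothetical, chosen only for illustration.

```python
import math


def positives_needed(rel_error, z=1.96):
    """Expected number of positive labels so that the z-level margin of
    error is at most rel_error * p, using the rare-event approximation
    (1 - p) ~= 1. Assumes a normal approximation to the binomial."""
    return math.ceil((z / rel_error) ** 2)


def sample_size_needed(prevalence, rel_error, z=1.96):
    """Uniform sample size n so that z * sqrt((1 - p) / (n * p)) <= rel_error."""
    return math.ceil((z / rel_error) ** 2 * (1 - prevalence) / prevalence)


def prob_zero_positives(prevalence, n):
    """Probability that a uniform sample of size n contains no positives."""
    return (1 - prevalence) ** n


if __name__ == "__main__":
    p = 0.001  # hypothetical prevalence, not a figure from the post
    print(positives_needed(0.20))
    print(sample_size_needed(p, 0.20))
    print(round(prob_zero_positives(p, 1000), 3))
```

Under these assumptions, a 20% relative error works out to just under 100 expected positives, which lines up with the excerpt. At a hypothetical prevalence of 0.1%, that means a uniform sample of nearly 96,000 items, while a sample of 1,000 has better than a one-in-three chance of containing no positives at all.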
