Naturally, we get an unbiased estimate of the overall prevalence of violation if we sample the videos uniformly from the population and have them reviewed by human raters to estimate the proportion of violating videos. We also get an unbiased estimate of the violation rate in each policy vertical. But given the low probability of violation and wanting to use our rater capacity wisely, this is not an adequate solution — we typically have too few positive labels in uniform samples to achieve an accurate estimate of the prevalence, especially for those sensitive policy verticals. To obtain a relative error of no more than 20%, we need roughly 100 positive labels, and more often than not, we have zero violation videos in the uniform samples for rarer policies.
This is similar in nature to testing for rare diseases, where a random sample of N people in the population is likely to turn up 0 cases of it.