Sampling and Inconsistent Result Counts

Kevin Wilkie does the math:

One of the things you may have noticed after reading our last post on Top (found here) is that sometimes SAMPLE doesn’t give the answer you want.

For example, we can run the same query to get 20% of the table. Remember that this table has 290 rows in total.

After seeing two runs return 69 and then 50 rows, respectively, Kevin digs in and finds out why. This got me thinking about whether a one-pass scan that assigns each row a value drawn from a uniform distribution (which sounds like what is happening here) would be faster than random sampling without replacement over an array of 8-byte pointers. Then I realized that it’s way too early in the morning for me to be thinking about architecture.
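
For what it's worth, here is a minimal Python sketch of the two approaches. It assumes the engine keeps each row independently with probability 0.20, which is my guess at the behavior rather than anything confirmed. The function names `bernoulli_sample` and `fixed_size_sample` are made up for illustration, and the 290 "rows" are just integers. Since 20% of 290 is 58, the fixed-size approach returns exactly 58 rows every time, while the per-row approach only averages 58 and wanders from run to run.

```python
import random

def bernoulli_sample(rows, fraction, rng):
    """One pass: keep each row independently with probability `fraction`.

    Assumption: this mimics per-row probabilistic sampling; the real
    engine's implementation is not confirmed here.
    """
    return [row for row in rows if rng.random() < fraction]

def fixed_size_sample(rows, fraction, rng):
    """Draw exactly round(len(rows) * fraction) rows without replacement."""
    k = round(len(rows) * fraction)
    return rng.sample(rows, k)

rows = list(range(290))   # stand-in for the 290-row table
rng = random.Random()     # unseeded, so every run differs

# Row counts from five runs of each approach.
bernoulli_counts = [len(bernoulli_sample(rows, 0.20, rng)) for _ in range(5)]
fixed_counts = [len(fixed_size_sample(rows, 0.20, rng)) for _ in range(5)]

print("per-row probability:", bernoulli_counts)  # varies around 58
print("fixed size:         ", fixed_counts)      # always 58
```

Under that assumption the row count follows a binomial distribution with a standard deviation of about 6.8 rows, so results like 69 and 50 are unremarkable. Which approach is actually faster depends on the engine and is not something this sketch measures.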