Chad Callihan tries out a big improvement:
How do you go about finding the median percentile of a data set? What if you need the top x percentile? Both the APPROX_PERCENTILE_CONT and APPROX_PERCENTILE_DISC functions can be used to solve these questions.
Let’s look at how we can use each and what makes them unique.
The approximate percentiles are guaranteed to be accurate to within a certain percentage, something like 3-5%, if I remember correctly–it’s higher than HyperLogLog’s ~2.5% but not so large as to be of low value. If you’ve ever tried to calculate a median or other percentile like the 75th or 95th percentile, you might have used PERCENTILE_CONT()
in the past. At least until you get a few million rows in the table, at which point you stopped using it. My joke is, once you reach a certain table size, PERCENTILE_CONT()
becomes so slow that it’s faster to install and configure SQL Server ML Services, learn R or Python, and send in the data to calculate a percentile than to wait for PERCENTILE_CONT()
to complete.
The APPROX_PERCENTILE_*
series is way, way faster. On reasonable-sized test cases of a couple million rows or so, my recollection is two orders of magnitude better performance, so long as you can deal with being off by a few percentage points. One of the best scenarios for something like this is calculating 95th percentile response times. Does it really matter that the actual response time was 187.5ms and SQL Server said 192.6 or 181.4? Probably not—you get a good idea of the magnitude, and that’s the important part here.