The Magic Of Sampling

Nathan LeClaire reminds us of an important story that statisticians have been telling us for a couple of centuries:

It starts slowly. Maybe your home-grown centralized logging cluster becomes more difficult to operate, demanding unholy amounts of engineer time every week. Maybe engineers start to find that making a query about production is a “go get a coffee and come back later” activity. Or maybe monitoring vendors offer you a quote that elicits a response ranging anywhere from curses under the breath to blood-curdling screams of terror.

The multi-headed beast we know as Scale has reared its ugly visage.

As some of you may have already guessed from the title, I’m going to discuss one way to solve this problem, and why it might not be as bad as you think.

Take some of your precious information and throw it in the garbage. In lots of cases, you can just drop those writes on the floor as long as your observability stack is equipped to handle it.

In other words, sample.

Read on for a couple of methods. One approach I’ve taken a fancy to is collecting the first N of a particular type of message and keeping track of how often that message appears. If you get the same error for every row in a file, you might really only need to see it once, along with the number of times it happened. Or maybe you want to see a few of them to be sure they’re really the same error and not two separate errors being reported together due to insufficient error separation.
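As a rough illustration of that “keep the first N, then just count” idea, here is a minimal Python sketch. The class and parameter names (FirstNSampler, max_samples) are mine for illustration, not from LeClaire’s post or any particular observability library:

```python
from collections import defaultdict

class FirstNSampler:
    """Keep the first N examples of each message type; count everything."""

    def __init__(self, max_samples=5):
        self.max_samples = max_samples
        self.samples = defaultdict(list)   # message type -> retained example messages
        self.counts = defaultdict(int)     # message type -> total occurrences seen

    def record(self, message_type, message):
        """Count every message; retain only the first max_samples per type."""
        self.counts[message_type] += 1
        if len(self.samples[message_type]) < self.max_samples:
            self.samples[message_type].append(message)
            return True    # message was kept
        return False       # message was dropped on the floor, but still counted

    def summary(self):
        """Return retained examples plus total counts per message type."""
        return {
            mtype: {"count": self.counts[mtype], "examples": self.samples[mtype]}
            for mtype in self.counts
        }


# Example: the same parse error on every row of a file collapses to a few
# representative examples plus an accurate total count.
sampler = FirstNSampler(max_samples=3)
for row in range(10_000):
    sampler.record("parse_error", f"could not parse row {row}")

result = sampler.summary()["parse_error"]
print(result["count"])          # 10000
print(len(result["examples"]))  # 3
```

The point of keeping the count alongside the samples is that you lose the individual log lines but not the signal: you still know how often the error fired, and you have a few concrete examples to confirm it really is one error rather than several lumped together.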
