It starts slowly. Maybe your home-grown centralized logging cluster becomes more difficult to operate, demanding unholy amounts of engineer time every week. Maybe engineers start to find that making a query about production is a “go get a coffee and come back later” activity. Or maybe monitoring vendors offer you a quote that elicits a response ranging anywhere from curses under the breath to blood-curdling screams of terror.
The multi-headed beast we know as Scale has reared its ugly visage.
As some of you may have already guessed from the title, I’m going to discuss one way to solve this problem, and why it's not as bad as you might think.
Take some of your precious information and throw it in the garbage. In lots of cases, you can just drop those writes on the floor as long as your observability stack is equipped to handle it.
In other words, sample.
Read on for a couple of methods. One thing I’ve taken a fancy to is collecting the first N of a particular type of message and keeping track of how often that message appears. If you get the same error for every row in a file, you probably only need to see it once, along with the number of times it happened. Or maybe you want to see a few of them to make sure they’re really the same error and not two separate errors which are being reported together because the errors aren't distinguished clearly enough.
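To make the "first N plus a count" idea concrete, here is a minimal sketch in Python. It is not taken from the linked post: the `FirstNSampler` class, its `record` and `summary` methods, and the notion of a message `key` (for example, an error type) are all hypothetical names picked for illustration, and a real version would likely need thread safety and some way to reset or expire keys.

```python
from collections import defaultdict


class FirstNSampler:
    """Keep the first `n` messages per key and count every occurrence.

    Hypothetical helper for illustration only; not from any library.
    """

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(int)     # key -> total messages seen
        self.samples = defaultdict(list)   # key -> first n messages kept

    def record(self, key, message):
        """Count the message; return True if it was kept, False if dropped."""
        self.counts[key] += 1
        if len(self.samples[key]) < self.n:
            self.samples[key].append(message)
            return True
        return False

    def summary(self):
        """Return the total count and the kept samples for each key."""
        return {
            key: {"count": self.counts[key], "samples": list(self.samples[key])}
            for key in self.counts
        }


# Usage: the same error on every row of a 1,000-row file.
sampler = FirstNSampler(n=2)
for row in range(1000):
    sampler.record("ParseError", f"row {row}: could not parse date")

report = sampler.summary()
```

Here `report["ParseError"]` holds the full count of 1,000 but only the first two messages, which is usually enough to confirm they really are the same error. Everything after the first N is dropped on the floor, so this only suits cases where the later messages add nothing new.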