Sandhya Ramu and Vasanth Rajamani have an after-action report:
For companies and organizations, failure tends to be far more illuminating than success, and the lingering effects of a failure can be harmful if the team moves too quickly and does not resolve the issue in a thorough and transparent manner. We recently ran into a large incident involving data loss in our big data ecosystem, and by reflecting on our diagnosis and response, we hope our learnings from this impactful incident will be insightful.
Here’s what happened: roughly 2% of the machines across a handful of racks were inadvertently reimaged. This was caused by procedural gaps in our Hadoop infrastructure’s host life cycle management. Compounding our woes, the incident happened on our business-critical production cluster.
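The report attributes the reimaging to procedural gaps in host life cycle management rather than to a single bad command. As a rough illustration of the kind of guardrail that closes such gaps, here is a minimal, hypothetical sketch of a fail-closed pre-reimage check: a host is only eligible for reimaging if it is known to the inventory and fully decommissioned. The `Host` fields, the `INVENTORY` data, and the hostnames are all assumptions for illustration, not LinkedIn's actual tooling.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    cluster: str   # e.g. "prod-hadoop", "staging"
    state: str     # e.g. "in-service", "decommissioned"


# Hypothetical inventory; in practice this would come from a CMDB or
# cluster manager rather than a hard-coded dictionary.
INVENTORY = {
    "dn-0417": Host("dn-0417", cluster="prod-hadoop", state="in-service"),
    "dn-0988": Host("dn-0988", cluster="staging", state="decommissioned"),
}


def safe_to_reimage(hostname: str) -> bool:
    """Fail closed: only approve hosts that are known and fully decommissioned."""
    host = INVENTORY.get(hostname)
    if host is None:
        return False   # unknown host: never reimage by default
    if host.state != "decommissioned":
        return False   # host may still hold live data (e.g. HDFS blocks)
    return True


if __name__ == "__main__":
    for name in ("dn-0417", "dn-0988", "dn-9999"):
        action = "reimage" if safe_to_reimage(name) else "skip"
        print(f"{name}: {action}")
```

The point of the sketch is the default: anything the check cannot positively verify is skipped, so a gap in the life cycle workflow surfaces as a refused reimage rather than as lost data on a production cluster.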
Read on to understand what happened and why. It's a lesson in the importance of having a disaster recovery plan and testing it.