In retrospect, the failure chain had just 4 links:
1. The RDS MySQL database instance backing our production environment suffered a sudden, massive performance degradation: P95 `query_time` went from 11 milliseconds to over 1000 milliseconds, and write throughput dropped from 780 operations/second to 5/second, all within 20 seconds.
2. RDS did not classify this as a failure, so the Multi-AZ feature never activated to fail over to the hot standby.
3. Because of the delays from the increased `query_time`, the Go MySQL client's connection pool filled up with connections waiting on slow query results, and it opened more connections to compensate.
4. The connection count exceeded the `max_connections` setting on the MySQL server, leaving cron jobs and newly started daemons unable to connect to the database and triggering a flood of `Error 1040: Too many connections` log messages.
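Link 3 is possible because Go's `database/sql` pool has no cap on open connections by default (`SetMaxOpenConns` defaults to unlimited), so under slow queries the pool can grow until it blows past the server's `max_connections`. A minimal sketch of one mitigation, assuming hypothetical numbers: divide a connection budget across app instances, reserving headroom for cron jobs and admin access, and pass the result to `db.SetMaxOpenConns`. The helper name and all the figures below are illustrative, not from the post-mortem.

```go
package main

import "fmt"

// poolLimit computes a per-instance cap on open connections:
// take the server's max_connections, subtract headroom reserved
// for cron jobs / admin sessions, and split the rest across
// application instances. Clamped to at least 1.
func poolLimit(maxConnections, appInstances, reserved int) int {
	limit := (maxConnections - reserved) / appInstances
	if limit < 1 {
		limit = 1
	}
	return limit
}

func main() {
	// Hypothetical setup: max_connections=1000, 8 app instances,
	// 100 connections reserved for cron jobs and manual access.
	limit := poolLimit(1000, 8, 100)
	fmt.Println(limit)
	// Each instance would then call db.SetMaxOpenConns(limit)
	// on its *sql.DB so the fleet's total stays under max_connections.
}
```

With a cap in place, a slow-query episode makes requests queue for a pooled connection instead of opening new ones, so the failure stays inside the application rather than exhausting the server.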
This was a fascinating read, and I applaud companies that make these kinds of post-mortems public, especially because publicizing the reasons for failures is such a scary prospect.