On December 12, we had a degradation of our Upload API. Most users were unable to upload files for 3 hours 11 minutes, between Dec 12, 22:39 GMT and Dec 13, 01:50 GMT.
Requests to the Upload API were either handled extremely slowly or, in most cases, rejected by our web workers. Scaling up the Upload API fleet didn't help.
Further investigation revealed that:
— Slow requests were consuming all available web workers; with no workers available, incoming requests were rejected by nginx.
— The requests that were handled were slow because of constant lock contention on a single database table.
— The locks were caused by a dramatic change in tracked usage statistics, triggered when one of our largest customers changed their project settings.
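The failure mode in the findings above can be sketched with a small simulation (all names are illustrative, not our actual code): when every upload increments the same usage-statistics row, the row lock serializes the requests, and each request inherits the wait time of the whole queue ahead of it, even though each individual write is fast.

```python
import threading
import time

# Stand-in for the DB row lock on the contended stats table.
row_lock = threading.Lock()

def track_usage(project_id: str) -> None:
    # e.g. UPDATE usage_stats SET bytes = bytes + :n WHERE project_id = :id
    with row_lock:
        time.sleep(0.01)  # the UPDATE itself takes only ~10 ms

def handle_upload(project_id: str, latencies: list) -> None:
    start = time.monotonic()
    track_usage(project_id)  # every request contends on the same row
    latencies.append(time.monotonic() - start)

latencies: list = []
threads = [
    threading.Thread(target=handle_upload, args=("big-customer", latencies))
    for _ in range(50)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The last request waited behind all 49 ahead of it, so its latency is
# roughly 50x the cost of a single write — and while it waits, it holds
# a web worker, which is how the worker pool drained.
print(f"fastest: {min(latencies):.2f}s, slowest: {max(latencies):.2f}s")
```

This also explains why average DB load looked normal: the database was mostly idle, waiting on one hot row, not saturated with work.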
We spent most of the incident investigating and figuring out what was happening. Average DB load looked normal, so the database was initially (and wrongly) dismissed as the source of the problem. Once we found the root cause, the fix was trivial and took minutes to implement and deploy.
We turned off usage tracking for that particular customer.
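A mitigation like this is typically a per-project kill switch checked before the contended write; a minimal sketch, assuming a hypothetical `DISABLED_PROJECTS` set and `update_stats_row` helper (neither is our actual code):

```python
# Stand-in for the real DB write, recorded so we can see what gets through.
writes = []

def update_stats_row(project_id: str, nbytes: int) -> None:
    writes.append((project_id, nbytes))

# Hypothetical kill switch: projects whose usage tracking is disabled.
DISABLED_PROJECTS = {"big-customer"}

def track_usage(project_id: str, nbytes: int) -> None:
    if project_id in DISABLED_PROJECTS:
        return  # skip the contended UPDATE entirely
    update_stats_row(project_id, nbytes)

track_usage("big-customer", 1024)   # skipped by the kill switch
track_usage("small-customer", 512)  # still tracked normally
```

The appeal of this shape of fix is that it touches no schema and can be deployed in minutes, at the cost of losing stats for one customer until the proper refactor lands.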
Follow-up actions:
— Refactor statistics tracking so it does not affect our core service.
— Add more specific monitors to our DB, so we can identify problems of a similar nature much faster.
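One common shape for the refactor in the first follow-up item is to take stats writes off the request path: the request only enqueues an event, and a background flusher collapses many events into one aggregated write per project, so a contended stats table can no longer stall uploads. A minimal sketch under those assumptions (all names are illustrative):

```python
import queue
from collections import defaultdict

# In-process event queue; in production this would more likely be an
# external queue or log the stats service consumes.
events: queue.Queue = queue.Queue()

def track_usage(project_id: str, nbytes: int) -> None:
    # Request path: O(1), never touches the database.
    events.put((project_id, nbytes))

def flush() -> dict:
    # Background flusher: drain the queue and collapse N events into a
    # single update per project, e.g.
    #   UPDATE usage_stats SET bytes = bytes + :total WHERE project_id = :id
    totals: dict = defaultdict(int)
    while not events.empty():
        project_id, nbytes = events.get()
        totals[project_id] += nbytes
    return dict(totals)

# 1000 uploads from one customer become a single aggregated write.
for _ in range(1000):
    track_usage("big-customer", 1024)
totals = flush()
print(totals)  # {'big-customer': 1024000}
```

Aggregation bounds lock time on the stats row to one short write per flush interval, regardless of how sharply a customer's traffic pattern changes.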