Upload service degradation.
Incident Report for Uploadcare
Postmortem

On December 12, we had a degradation of our Upload API. Most users were unable to upload files for 3 hours 11 minutes between Dec 12, 22:39 GMT and Dec 13, 01:50 GMT.

What happened

Requests to the Upload API were either handled extremely slowly or, in most cases, rejected by our web workers. Scaling up our Upload API fleet didn't help.

What really happened

Further investigation revealed that:
— Slow requests were consuming all available web workers, and with no workers available, incoming requests were rejected by nginx.
— Requests that did get handled were slow because of constant lock contention on a single database table.
— The locks were caused by a dramatic change in tracked usage statistics after one of our largest customers changed their project settings (the sketch below illustrates the pattern).
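
To make the failure mode concrete, here is a minimal, hypothetical sketch (the table and function names are invented for the example and are not our actual code): usage statistics are updated synchronously inside the request path, so when many requests target the same statistics row, they queue on its lock, and each request holds a web worker for the entire wait.

```python
# Hypothetical illustration of the failure mode: every upload request
# synchronously increments a single per-project statistics row.
import psycopg2


def track_usage(conn, project_id, bytes_uploaded):
    # The UPDATE takes a row-level lock; with one "hot" project row,
    # concurrent requests serialize here, and each one keeps a web
    # worker busy for the whole time it waits on the lock.
    with conn.cursor() as cur:
        cur.execute(
            """
            UPDATE project_usage_stats
               SET uploads_count  = uploads_count + 1,
                   bytes_uploaded = bytes_uploaded + %s
             WHERE project_id = %s
            """,
            (bytes_uploaded, project_id),
        )
    conn.commit()


def handle_upload(conn, project_id, file_data):
    # ... store the file ...
    # Usage tracking runs inline with the request, so a slow statistics
    # write makes the whole upload slow and ties up the worker.
    track_usage(conn, project_id, len(file_data))
```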

We spent most of the incident investigating and figuring out what was happening. Overall database load was average, so the database was initially (and wrongly) dismissed as the source of the problem. Once we found the root cause, the fix was trivial and took minutes to implement and deploy.

What we have done

We turned off usage tracking for the affected customer.

What we will do

— Refactor statistics tracking so it does not affect our core service (see the sketch after this list).
— Add more specific database monitors so we can identify problems of a similar nature much faster.
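
As a rough sketch of the planned refactoring (names and details are illustrative, not the final design): usage counters can be accumulated in memory and flushed to the database in batches by a background thread, so a slow or locked statistics table no longer blocks upload requests.

```python
# Hypothetical sketch: request handlers only touch local memory; a
# background thread batches the statistics writes.
import threading
import time
from collections import Counter


class UsageTracker:
    def __init__(self, flush_interval=5.0):
        self._counts = Counter()
        self._lock = threading.Lock()
        self._flush_interval = flush_interval
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def track(self, project_id, bytes_uploaded):
        # Called from the request path: never waits on the database.
        with self._lock:
            self._counts[(project_id, "uploads")] += 1
            self._counts[(project_id, "bytes")] += bytes_uploaded

    def _flush_loop(self):
        while True:
            time.sleep(self._flush_interval)
            with self._lock:
                pending, self._counts = self._counts, Counter()
            if pending:
                self._write_to_db(pending)

    def _write_to_db(self, pending):
        # Placeholder: batch-update the statistics table outside the
        # request path, e.g. one UPDATE per project per interval.
        pass
```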

Posted Dec 17, 2018 - 16:48 UTC

Resolved
This incident has been resolved.
Posted Dec 13, 2018 - 09:03 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Dec 13, 2018 - 01:54 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Dec 13, 2018 - 01:27 UTC
Update
We are continuing to investigate this issue.
Posted Dec 12, 2018 - 23:44 UTC
Investigating
We're experiencing issues with our Upload API.
Posted Dec 12, 2018 - 23:34 UTC
This incident affected: Upload API.