Upload API and Video processing services degradation (incident #5r4zj8shr69c)
Date: 2023-10-02
Authors: Alyosha Gusev, Denis Bondar
Status: Complete
Summary: From 14:15 to 16:45 UTC we experienced elevated latencies in the Upload API and in video processing due to unusually high demand for these services.
Root Causes: Cascading failure caused by the combination of an exceptionally high volume of requests to the Upload API and a latent bug in our upload processing system.
Trigger: A latent bug triggered by a sudden traffic spike.
Resolution: Adjusted our throttling policies and increased processing resources (see the sketch below).
Detection: Our Customer Success team detected the issue and escalated it to the Engineering team.
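
For illustration only, the throttling adjustment can be pictured as retuning a token-bucket limiter in front of the Upload API; the class below and its rate and capacity values are hypothetical and are not our production implementation.

```python
import time

class TokenBucket:
    """Minimal token-bucket throttle. The rate and capacity numbers used
    here are illustrative assumptions, not Uploadcare's real limits."""

    def __init__(self, rate_per_sec: float, capacity: float):
        self.rate = rate_per_sec          # tokens replenished per second
        self.capacity = capacity          # maximum burst size
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available, otherwise reject the request."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Tightening the policy amounts to lowering these two knobs so the Upload API
# sheds excess load while processing capacity is scaled up.
upload_throttle = TokenBucket(rate_per_sec=50, capacity=100)
```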
Action Items:
| Action Item | Type | Status |
| --- | --- | --- |
| Test corresponding alerts for correctness | mitigate | DONE |
| Improve our upload processing system to remove the bottleneck that we found | prevent | DONE |
| Fix service access issue for team members who form potential response teams | mitigate | DONE |
Lessons Learned
What went well
- Due to the distributed nature of Uploadcare, this incident had no effect on most of our services. The degradation did not affect storage, processing, or serving of files already stored on the Uploadcare CDN.
- Our incident mitigation strategy was correct and took effect immediately.
What went wrong
- This incident was detected manually rather than by automated alerting, due to an alert misconfiguration (see the sketch after this list).
- Due to hardened security standards in our organisation, not all incident responders had access to Statuspage to update our customers in a timely manner.
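
For illustration only, the kind of automated check that should have paged on-call looks roughly like the sketch below; the metric, threshold, and evaluation window are assumed values and do not describe our actual alerting configuration.

```python
# Hypothetical queue-depth alert; threshold and window are assumptions,
# not Uploadcare's real monitoring configuration.
QUEUE_DEPTH_THRESHOLD = 10_000   # backlog size considered unhealthy
SUSTAINED_MINUTES = 5            # minutes above threshold before paging

def upload_queue_alert_should_fire(depth_per_minute: list[int]) -> bool:
    """Fire when the upload processing queue stays above the threshold
    for the whole evaluation window."""
    window = depth_per_minute[-SUSTAINED_MINUTES:]
    return len(window) == SUSTAINED_MINUTES and all(
        depth > QUEUE_DEPTH_THRESHOLD for depth in window
    )

# A steadily growing backlog, as seen from 14:15 onwards, would have paged
# well before the 15:23 manual escalation (sample numbers are made up).
if upload_queue_alert_should_fire([12_000, 15_000, 21_000, 30_000, 45_000]):
    print("PAGE: upload queue depth above threshold for 5 minutes")
```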
Timeline
2023-10-02 (all times UTC)
- 14:15 Our upload processing queue starts filling
- 14:20 SERVICE DEGRADATION BEGINS
- 15:23 Our Customer Success team escalates the issue to the Infrastructure team
- 15:31 Issue localised
- 15:41 Incident response team is formed
- 15:51:13 Adjusted our throttling policies
- 15:51:38 Increased the number of processing instances
- 16:40 SERVICE DEGRADATION ENDS. Processing queues clear