Service degradation
Incident Report for Uploadcare
Postmortem

Upload API and Video processing services degradation (incident #5r4zj8shr69c)

Date: 2023-10-02

Authors: Alyosha Gusev, Denis Bondar

Status: Complete, action items in progress

Summary: From 14:15 to 16:45 UTC we experienced higher latencies in the Upload API and in video processing due to very high demand for these services.

Root Causes: Cascading failure caused by an exceptionally high number of requests to the Upload API.

Trigger: A latent bug triggered by a sudden traffic spike.

Resolution: Changed our throttling policies and increased resources for processing.

Detection: Our Customer Success team detected the issue and escalated to the Engineering team.

Action Items:

Action Item | Type | Status
Test corresponding alerts for correctness | mitigate | DONE
Improve our upload processing system to remove bottleneck that we found | prevent | DONE
Fix service access issue for team members that form potential response teams | mitigate | DONE

Lessons Learned

What went well

  • Due to the distributed nature of Uploadcare, this incident had no effect on most of our services. The degradation didn’t affect storing, processing, or serving files that were already stored by Uploadcare CDN.
  • Our incident mitigation strategy was right and worked immediately.
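
The mitigation included adjusting throttling policies. This report does not describe how throttling is implemented, so the following is only a minimal sketch of one common technique, a token bucket, under stated assumptions. The class name `TokenBucket`, its parameters, and the numbers in the usage note are all hypothetical and are not taken from Uploadcare's systems.

```python
import time


class TokenBucket:
    """Hypothetical per-client rate limiter (illustration only)."""

    def __init__(self, rate_per_sec: float, capacity: float, now=time.monotonic):
        self.rate = rate_per_sec   # tokens added back per second
        self.capacity = capacity   # maximum burst size
        self.tokens = capacity     # start with a full bucket
        self.now = now             # injectable clock, which eases testing
        self.last = now()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True if the request may proceed, False if it should be throttled."""
        current = self.now()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        elapsed = current - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = current
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

In this sketch, tightening a policy means lowering `rate_per_sec` or `capacity`. For example, `TokenBucket(rate_per_sec=1.0, capacity=2.0)` lets a client send a burst of two requests and then roughly one request per second after that.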

What went wrong

  • This incident was detected manually rather than automatically because of an alert misconfiguration.
  • Due to hardened security standards in our organisation, not all incident responders had access to Statuspage to update our customers in a timely manner.
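
One way to catch an alert misconfiguration before an incident is to run the alert condition against synthetic data. The sketch below assumes a simple rule that fires when queue depth stays above a threshold for several consecutive samples. The function `should_alert`, the threshold, and the sample values are hypothetical and do not describe Uploadcare's actual monitoring setup.

```python
from typing import Sequence


def should_alert(queue_depths: Sequence[int], threshold: int, min_consecutive: int) -> bool:
    """Fire when depth stays above `threshold` for `min_consecutive` samples in a row."""
    streak = 0
    for depth in queue_depths:
        # Extend the streak while above the threshold, otherwise reset it.
        streak = streak + 1 if depth > threshold else 0
        if streak >= min_consecutive:
            return True
    return False


# Synthetic inputs: a steadily filling queue versus short, isolated spikes.
filling_queue = [10, 50, 400, 900, 1500, 2200]
brief_blips = [10, 1200, 20, 15, 1300, 10]

# The rule should fire for the sustained backlog and stay quiet for the blips.
assert should_alert(filling_queue, threshold=300, min_consecutive=3)
assert not should_alert(brief_blips, threshold=300, min_consecutive=3)
```

Requiring several consecutive samples trades a little detection delay for fewer false alarms. A check like this can run in CI so that a change to the threshold or window cannot silently disable the alert.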

Timeline

2023-10-02 (all times UTC)

  • 14:15 Our upload processing queue starts filling
  • 14:20 SERVICE DEGRADATION BEGINS
  • 15:23 Our Customer Success team escalates the issue to the Infrastructure team
  • 15:31 Issue localised
  • 15:41 Incident response team is formed
  • 15:51:13 Adjusted our throttling policies
  • 15:51:38 Increased number of processing instances
  • 16:40 SERVICE DEGRADATION ENDS. Processing queues clear
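
The intervals between the timeline entries above can be worked out as follows. This is a small sketch using only the timestamps listed in the timeline. The helper names `at` and `minutes_between` are introduced here for illustration, and the 15:51:13 entry is rounded down to 15:51.

```python
from datetime import datetime


def at(hhmm: str) -> datetime:
    """Parse a UTC time on the incident date (2023-10-02)."""
    return datetime.strptime(f"2023-10-02 {hhmm}", "%Y-%m-%d %H:%M")


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


degradation_start = at("14:20")
escalated = at("15:23")
mitigation_applied = at("15:51")  # throttling adjusted at 15:51:13, seconds dropped
degradation_end = at("16:40")

time_to_escalate = minutes_between(degradation_start, escalated)    # 63 minutes
time_to_mitigate = minutes_between(escalated, mitigation_applied)   # 28 minutes
time_to_recover = minutes_between(mitigation_applied, degradation_end)  # 49 minutes
total_degradation = minutes_between(degradation_start, degradation_end)  # 140 minutes

print(time_to_escalate, time_to_mitigate, time_to_recover, total_degradation)
```

By this breakdown, the longest single interval is the 63 minutes between the start of degradation and the escalation.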
Posted Oct 17, 2023 - 10:10 UTC

Resolved
From 14:15 to 16:45 UTC we experienced higher latencies for from_url uploads and video processing. We’ve identified the source of the problem, eliminated it, and are monitoring the situation. These services are fully functional now.
Posted Oct 02, 2023 - 17:05 UTC
This incident affected: Upload API and Processing engines (Video processing).