Upload API and Video processing services degradation (incident #5r4zj8shr69c)
Date: 2023-10-02
Authors: Alyosha Gusev, Denis Bondar
Status: Complete
Summary: From 14:15 to 16:45 UTC we experienced elevated latencies in the Upload API and in video processing due to unusually high demand for these services.
Root Causes: Cascading failure caused by the combination of an exceptionally high volume of requests to the Upload API and a latent bug in our upload processing system.
Trigger: A latent bug triggered by a sudden traffic spike.
Resolution: Adjusted our throttling policies and increased processing resources (see the sketch below).
Detection: Our Customer Success team detected the issue and escalated it to the Engineering team.
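
For illustration only, the throttling adjustment can be pictured as retuning a token-bucket limiter in front of the Upload API; the class below and its rate and capacity values are hypothetical and are not our production implementation.

```python
import time

class TokenBucket:
    """Minimal token-bucket throttle. The rate and capacity numbers used
    here are illustrative assumptions, not Uploadcare's real limits."""

    def __init__(self, rate_per_sec: float, capacity: float):
        self.rate = rate_per_sec          # tokens replenished per second
        self.capacity = capacity          # maximum burst size
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available, otherwise reject the request."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Tightening the policy amounts to lowering these two knobs so the Upload API
# sheds excess load while processing capacity is scaled up.
upload_throttle = TokenBucket(rate_per_sec=50, capacity=100)
```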
Action Items:
| Action Item | Type | Status |
| --- | --- | --- |
| Test corresponding alerts for correctness | mitigate | DONE |
| Improve our upload processing system to remove the bottleneck that we found | prevent | DONE |
| Fix service access issue for team members who form potential response teams | mitigate | DONE |
Lessons Learned
What went well
- Due to the distributed nature of Uploadcare, this incident had no effect on most of our services. The degradation did not affect storage, processing, or serving of files already stored on the Uploadcare CDN.
- Our incident mitigation strategy was correct and took effect immediately.
What went wrong
- This incident was detected manually rather than by automated alerting, due to an alert misconfiguration (see the sketch after this list).
- Due to hardened security standards in our organisation, not all incident responders had access to Statuspage to update our customers in a timely manner.
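
For illustration only, the kind of automated check that should have paged on-call looks roughly like the sketch below; the metric, threshold, and evaluation window are assumed values and do not describe our actual alerting configuration.

```python
# Hypothetical queue-depth alert; threshold and window are assumptions,
# not Uploadcare's real monitoring configuration.
QUEUE_DEPTH_THRESHOLD = 10_000   # backlog size considered unhealthy
SUSTAINED_MINUTES = 5            # minutes above threshold before paging

def upload_queue_alert_should_fire(depth_per_minute: list[int]) -> bool:
    """Fire when the upload processing queue stays above the threshold
    for the whole evaluation window."""
    window = depth_per_minute[-SUSTAINED_MINUTES:]
    return len(window) == SUSTAINED_MINUTES and all(
        depth > QUEUE_DEPTH_THRESHOLD for depth in window
    )

# A steadily growing backlog, as seen from 14:15 onwards, would have paged
# well before the 15:23 manual escalation (sample numbers are made up).
if upload_queue_alert_should_fire([12_000, 15_000, 21_000, 30_000, 45_000]):
    print("PAGE: upload queue depth above threshold for 5 minutes")
```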
Timeline
2023-10-02 (all times UTC)
- 14:15 Our upload processing queue starts filling
- 14:20 SERVICE DEGRADATION BEGINS
- 15:23 Our Customer Success team escalates the issue to the Infrastructure team
- 15:31 Issue localised
- 15:41 Incident response team is formed
- 15:51:13 Adjusted our throttling policies
- 15:51:38 Increased the number of processing instances
- 16:40 SERVICE DEGRADATION ENDS. Processing queues clear