REST API and Upload API outage

Incident Report for Uploadcare

Postmortem

REST and Upload API service degradation (incident #wvjpwt1qtpkn)

Date: 2024-07-08

Authors: Arsenii Baranov

Status: Complete

Summary: From 09:01:53 to 09:33:12 UTC, we experienced elevated latencies and, ultimately, a complete outage of the Upload and REST APIs.

Root Causes: PostgreSQL performance degradation caused by human error.

Trigger: Misconfiguration.

Resolution: Changed our DBA-related processes and fixed the monitoring-related issue.

Detection: Our Customer Success team detected the issue and escalated to the Engineering team.

Action Items:

Action Item                        Type      Status
Fix monitoring misconfiguration    mitigate  DONE
Improve DBA maintenance approach   prevent   DONE

Lessons Learned

What went well

  • Due to the distributed nature of Uploadcare, this incident had no effect on most of our services. The degradation didn’t affect storage, processing, or serving of files that were already stored by Uploadcare CDN.
  • Our incident mitigation strategy was correct and worked immediately.

What went wrong

  • This incident was detected manually rather than automatically because of an alert misconfiguration.
  • We failed to process API requests during the incident.

Timeline

2024-07-08 (all times UTC)

  • 08:00 Database maintenance started
  • 09:01 SERVICE DEGRADATION BEGINS
  • 09:23 Our Customer Success team escalates the issue to the Infrastructure team
  • 09:29 Issue localised and fixed
  • 09:33 SERVICE DEGRADATION ENDS
Posted Jul 16, 2024 - 14:51 UTC

Resolved

2024-07-08 09:01:53 UTC
REST API and Upload API became unavailable due to a system misconfiguration.

2024-07-08 09:33:12 UTC
We've identified the source of the problems and deployed fixes. Service has recovered. We are monitoring the situation.
Posted Jul 08, 2024 - 09:31 UTC