REST and Upload API service degradation (incident #wvjpwt1qtpkn)
Date: 2024-07-08
Authors: Arsenii Baranov
Status: Complete
Summary: From 09:01:53 to 09:33:12 UTC we've experienced higher latencies and in the end a complete outage of Upload and REST API.
Root Causes: PostgreSQL performance degradation due to human-factor.
Trigger: Misconfiguration.
Resolution: Changed our DBA-related processes, fixed monitoring related issue.
Detection: Our Customer Success team detected the issue and escalated to the Engineering team.
Action Items:
Action Item |
Type |
Status |
Fix monitoring misconfiguration |
mitigate |
DONE |
Improve DBA maintenance approach |
prevent |
DONE |
Lessons Learned
What went well
- Due to distributed nature of Uploadcare, this incident has no effect on most of our services. This degradation didn’t affect storage, processing and serving files that were already stored by Uploadcare CDN.
- Our incident mitigation strategy was right and worked immediately.
What went wrong
- This incident was detected in non-automatic way due to alert misconfiguration.
- We failed to process API request during the incident.
Timeline
2024-07-08 (all times UTC)
- 08:00 Database maintenance started
- 09:01 SERVICE DEGRADATION BEGINS
- 09:23 Our customer success team escalates issue to Infrastructure team
- 09:29 Issue localised and fixed
- 09:33 SERVICE DEGRADATION ENDS