Webhook Service Degradation
Incident Report for Uploadcare
Postmortem

Incident Summary

On 25 September 2024, webhook delivery was disrupted for clients between 10:07 and 12:19 UTC. No data was lost, but webhook notifications were processed and delivered with a significant delay.

Timeline

  • 10:07 UTC – A system configuration change was made, which inadvertently disrupted webhook processing.
  • 10:07 UTC – Webhook delivery issues began.
  • 12:05 UTC – The problem was identified and a fix was applied, after which the backlogged webhooks were processed.
  • 12:12 UTC – The first webhook was successfully delivered after the fix.
  • 12:19 UTC – All queued events were processed, with delivery confirmed for all affected users.
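
As a rough cross-check of the timeline above, the intervals between its entries can be computed directly. This is an illustrative sketch only: the dictionary keys are labels invented here, and the timestamps are copied from the timeline.

```python
from datetime import datetime, timedelta

# Timestamps copied from the timeline (all UTC, 25 September 2024).
# The key names are illustrative labels, not identifiers from any real system.
DAY = "2024-09-25"
timeline = {
    "config_change": "10:07",
    "fix_applied": "12:05",
    "first_delivery_after_fix": "12:12",
    "backlog_drained": "12:19",
}


def at(label: str) -> datetime:
    """Parse one timeline entry into a datetime."""
    return datetime.strptime(f"{DAY} {timeline[label]}", "%Y-%m-%d %H:%M")


def minutes(delta: timedelta) -> int:
    """Express a duration in whole minutes."""
    return int(delta.total_seconds() // 60)


time_to_fix = at("fix_applied") - at("config_change")
backlog_drain = at("backlog_drained") - at("fix_applied")
total_impact = at("backlog_drained") - at("config_change")

print(f"change to fix:     {minutes(time_to_fix)} min")
print(f"fix to drained:    {minutes(backlog_drain)} min")
print(f"total impact:      {minutes(total_impact)} min")
```

Most of the 132-minute impact window (118 minutes) passed before the fix was applied; clearing the backlog took a further 14 minutes.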

Root Cause

The issue was caused by a configuration change that stopped the webhook delivery system from processing events correctly. Because monitoring continued to report the system as healthy, the disruption initially went undetected, which exposed gaps in the system’s monitoring tools.

Impact

  • Webhook delivery was delayed for approximately 2 hours.
  • Customers experienced delays in receiving event notifications.
  • No data was lost, but delivery delays were significant due to a backlog in event processing.

Challenges During Resolution

  • Monitoring systems indicated that components of the webhook system were healthy, which delayed identification of the underlying problem.

Resolution

  • Webhook processing was restarted, and we verified that all queued events were delivered without any data loss.
  • The incident was fully resolved by 12:19 UTC, with all webhooks processed and delivered.

Action Items

Short-term

  • Improve the system’s monitoring and alerting so that stalled webhook processing is detected even when individual components report as healthy.
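
One possible shape for such a check, offered as a minimal sketch rather than a description of the actual monitoring stack: alert on the age of the oldest undelivered event instead of relying on component liveness alone. The `QueueSnapshot` type, its fields, and the 300-second threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    """Hypothetical view of the webhook queue at one moment in time."""
    workers_alive: bool          # what a component liveness check would report
    pending_events: int          # events still waiting to be delivered
    oldest_pending_age_s: float  # age in seconds of the oldest waiting event


def should_alert(snap: QueueSnapshot, max_pending_age_s: float = 300.0) -> bool:
    """Return True when delivery looks stalled.

    A liveness-only check would stop at `workers_alive`. This check also
    looks at the outcome: if events are waiting and the oldest one has
    waited longer than the (assumed) threshold, alert even though every
    component reports healthy.
    """
    if not snap.workers_alive:
        return True
    return snap.pending_events > 0 and snap.oldest_pending_age_s > max_pending_age_s


# Components look healthy, but events have been waiting for an hour.
stalled = QueueSnapshot(workers_alive=True, pending_events=1200, oldest_pending_age_s=3600.0)
# Normal operation: a few events in flight, all recent.
normal = QueueSnapshot(workers_alive=True, pending_events=3, oldest_pending_age_s=2.0)

print(should_alert(stalled))
print(should_alert(normal))
```

The design choice here is to measure the result customers care about (events leaving the queue) rather than whether each process is running, since a process can be up without doing useful work.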

Long-term

  • Explore options to improve the resilience of our webhook delivery system, including scaling the infrastructure to better handle failures.

Posted Sep 26, 2024 - 09:00 UTC

Resolved
This incident has been resolved. We apologize for any inconvenience this may have caused.
Posted Sep 25, 2024 - 12:19 UTC

Investigating
We're experiencing a slowdown in our Webhooks service.
Posted Sep 25, 2024 - 10:07 UTC

This incident affected: Webhooks.