Increased REST API error rates

Incident Report for Uploadcare

Resolved

We experienced increased error rates on our REST API endpoints.
This reduced our reported uptime.
Actual uptime did suffer, but not as badly as the reported figure suggests.

What happened:

- error rates on REST API endpoints were elevated from February 25 22:00 UTC to February 26 05:50 UTC

Why that happened:

- one of the machines in the REST API fleet ran out of memory
- because of the out-of-memory (OOM) condition, the machine was unable to handle any incoming requests
- a misconfigured health check prevented the load balancer from removing the failing machine from rotation
- a share of all requests, including those from Pingdom (which reports our uptime), was still sent to the failing machine
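
To illustrate the failure mode described above, here is a minimal Python sketch of how a load balancer health check is generally expected to behave. It is not the actual Uploadcare configuration: the names `Backend` and `run_health_checks` and the threshold of 3 consecutive failures are hypothetical and chosen only for this example. The idea is that a probe which exercises a real request marks a non-responding machine as failed, and after enough consecutive failures the machine is taken out of rotation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Backend:
    """One machine behind the load balancer (hypothetical model)."""
    name: str
    probe: Callable[[], bool]  # True when the machine answers a real request
    failures: int = 0          # consecutive failed probes
    in_rotation: bool = True   # whether the load balancer routes traffic to it


def run_health_checks(backends: List[Backend],
                      unhealthy_threshold: int = 3) -> List[Backend]:
    """Probe every backend once and return those still in rotation.

    A backend is removed after `unhealthy_threshold` consecutive failed
    probes and is restored as soon as a probe succeeds again.
    """
    for backend in backends:
        try:
            ok = backend.probe()
        except Exception:
            # A timeout or refused connection counts as a failed probe.
            ok = False

        if ok:
            backend.failures = 0
            backend.in_rotation = True
        else:
            backend.failures += 1
            if backend.failures >= unhealthy_threshold:
                backend.in_rotation = False

    return [b for b in backends if b.in_rotation]


def unresponsive_probe() -> bool:
    """Stands in for a machine that cannot answer requests (e.g. after OOM)."""
    raise TimeoutError("no response")


fleet = [
    Backend("api-1", probe=lambda: True),
    Backend("api-2", probe=unresponsive_probe),
]

# Three probe rounds: the unresponsive machine leaves rotation on the third.
for _ in range(3):
    serving = run_health_checks(fleet)
```

In this sketch, a probe that always reports success regardless of the machine's real state would keep `api-2` in rotation indefinitely, which is the kind of effect a misconfigured health check can have.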

What we've done:

- tracked down and terminated the failing machine
- fixed the health check configuration

Posted Feb 25, 2018 - 12:00 UTC