We've encountered increased error rates on our REST API endpoints.
This reduced our reported uptime.
Actual uptime did suffer, but not as badly as the reported figure suggests.
What happened:
- from February 25 22:00 UTC to February 26 05:50 UTC, error rates on REST API endpoints were elevated
Why it happened:
- one of the machines in the REST API fleet ran out of memory
- because of the out-of-memory (OOM) condition, the machine was unable to handle any incoming requests
- a misconfigured health check prevented the load balancer from removing the failing machine from rotation
- a share of all requests, including those from Pingdom (which reports our uptime), was sent to the failing machine
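
The failure mode above can be illustrated with a small simulation. This is only a sketch under assumptions that are not taken from the incident itself: the fleet size of three, the round-robin balancing, the machine names, and both health check functions are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str                    # hypothetical machine name
    out_of_memory: bool = False  # True models the machine that hit OOM

    def handle_request(self) -> bool:
        """Return True if the request succeeds."""
        return not self.out_of_memory


def misconfigured_health_check(backend: Backend) -> bool:
    # Assumed misconfiguration: the check never exercises the backend,
    # so it reports every machine as healthy.
    return True


def working_health_check(backend: Backend) -> bool:
    # A check that fails when the machine cannot serve requests.
    return backend.handle_request()


def error_rate(backends, health_check, requests=900):
    """Round-robin requests over machines that pass the health check."""
    in_rotation = [b for b in backends if health_check(b)]
    failures = 0
    for i in range(requests):
        backend = in_rotation[i % len(in_rotation)]
        if not backend.handle_request():
            failures += 1
    return failures / requests


# Assumed fleet of three machines, one of them out of memory.
fleet = [
    Backend("api-1"),
    Backend("api-2"),
    Backend("api-3", out_of_memory=True),
]

broken_rate = error_rate(fleet, misconfigured_health_check)
fixed_rate = error_rate(fleet, working_health_check)
print(f"misconfigured check: {broken_rate:.1%} of requests fail")
print(f"working check:       {fixed_rate:.1%} of requests fail")
```

In this sketch, the misconfigured check keeps the failing machine in rotation, so one third of the requests fail, while the working check drops it and no requests fail. The real share of failed requests depends on the actual fleet size and balancing algorithm, which this report does not state.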
What we've done:
- tracked down and terminated the failing machine
- fixed the health check configuration