Analysis of the December 15, 2025, Upload API Disruption
1. Summary
On December 15, 2025, the Uploadcare platform experienced a significant service disruption affecting the Upload API and REST API subsystems. The incident, spanning approximately ten hours, resulted in elevated error rates for file uploads, increased latency across API endpoints, and interruptions of Dashboard operations.
2. Root cause
The root cause was a resource contention event on Amazon Elastic File System (EFS), aggravated by an observability side-effect. A monitoring agent, executing a metadata-heavy recursive file scan, monopolized the I/O capacity of the shared NFS mount points.
This I/O saturation triggered a cascading failure through a legacy architectural pattern in which storage operations occur within active database transactions. Background processing workers, blocked on reads and writes to the stalled filesystem, held their database transactions open for extended periods. This saturated the global connection pool (PgBouncer) and effectively denied service to Upload API HTTP workers and other services attempting to run database queries, even though the underlying PostgreSQL engine remained healthy.
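To make the coupling concrete, here is a minimal sketch of the legacy pattern (table, function, and path names are hypothetical, not our production code): the EFS write happens while a database transaction is still open, so a stalled filesystem pins a pooled connection in the idle-in-transaction state.

```python
# Simplified illustration of the legacy coupling described above.
import psycopg2

def stage_upload(dsn, file_id, payload, staging_path):
    conn = psycopg2.connect(dsn)
    try:
        with conn:                      # opens a transaction, commits on exit
            with conn.cursor() as cur:
                cur.execute(
                    "UPDATE files SET status = 'staging' WHERE id = %s",
                    (file_id,),
                )
                # Blocking EFS/NFS write inside the transaction: if the mount
                # stalls, this session sits "idle in transaction" and the
                # connection is never returned to the pool.
                with open(f"{staging_path}/{file_id}", "wb") as fh:
                    fh.write(payload)
    finally:
        conn.close()
```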
3. Architectural context and system design
The Uploadcare platform is designed for scale, but this incident exposed specific coupling risks between our storage, database, and monitoring layers.
3.1 Ingestion pipeline & worker roles
The core of the affected system is the Upload API. It utilizes two distinct worker types:
- Upload API HTTP workers. These synchronous workers handle incoming HTTP requests (Direct uploads, Multipart, and from_url requests). They ingest data and schedule tasks.
- Background processing workers. These asynchronous workers pick up tasks from a queue to perform complex operations (virus scanning, image validation, property extraction, etc).
Both worker types utilize a shared temporary staging area backed by Amazon EFS to maintain a POSIX-compliant shared state across the distributed fleet.
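For orientation, a condensed sketch of this split (names, paths, and the queue are placeholders; the real pipeline uses its own broker and staging layout):

```python
# Minimal sketch of the two-worker pattern described above.
import os, shutil, uuid

STAGING_DIR = "/mnt/efs/staging"        # shared EFS mount (assumed path)

def http_worker_handle_upload(stream, task_queue):
    """Synchronous ingest: persist to shared staging, then schedule work."""
    file_id = uuid.uuid4().hex
    path = os.path.join(STAGING_DIR, file_id)
    with open(path, "wb") as dst:
        shutil.copyfileobj(stream, dst)
    task_queue.put({"file_id": file_id, "path": path})
    return file_id

def background_worker_loop(task_queue, process):
    """Asynchronous processing: scan, validate, extract, then clean up."""
    while True:
        task = task_queue.get()
        process(task["path"])           # virus scan, validation, etc.
        os.remove(task["path"])         # deletion keeps the backlog small
```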
3.2 Observability stack
To monitor the staging area's health, a Telegraf sidecar agent was configured to track file backlog depth. This was implemented via a standard Linux find command sequence.
On a network filesystem (NFS), a recursive find forces a traversal of the directory tree over the network. As file counts grow, the cost of this operation increases linearly, generating a storm of READDIRPLUS and GETATTR RPC calls.
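A rough Python rendering of what the check did (illustrative only; the actual agent shelled out to find) shows why the cost scales with file count: every entry costs at least one metadata round trip to the NFS server.

```python
# Illustrative equivalent of a recursive file count over an NFS mount.
import os

def count_backlog(staging_dir="/mnt/efs/staging"):
    total = 0
    for root, dirs, files in os.walk(staging_dir):   # READDIR(PLUS) per directory
        for name in files:
            os.stat(os.path.join(root, name))        # GETATTR round trip per file
            total += 1
    return total
```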
4. Timeline
All times are listed in Coordinated Universal Time (UTC).
- 17:17: The system begins receiving a massive surge of from_url upload requests. This workload is write-heavy.
- 17:17 – 17:23: Upload API HTTP workers successfully ingest files and write them to EFS. However, Background processing workers (responsible for processing and deleting files) begin to slow down due to the increasing I/O load. The rate of file creation exceeds the rate of processing/deletion. The file count on the volume jumps from ~40 (healthy) to ~7,000.
- 17:25: The "Noisy Neighbor" effect begins. The Telegraf find command, attempting to scan 7,000+ files, fails to complete within its 30-second timeout. We lose visibility of metrics related to the file staging area.
- 17:30: Multiple instances of the find command stack up. The resulting metadata storm severely degrades the NFS client's ability to process requests, and background processing workers effectively stall.
- 19:06: PostgreSQL begins terminating client connections due to idle-in-transaction timeouts. This confirms that workers are stuck holding connections (a query for spotting this state is sketched after the timeline).
- 20:28 – 20:51: Engineering attempts to horizontally scale the PgBouncers to absorb the held connections. This inadvertently pushes the upper layer PgBouncers toward their OS file descriptor limits.
- 21:07: Upper layer PgBouncer instances hit the Linux file descriptor limit (ulimit). New connections to this layer begin failing.
- 21:28: File descriptor limits are increased. Background processing workers begin to recover slowly as the pool stabilizes.
- 23:30: Background processing workers successfully process the backlog. The background task queue becomes empty.
- 23:41: Despite the background queue being empty, Upload API HTTP workers continue to exhibit "overload" behavior.
- 01:35 (Dec 16): Engineering reconfigures Upload API HTTP workers to connect directly to the upper PgBouncer layer.
- 02:48: EFS metrics reappear.
- 02:58: Engineers manually remove approximately 4,000 "orphaned" files left on the volume (files whose background job failed during the incident and could not self-heal). File operations on EFS return to normal.
- 02:59: Database connection pressure vanishes, and latency returns to nominal levels. Incident resolved.
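For reference, the symptom seen at 19:06 can be observed directly in pg_stat_activity. A minimal psycopg2 sketch (the DSN is a placeholder):

```python
# List sessions stuck in "idle in transaction", oldest first.
import psycopg2

QUERY = """
    SELECT pid, usename, now() - xact_start AS xact_age, state
    FROM pg_stat_activity
    WHERE state = 'idle in transaction'
    ORDER BY xact_age DESC;
"""

def stuck_transactions(dsn="postgresql://localhost/postgres"):
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            return cur.fetchall()
    finally:
        conn.close()
```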
5. Response evaluation
What went well
- Proactive alerting. Our monitoring systems successfully triggered alerts for elevated wait times before customer reports of degradation began to surface, giving the team an early head start on triage.
- Core database resilience. Despite the severe saturation of the connection poolers (PgBouncer), the underlying PostgreSQL database engine demonstrated remarkable stability. Resource utilization (CPU/Memory) on the database nodes remained nominal, preventing a total database meltdown.
- URL API isolation. The URL API (file processing and delivery) remained fully operational throughout the incident. Its architectural isolation from the ingestion pipeline ensured that end-user access to previously uploaded files was never affected.
What went wrong
- Observability became the bottleneck. The monitoring agent (telegraf), designed to provide visibility, became the primary cause of the outage. The use of an active, heavy command like find on a network filesystem was a design flaw that turned a monitoring check into a denial-of-service attack on the NFS client.
- Misleading initial signals. The initial flood of idle-in-transaction connections led the response team to focus on the database layer, which delayed the investigation into the storage layer.
6. Action items
- Eliminate "Noisy Neighbor" observability. Replace active, resource-intensive monitoring checks (specifically the recursive find command) with passive metrics. Transition to AWS CloudWatch metrics or other side-channel indicators that do not consume client-side I/O credits or burden the EFS mount (see the CloudWatch sketch after this list).
- Enforce connection pool isolation. Reconfigure PgBouncer to implement strict resource bulkheading: dedicate separate connection pools to the Upload API HTTP workers, REST API HTTP workers, and background workers. This ensures that a resource stall in the background worker fleet cannot starve the customer-facing APIs of database connections (the bulkheading idea is sketched after this list).
- Decouple storage I/O from database state. Refactor the application logic to execute blocking file storage operations (reading from and writing to EFS) outside of atomic database transactions. This ensures that storage latency results only in slower worker throughput, rather than holding database connections open (idle-in-transaction) and exhausting the global pool; a sketch of this pattern follows the list.
- Upgrade incident response tooling. Equip production containers and nodes with essential diagnostic capabilities and establish secure access protocols. This ensures engineers can diagnose kernel-level and network-level bottlenecks during active incidents.
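As a sketch of the passive approach in the first action item, backlog pressure can be inferred from server-side EFS metrics in CloudWatch without touching the mount at all (the file system ID and metric choice are illustrative):

```python
# Read EFS metadata I/O from CloudWatch instead of scanning the mount.
from datetime import datetime, timedelta
import boto3

def efs_metadata_io(fs_id="fs-0123456789abcdef0", minutes=15):
    cw = boto3.client("cloudwatch")
    resp = cw.get_metric_statistics(
        Namespace="AWS/EFS",
        MetricName="MetadataIOBytes",
        Dimensions=[{"Name": "FileSystemId", "Value": fs_id}],
        StartTime=datetime.utcnow() - timedelta(minutes=minutes),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Sum"],
    )
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
```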
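The bulkheading idea from the second action item, illustrated here with application-side psycopg2 pools rather than PgBouncer's own configuration (which is environment-specific): each worker fleet gets its own bounded pool, so one fleet stalling cannot exhaust the others' connections.

```python
# Per-fleet connection pools as an illustration of resource bulkheading.
from psycopg2 import pool

DSN = "postgresql://app@db-proxy/main"   # placeholder DSN

POOLS = {
    "upload_api_http": pool.ThreadedConnectionPool(5, 50, dsn=DSN),
    "rest_api_http":   pool.ThreadedConnectionPool(5, 50, dsn=DSN),
    "background":      pool.ThreadedConnectionPool(5, 30, dsn=DSN),
}

def run_query(fleet, sql, params=None):
    """Borrow a connection from the fleet's dedicated pool and return it."""
    conn = POOLS[fleet].getconn()
    try:
        with conn, conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall() if cur.description else None
    finally:
        POOLS[fleet].putconn(conn)
```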
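And a minimal sketch of the decoupling described in the third action item (hypothetical names): the blocking EFS write completes before any transaction is opened, so a storage stall slows the worker down without pinning a pooled connection.

```python
# Storage I/O first, then a short, separate database transaction.
import psycopg2

def stage_upload_decoupled(dsn, file_id, payload, staging_path):
    # 1. Blocking EFS write with no database connection held.
    with open(f"{staging_path}/{file_id}", "wb") as fh:
        fh.write(payload)

    # 2. A short transaction that only records the result.
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:
            cur.execute(
                "UPDATE files SET status = 'staged' WHERE id = %s",
                (file_id,),
            )
    finally:
        conn.close()
```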
7. Conclusion
The December 15 disruption was a complex failure mode involving storage I/O behavior, monitoring side effects, and database concurrency. By identifying the contention on the EFS mount (caused by the monitoring agent) as the root cause, rather than a hard AWS limit, we have a clear path to resolution. Removing the invasive monitoring script and decoupling storage operations from database transactions will insulate the Uploadcare platform from similar incidents in the future.
We apologize for the disruption and appreciate the patience of our customers as we harden our systems.