Analysis of the December 15, 2025, Upload API Disruption
1. Summary
On December 15, 2025, the Uploadcare platform experienced a significant service disruption affecting the Upload API and REST API subsystems. The incident, spanning approximately ten hours, resulted in elevated error rates for file uploads, increased latency across API endpoints, and interruptions of Dashboard operations.
2. Root cause
The root cause was a resource contention event on Amazon Elastic File System (EFS), aggravated by an observability side-effect. A monitoring agent, executing a metadata-heavy recursive file scan, monopolized the I/O capacity of the shared NFS mount points.
This I/O saturation triggered a cascading failure through a legacy architectural pattern in which storage operations occur within active database transactions. Background processing workers, blocked on reads and writes to the stalled filesystem, held their database transactions open for extended periods. This saturated the global connection pool (PgBouncer) and effectively denied service to Upload API HTTP workers and other services attempting to run database queries, even though the underlying PostgreSQL engine remained healthy.
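To make the coupling concrete, here is a minimal sketch of the legacy pattern (table, function, and path names are hypothetical, not our production code): the EFS write happens while a database transaction is still open, so a stalled filesystem pins a pooled connection in the idle-in-transaction state.

```python
# Simplified illustration of the legacy coupling described above.
import psycopg2

def stage_upload(dsn, file_id, payload, staging_path):
    conn = psycopg2.connect(dsn)
    try:
        with conn:                      # opens a transaction, commits on exit
            with conn.cursor() as cur:
                cur.execute(
                    "UPDATE files SET status = 'staging' WHERE id = %s",
                    (file_id,),
                )
                # Blocking EFS/NFS write inside the transaction: if the mount
                # stalls, this session sits "idle in transaction" and the
                # connection is never returned to the pool.
                with open(f"{staging_path}/{file_id}", "wb") as fh:
                    fh.write(payload)
    finally:
        conn.close()
```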
3. Architectural context and system design
The Uploadcare platform is designed for scale, but this incident exposed specific coupling risks between our storage, database, and monitoring layers.
3.1 Ingestion pipeline & worker roles
The core of the affected system is the Upload API. It utilizes two distinct worker types:
- Upload API HTTP workers. These synchronous workers handle incoming HTTP requests (Direct uploads, Multipart, and from_url requests). They ingest data and schedule tasks.
- Background processing workers. These asynchronous workers pick up tasks from a queue to perform complex operations (virus scanning, image validation, property extraction, etc).
Both worker types utilize a shared temporary staging area backed by Amazon EFS to maintain a POSIX-compliant shared state across the distributed fleet.
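For orientation, a condensed sketch of this split (names, paths, and the queue are placeholders; the real pipeline uses its own broker and staging layout):

```python
# Minimal sketch of the two-worker pattern described above.
import os, shutil, uuid

STAGING_DIR = "/mnt/efs/staging"        # shared EFS mount (assumed path)

def http_worker_handle_upload(stream, task_queue):
    """Synchronous ingest: persist to shared staging, then schedule work."""
    file_id = uuid.uuid4().hex
    path = os.path.join(STAGING_DIR, file_id)
    with open(path, "wb") as dst:
        shutil.copyfileobj(stream, dst)
    task_queue.put({"file_id": file_id, "path": path})
    return file_id

def background_worker_loop(task_queue, process):
    """Asynchronous processing: scan, validate, extract, then clean up."""
    while True:
        task = task_queue.get()
        process(task["path"])           # virus scan, validation, etc.
        os.remove(task["path"])         # deletion keeps the backlog small
```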
3.2 Observability stack
To monitor the staging area's health, a Telegraf sidecar agent was configured to track file backlog depth. This was implemented via a standard Linux find command sequence.
On a network filesystem (NFS), a recursive find forces a traversal of the directory tree over the network. As file counts grow, the cost of this operation increases linearly, generating a storm of READDIRPLUS and GETATTR RPC calls.
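A rough Python rendering of what the check did (illustrative only; the actual agent shelled out to find) shows why the cost scales with file count: every entry costs at least one metadata round trip to the NFS server.

```python
# Illustrative equivalent of a recursive file count over an NFS mount.
import os

def count_backlog(staging_dir="/mnt/efs/staging"):
    total = 0
    for root, dirs, files in os.walk(staging_dir):   # READDIR(PLUS) per directory
        for name in files:
            os.stat(os.path.join(root, name))        # GETATTR round trip per file
            total += 1
    return total
```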
4. Timeline
All times are listed in Coordinated Universal Time (UTC).
- 17:17: The system begins receiving a massive surge of from_url upload requests. This workload is write-heavy.
- 17:17 – 17:23: Upload API HTTP workers successfully ingest files and write them to EFS. However, Background processing workers (responsible for processing and deleting files) begin to slow down due to the increasing I/O load. The rate of file creation exceeds the rate of processing/deletion. The file count on the volume jumps from ~40 (healthy) to ~7,000.
- 17:25: The "Noisy Neighbor" effect begins. The Telegraf find command, attempting to scan 7,000+ files, fails to complete within its 30-second timeout. We lose visibility of metrics related to the file staging area.
- 17:30: Multiple instances of the find command stack up. The resulting metadata storm severely degrades the NFS client's ability to process requests, and background processing workers effectively stall.
- 19:06: PostgreSQL begins terminating client connections due to idle-in-transaction timeouts. This confirms that workers are stuck holding connections (a query for spotting this state is sketched after the timeline).
- 20:28 – 20:51: Engineering attempts to horizontally scale the PgBouncers to absorb the held connections. This inadvertently pushes the upper layer PgBouncers toward their OS file descriptor limits.
- 21:07: Upper layer PgBouncer instances hit the Linux file descriptor limit (ulimit). New connections to this layer begin failing.
- 21:28: File descriptor limits are increased. Background processing workers begin to recover slowly as the pool stabilizes.
- 23:30: Background processing workers successfully process the backlog. The background task queue becomes empty.
- 23:41: Despite the background queue being empty, Upload API HTTP workers continue to exhibit "overload" behavior.
- 01:35 (Dec 16): Engineering reconfigures Upload API HTTP workers to connect directly to the upper PgBouncer layer.
- 02:48: EFS metrics reappear.
- 02:58: Engineers manually remove approximately 4,000 "orphaned" files left on the volume (files whose background job failed during the incident and could not self-heal). File operations on EFS return to normal.
- 02:59: Database connection pressure vanishes, and latency returns to nominal levels. Incident resolved.
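For reference, the symptom seen at 19:06 can be observed directly in pg_stat_activity. A minimal psycopg2 sketch (the DSN is a placeholder):

```python
# List sessions stuck in "idle in transaction", oldest first.
import psycopg2

QUERY = """
    SELECT pid, usename, now() - xact_start AS xact_age, state
    FROM pg_stat_activity
    WHERE state = 'idle in transaction'
    ORDER BY xact_age DESC;
"""

def stuck_transactions(dsn="postgresql://localhost/postgres"):
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            return cur.fetchall()
    finally:
        conn.close()
```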
5. Response evaluation
What went well
- Proactive alerting. Our monitoring systems successfully triggered alerts for elevated wait times before customer reports of degradation began to surface, giving the team an early head start on triage.
- Core database resilience. Despite the severe saturation of the connection poolers (PgBouncer), the underlying PostgreSQL database engine demonstrated remarkable stability. Resource utilization (CPU/Memory) on the database nodes remained nominal, preventing a total database meltdown.
- URL API isolation. The URL API (file processing and delivery) remained fully operational throughout the incident. Its architectural isolation from the ingestion pipeline ensured that end-user access to previously uploaded files was never affected.
What went wrong
- Observability became the bottleneck. The monitoring agent (telegraf), designed to provide visibility, became the primary cause of the outage. The use of an active, heavy command like find on a network filesystem was a design flaw that turned a monitoring check into a denial-of-service attack on the NFS client.
- Misleading initial signals. The initial flood of idle-in-transaction connections led the response team to focus on the database layer, which delayed the investigation into the storage layer.
6. Action items
- Eliminate "Noisy Neighbor" observability. Replace active, resource-intensive monitoring checks (specifically the recursive find command) with passive metrics. Transition to AWS CloudWatch metrics or other side-channel indicators that do not consume client-side I/O credits or burden the EFS mount (see the CloudWatch sketch after this list).
- Enforce connection pool isolation. Reconfigure PgBouncer to implement strict resource bulkheading: dedicate separate connection pools to the Upload API HTTP workers, REST API HTTP workers, and background workers. This ensures that a resource stall in the background worker fleet cannot starve the customer-facing APIs of database connections (the bulkheading idea is sketched after this list).
- Decouple storage I/O from database state. Refactor the application logic to execute blocking file storage operations (reading from and writing to EFS) outside of atomic database transactions. This ensures that storage latency results only in slower worker throughput, rather than holding database connections open (idle-in-transaction) and exhausting the global pool; a sketch of this pattern follows the list.
- Upgrade incident response tooling. Equip production containers and nodes with essential diagnostic capabilities and establish secure access protocols. This ensures engineers can diagnose kernel-level and network-level bottlenecks during active incidents.
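As a sketch of the passive approach in the first action item, backlog pressure can be inferred from server-side EFS metrics in CloudWatch without touching the mount at all (the file system ID and metric choice are illustrative):

```python
# Read EFS metadata I/O from CloudWatch instead of scanning the mount.
from datetime import datetime, timedelta
import boto3

def efs_metadata_io(fs_id="fs-0123456789abcdef0", minutes=15):
    cw = boto3.client("cloudwatch")
    resp = cw.get_metric_statistics(
        Namespace="AWS/EFS",
        MetricName="MetadataIOBytes",
        Dimensions=[{"Name": "FileSystemId", "Value": fs_id}],
        StartTime=datetime.utcnow() - timedelta(minutes=minutes),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Sum"],
    )
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
```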
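The bulkheading idea from the second action item, illustrated here with application-side psycopg2 pools rather than PgBouncer's own configuration (which is environment-specific): each worker fleet gets its own bounded pool, so one fleet stalling cannot exhaust the others' connections.

```python
# Per-fleet connection pools as an illustration of resource bulkheading.
from psycopg2 import pool

DSN = "postgresql://app@db-proxy/main"   # placeholder DSN

POOLS = {
    "upload_api_http": pool.ThreadedConnectionPool(5, 50, dsn=DSN),
    "rest_api_http":   pool.ThreadedConnectionPool(5, 50, dsn=DSN),
    "background":      pool.ThreadedConnectionPool(5, 30, dsn=DSN),
}

def run_query(fleet, sql, params=None):
    """Borrow a connection from the fleet's dedicated pool and return it."""
    conn = POOLS[fleet].getconn()
    try:
        with conn, conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall() if cur.description else None
    finally:
        POOLS[fleet].putconn(conn)
```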
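And a minimal sketch of the decoupling described in the third action item (hypothetical names): the blocking EFS write completes before any transaction is opened, so a storage stall slows the worker down without pinning a pooled connection.

```python
# Storage I/O first, then a short, separate database transaction.
import psycopg2

def stage_upload_decoupled(dsn, file_id, payload, staging_path):
    # 1. Blocking EFS write with no database connection held.
    with open(f"{staging_path}/{file_id}", "wb") as fh:
        fh.write(payload)

    # 2. A short transaction that only records the result.
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:
            cur.execute(
                "UPDATE files SET status = 'staged' WHERE id = %s",
                (file_id,),
            )
    finally:
        conn.close()
```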
7. Conclusion
The December 15 disruption was a complex failure mode involving storage I/O behavior, monitoring side effects, and database concurrency. By identifying the contention on the EFS mount (caused by the monitoring agent) as the root cause, rather than a hard AWS limit, we have a clear path to resolution. Removing the invasive monitoring script and decoupling storage operations from database transactions will insulate the Uploadcare platform from similar incidents in the future.
We apologize for the disruption and appreciate the patience of our customers as we harden our systems.