⚡ Incident Report · Neon-Osprey-33a · Public company infrastructure · 1000+ engineers · 2025
01 The Setup
We operate one of the largest cloud infrastructure platforms in the world, running hundreds of interdependent services across dozens of regions. Our DynamoDB service in us-east-1 — Northern Virginia — is the backbone of an enormous number of internal AWS systems, not just customer workloads. EC2 instance lifecycle management, network configuration state, Lambda, ECS, EKS, Fargate — all of them quietly depend on DynamoDB for state coordination. We have a fully automated DNS management system to keep regional endpoints healthy. It consists of two decoupled components: a DNS Planner that monitors load balancer health and generates DNS update plans, and a DNS Enactor that applies those plans via Route 53. We run three parallel Enactors, one per availability zone, for redundancy. The design assumption was that eventual consistency across Enactors would handle any timing gaps. We had never seen this assumption fail — until 11:48 PM PDT on October 19, 2025.
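To make that division of labor concrete, here is a minimal sketch of the Planner/Enactor split, assuming a plan is essentially a timestamped snapshot of healthy endpoint IPs. The names (`DnsPlan`, `generate_plan`, `apply_plan`) are illustrative stand-ins, not the real internals:

```python
# Illustrative sketch of the Planner/Enactor split; DnsPlan, generate_plan,
# and apply_plan are hypothetical names, not AWS internals.
import time
from dataclasses import dataclass

@dataclass
class DnsPlan:
    plan_id: int
    created_at: float               # epoch seconds, used for staleness checks
    records: dict[str, list[str]]   # endpoint name -> healthy IP addresses

def generate_plan(plan_id: int, healthy_ips: list[str]) -> DnsPlan:
    """Planner: snapshot load-balancer health into a DNS update plan."""
    return DnsPlan(plan_id, time.time(),
                   {"dynamodb.us-east-1.amazonaws.com": healthy_ips})

def apply_plan(plan: DnsPlan) -> None:
    """Enactor: push the plan's records to Route 53 (stubbed as a print)."""
    for name, ips in plan.records.items():
        print(f"UPSERT {name} -> {ips}")

apply_plan(generate_plan(1, ["198.51.100.7", "198.51.100.8"]))
```

Each of the three Enactors runs this apply loop independently; nothing in the design forces them to agree on which plans are still live.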
02 What Happened
At 11:48 PM PDT on October 19, our monitoring systems detected a spike in DynamoDB API error rates across us-east-1. The first thing on-call engineers saw was a flood of DNS resolution failures against `dynamodb.us-east-1.amazonaws.com`. The DNS record was not slow — it was empty. No IP addresses. The endpoint had simply vanished from Route 53.
Here is what had happened: DNS Enactor #1 was processing a routine DNS update plan but had fallen behind due to unusually high processing delays — the exact cause of that slowdown remains unclear from the postmortem. While Enactor #1 was working through its backlog, the DNS Planner, unaware of the delay, continued generating new plans at its normal pace. DNS Enactor #2, which was processing plans in parallel, moved through the queue quickly and then ran its stale-plan cleanup routine. It identified the plan still in progress by Enactor #1 as "old" and deleted it — along with all the IP addresses it was supposed to preserve. The cleanup routine's logic assumed any plan older than a threshold was safe to remove. It was not designed to check whether another Enactor was actively applying that plan.
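A minimal reconstruction of that failure mode, under the assumptions above: the cleanup below treats plan age as the only deletion criterion, which is exactly the gap described. The threshold and plan layout are hypothetical, since the real Enactor internals are not public:

```python
# Hypothetical reconstruction of the stale-plan cleanup bug; the threshold
# and plan layout are assumptions, not the real Enactor internals.
import time

STALE_AFTER_S = 600   # assumed staleness threshold

def cleanup_stale_plans(plans: dict[int, dict], now: float) -> list[int]:
    """Buggy cleanup: a plan's age is the ONLY deletion criterion.

    Nothing here checks whether another Enactor is still applying the
    plan, so a slow-but-active plan is deleted along with its records.
    """
    deleted = []
    for plan_id, plan in list(plans.items()):
        if now - plan["created_at"] > STALE_AFTER_S:
            del plans[plan_id]        # BUG: may still be in flight elsewhere
            plan["records"].clear()   # the endpoint's IPs vanish with it
            deleted.append(plan_id)
    return deleted

# Enactor #1 fell behind and has been applying plan 41 for 15 minutes;
# Enactor #2's cleanup sees only its age and deletes it mid-apply.
plans = {41: {"created_at": time.time() - 900,
              "records": {"dynamodb.us-east-1.amazonaws.com": ["198.51.100.7"]}}}
print(cleanup_stale_plans(plans, time.time()))   # -> [41]
```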
The result: a clean, authoritative, empty DNS record for DynamoDB's entire us-east-1 regional endpoint was published to Route 53 and propagated globally. Every system that tried to resolve the DynamoDB endpoint — internal or external — got back nothing.
The cascade started immediately. The DropletWorkflow Manager (DWFM), which maintains state leases for the physical servers running EC2 instances, depends on DynamoDB to coordinate instance lifecycle transitions. With DNS gone, DWFM could not write or read lease state. Any EC2 operation requiring a state change (launch, stop, terminate, resize) began failing with "insufficient capacity" errors. This was a lie; there was plenty of capacity. The instances just could not prove their state to DWFM.
As DynamoDB recovered at around 2:25 AM PDT, DWFM did what any sensible system would do when reconnecting after an outage: it tried to re-establish all of its leases across the entire EC2 fleet simultaneously. This created a thundering herd. DWFM entered what engineers described as "congestive collapse" — so many lease re-establishment requests were in flight that timeouts began cascading, making the congestion worse. The system that was supposed to heal itself was now actively preventing recovery.
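For contrast, here is a sketch of the staggered, jittered reconnection that would have blunted the herd. `reestablish_lease` is a hypothetical stand-in for the real lease call, and the batch size and delays are invented for illustration:

```python
# Sketch of staggered lease re-establishment with capped exponential
# backoff and full jitter; reestablish_lease and all constants are
# hypothetical, not DWFM internals.
import random
import time

BATCH_SIZE = 500       # bound how many hosts are reconnected per wave
BASE_DELAY_S = 1.0
MAX_DELAY_S = 60.0

def reestablish_lease(host: str) -> bool:
    """Stub for the real lease call; succeeds ~90% of the time here."""
    return random.random() > 0.1

def reconnect_fleet(fleet: list[str]) -> None:
    """Re-establish leases one bounded wave at a time, never all at once."""
    for start in range(0, len(fleet), BATCH_SIZE):
        for host in fleet[start:start + BATCH_SIZE]:
            delay = BASE_DELAY_S
            while not reestablish_lease(host):
                # Full jitter keeps failed hosts from retrying in lockstep
                # and re-forming the herd.
                time.sleep(random.uniform(0, delay))
                delay = min(delay * 2, MAX_DELAY_S)

reconnect_fleet([f"droplet-{i}" for i in range(100)])
```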
The Network Manager, which handles routing configuration for new EC2 instances, had accumulated a large backlog of pending network state changes during the outage. As EC2 instances started coming back, Network Manager was so congested that new instances were failing health checks because their network configuration had not propagated yet. Network Load Balancers, seeing unhealthy instances, removed them from rotation — making the recovery look like a second wave of failures.
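One mitigation for that second wave, sketched under obvious assumptions of our own (this is not how AWS's health checks actually work): let the health evaluator see propagation state, so "config still pending" is never conflated with "unhealthy":

```python
# Hypothetical health evaluation that distinguishes "broken" from
# "config still propagating"; all names are illustrative.
from enum import Enum

class Verdict(Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"        # eligible for removal from rotation
    PROVISIONING = "provisioning"  # config pending: do NOT remove

def evaluate(instance: dict) -> Verdict:
    if not instance["network_config_applied"]:
        # During the backlog, instances landed here but were treated as
        # UNHEALTHY, so NLBs pulled them and recovery looked like failure.
        return Verdict.PROVISIONING
    return Verdict.HEALTHY if instance["passes_probe"] else Verdict.UNHEALTHY

print(evaluate({"network_config_applied": False, "passes_probe": False}))
```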
At peak, over 70 AWS services were impacted. Signal, Snapchat, Starbucks, Coinbase, Reddit, Ring, Amazon.com itself, Amazon Alexa, and Apple's streaming services all went down for tens of thousands of users.
03 Timeline
T+0 (11:48 PM PDT, Oct 19) — DynamoDB API error rates spike in us-east-1. DNS record for regional endpoint is empty.
T+~30min — Engineers confirm Route 53 has published an empty DNS record. DynamoDB DNS automation is disabled globally.
T+~2.5h (2:25 AM PDT, Oct 20) — DynamoDB service recovers. DWFM begins thundering-herd lease re-establishment, triggering congestive collapse.
T+~5h — EC2 instance state changes partially recover after manual throttling of DWFM re-establishment requests.
T+~8h — Network Manager backlog begins to clear. New instances start passing health checks. NLBs begin restoring instance traffic.
T+~11h (10:48 AM PDT) — Majority of dependent services restored.
T+14h (1:50 PM PDT) — Full recovery confirmed across all 70+ affected services. Incident closed.
04 The Resolution
Root cause: a race condition between two parallel DNS Enactors caused one to classify the other's in-progress plan as stale and delete it, publishing an empty DNS record for `dynamodb.us-east-1.amazonaws.com` to Route 53. The immediate fix for DynamoDB itself was to manually restore the DNS records and disable the automated DNS management system globally until a safeguard that prevents cleanup of actively applied plans is in place.
Recovery from the cascade took significantly longer than DynamoDB recovery itself — approximately 11 additional hours — due to DWFM's thundering-herd re-establishment behavior and Network Manager's backlog. Engineers had to manually throttle DWFM request rates to break the congestive collapse cycle and gradually drain the network configuration backlog. No data was lost. EC2 instances that were already running were unaffected; only state transitions failed. Customers experienced launch failures, invocation failures in Lambda/ECS/EKS, and widespread DNS resolution errors.
Lessons: What We Learned
01 Distributed cleanup routines that delete 'stale' state must verify that no other worker is actively using that state before deleting it. Eventually consistent systems are not safe for destructive operations without coordination.
02 When a core dependency recovers from an outage, do not allow all dependents to reconnect simultaneously. Staggered reconnection with jitter and backoff (sketched earlier) prevents a thundering-herd collapse from turning a 3-hour outage into a 14-hour one.
03 An empty DNS record is not the same as a DNS failure: most monitoring systems alert on resolution errors, not on resolving to nothing. Assert that DNS records are non-empty, not merely resolvable (see the sketch after this list).
04 Internal AWS services being customers of other AWS services means a single regional control-plane failure can produce a blast radius that surprises even the teams who built the dependent systems. Map your blast radius explicitly; do not rely on intuition.
05 Publishing a postmortem within days is valuable, but leaving the root cause of the initial slowdown (why did Enactor #1 fall behind?) publicly unanswered means that contributing cause may recur. Surface all contributing causes, not just the triggering one.
06 SLA percentages are measured against a calendar year: a 99.9% SLA allows 0.1% of 8,760 hours, roughly 8.8 hours of downtime per year, so a 14-hour regional outage blows through that entire annual error budget in a single event. Design recovery procedures to match the blast radius of your worst-case failure mode, not your average one.
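A concrete version of lesson 03, using the third-party dnspython library (version 2.0 or later): treat a present-but-empty answer as a failure, not a success. The monitored name is the real endpoint; the probe itself is our sketch, not any existing monitoring system:

```python
# Monitoring probe that distinguishes "failed to resolve" from "resolved
# to an empty answer"; requires the third-party dnspython package (>= 2.0).
import dns.resolver

def assert_nonempty(name: str, rtype: str = "A") -> list[str]:
    """Raise if the record set is missing OR present but empty."""
    try:
        answer = dns.resolver.resolve(name, rtype)
    except dns.resolver.NXDOMAIN:
        raise AssertionError(f"{name}: name does not exist")
    except dns.resolver.NoAnswer:
        # The failure mode in this incident: the name still exists,
        # but no records of the requested type sit behind it.
        raise AssertionError(f"{name}: empty {rtype} record set")
    ips = [record.address for record in answer]
    if not ips:
        raise AssertionError(f"{name}: zero {rtype} records")
    return ips

print(assert_nonempty("dynamodb.us-east-1.amazonaws.com"))
```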
What I'd Do Differently
The DNS management system's cleanup logic was the proximate cause, but the deeper issue was an untested assumption: that two Enactors would never operate on the same plan simultaneously in a destructive way. I would have added an explicit distributed lock — or at minimum a compare-and-swap check against a shared lease record — before any Enactor performs a deletion. Cleanup routines that can empty production DNS records should require quorum confirmation from all active Enactors, not just a time-based staleness heuristic.
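As a sketch of that guard, assume Enactors heartbeat a shared apply-lease while working, and cleanup refuses to delete any plan with a live lease. `PlanStore` and every name here are hypothetical; the in-process lock stands in for whatever atomic compare-and-swap or conditional-write primitive the real store provides:

```python
# Hypothetical apply-lease guard: an Enactor heartbeats while applying a
# plan, and cleanup refuses to delete any plan with a live lease. The
# threading.Lock stands in for a real CAS / conditional-write primitive.
import threading
import time

APPLY_LEASE_TTL_S = 120      # how long a heartbeat keeps a plan protected

class PlanStore:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.plans: dict[int, dict] = {}
        self.apply_leases: dict[int, float] = {}   # plan_id -> last heartbeat

    def heartbeat(self, plan_id: int) -> None:
        """Called periodically by an Enactor while it applies a plan."""
        with self._lock:
            self.apply_leases[plan_id] = time.time()

    def delete_if_unclaimed(self, plan_id: int, now: float) -> bool:
        """Delete only when no Enactor holds a live apply lease."""
        with self._lock:
            if now - self.apply_leases.get(plan_id, 0.0) < APPLY_LEASE_TTL_S:
                return False           # another Enactor is mid-apply: refuse
            self.plans.pop(plan_id, None)
            self.apply_leases.pop(plan_id, None)
            return True

store = PlanStore()
store.plans[41] = {"records": {"dynamodb.us-east-1.amazonaws.com": ["198.51.100.7"]}}
store.heartbeat(41)                                  # Enactor #1 is applying
print(store.delete_if_unclaimed(41, time.time()))    # False: delete refused
```

The TTL matters: if an Enactor crashes mid-apply, its lease expires and cleanup eventually proceeds, so the guard fails safe in both directions.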
On recovery: I would have pre-written a runbook for 'DynamoDB us-east-1 is down for more than 30 minutes' that explicitly throttles DWFM reconnection attempts. The thundering herd on recovery was entirely predictable given the scale of the fleet, yet the system had no built-in protection against it. That is a runbook failure, not just a design failure.