FM-020 · AWS · 2025-10-19 · impact 14h 32m · SEV-1

DynamoDB DNS automation emptied a regional endpoint

A latent race in DynamoDB's DNS management system removed all IP addresses from the US-EAST-1 regional endpoint. The first outage lasted under three hours, but dependent AWS services kept failing as recovery backlogs formed.

dns automation dynamodb distributed-systems regional-outage lease-management

summary

DynamoDB's regional endpoint was a small string with a large job: dynamodb.us-east-1.amazonaws.com had to resolve to healthy load balancers before customers or AWS services could create new DynamoDB connections in US-EAST-1. The AWS outage did not start with a database engine failure. It started with DNS automation that managed the endpoint records for a huge fleet.

That automation split planning from execution. A planner produced DNS plans; enactors in multiple Availability Zones applied those plans to Route 53. The design aimed for availability, but it depended on one condition: when an enactor reached an endpoint, the plan it carried had to still be the freshest safe one — not a version already superseded by another enactor mid-flight. During this incident, one enactor was delayed while another applied a newer plan and cleaned up older plans.

The delayed enactor then wrote an older plan to the DynamoDB regional endpoint, overwriting the newer one. The newer enactor's cleanup process, treating that older plan as stale, deleted it. The endpoint had no IP addresses left, and the automation could not repair the inconsistent state without manual intervention. Any system that needed a new DynamoDB connection through the public regional endpoint began failing.

Restoring DynamoDB DNS did not end the AWS US-EAST-1 incident. EC2's launch workflow had been losing host leases while DynamoDB was unreachable, then entered congestive collapse while trying to reestablish them. Network Load Balancer health checks reacted to instances whose network state had not finished propagating. Lambda, ECS, EKS, Fargate, STS, Redshift, Connect, and console paths saw their own versions of the same pattern: the first dependency came back, but a recovery backlog had become a second incident.

"All IP addresses for the regional endpoint were immediately removed." // AWS post-event summary, October 2025

timeline · PDT

From the first signal to all-clear in 14h 32m.

23:48 PDT Oct 19

DynamoDB endpoint resolution fails

The public DynamoDB regional endpoint in US-EAST-1 began returning DNS failures. Customer traffic and AWS internal services that depended on the endpoint could not establish new connections.

00:38 PDT Oct 20

DNS state identified as source

AWS engineers identified DynamoDB DNS state as the source of the outage. Temporary mitigations then focused on restoring internal access so recovery tooling could work.

02:25 PDT Oct 20

DynamoDB DNS restored

All DNS information for DynamoDB was restored. Customers recovered as cached DNS records expired, and global table replicas caught up shortly afterward.

04:14 PDT Oct 20

EC2 recovery enters congestive collapse

EC2's droplet workflow manager could not reestablish host leases before the attempts timed out and were queued again. AWS throttled incoming work and restarted manager hosts to clear queues and let lease recovery progress.

09:36 PDT Oct 20

NLB health failover disabled

Network Load Balancer health checks were flapping because new EC2 instances lacked fully propagated network state. AWS disabled automatic health-check failovers to restore available capacity.

14:20 PDT Oct 20

Primary AWS services recover

Container services recovered by 14:20 PDT. Several related services had already recovered, while some Redshift cluster workflows continued into the next day.

root cause

A cleanup run deleted the plan a stale DNS enactor had just applied.

The immediate cause was a race between DynamoDB DNS Enactor instances. One delayed enactor applied an older DNS plan to the regional endpoint after a newer enactor had already completed a later plan. The newer enactor's cleanup then deleted the old plan, which removed all IP addresses from the active DynamoDB regional endpoint.

The deeper failure was that the DNS automation could create an inconsistent state it could not repair. A freshness check happened only when the delayed enactor started, not when it finally applied each endpoint update, and the active plan deletion blocked subsequent enactors from applying replacements without manual intervention.
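The interleaving described above can be replayed in a few lines. This is a toy sketch, not AWS's implementation: EndpointDns, apply_plan, and cleanup_older_than are hypothetical names, plans are reduced to a generation number and a list of documentation-range IPs, and cleanup is simplified to "delete everything older than the plan I applied".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class EndpointDns:
    """Toy stand-in for the record store; not the real Route 53 API."""
    plans: Dict[int, List[str]] = field(default_factory=dict)  # generation -> IPs
    active: Optional[int] = None  # generation the endpoint currently points at

    def resolve(self) -> List[str]:
        # If the active plan has been deleted, the endpoint resolves to nothing.
        return self.plans.get(self.active, [])


def apply_plan(dns: EndpointDns, generation: int) -> None:
    # The flaw being modeled: no freshness re-check at write time.
    dns.active = generation


def cleanup_older_than(dns: EndpointDns, applied: int) -> None:
    # Deletes every plan older than the one this enactor applied,
    # without checking whether one of them is currently active.
    for generation in [g for g in dns.plans if g < applied]:
        del dns.plans[generation]


dns = EndpointDns(plans={1: ["192.0.2.10"], 2: ["192.0.2.20"]})

# Enactor A picked up plan 1 earlier, passed its start-time check, then stalled.
apply_plan(dns, 2)          # enactor B applies the newer plan
apply_plan(dns, 1)          # enactor A resumes and overwrites it with plan 1
cleanup_older_than(dns, 2)  # enactor B's cleanup removes plan 1

print(dns.active, dns.resolve())  # 1 []
```

The endpoint still points at generation 1, but that plan no longer exists, so resolution returns an empty list. Nothing in this model can recover on its own, which mirrors the need for manual intervention.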

contributing factors

What turned an endpoint bug into a regional cascade.

Many AWS services depended on DynamoDB during recovery.

The first failure was DynamoDB DNS, but EC2, Lambda, Redshift, STS, Support Center, and console paths all had dependencies that needed DynamoDB or systems backed by it. Once the endpoint failed, recovery work for other services started from a degraded control-plane state.

EC2 leases aged out while DynamoDB was unreachable.

Existing EC2 instances stayed healthy, but the droplet workflow manager could not refresh leases during the DynamoDB outage. When DynamoDB returned, the fleet had a large lease-recovery backlog that overwhelmed the manager and blocked new launches.
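A small simulation shows why a lease backlog of this kind may never drain by itself. It is a sketch under invented assumptions, not a model of EC2's internals: the function drain, the FIFO queue, and the numbers (1,000 hosts, 10 attempts per tick, a 5-tick timeout) are all hypothetical. An attempt that waits longer than the timeout is wasted and retried at the back of the queue.

```python
from collections import deque
from typing import Optional


def drain(backlog: int, capacity: int, timeout: int,
          admit_limit: Optional[int] = None, max_ticks: int = 10_000) -> Optional[int]:
    """Return the ticks needed to establish every lease, or None if it never finishes."""
    waiting = backlog   # hosts not yet admitted to the work queue
    queue = deque()     # enqueue tick of each queued attempt
    done = 0
    for tick in range(max_ticks):
        # Admit everything at once, or only enough to keep admit_limit queued.
        if admit_limit is None:
            n = waiting
        else:
            n = min(waiting, max(0, admit_limit - len(queue)))
        queue.extend([tick] * n)
        waiting -= n

        for _ in range(min(capacity, len(queue))):
            enqueued = queue.popleft()
            if tick - enqueued > timeout:
                queue.append(tick)  # attempt timed out; the retry joins the back
            else:
                done += 1
        if done == backlog:
            return tick + 1
    return None


print(drain(1000, capacity=10, timeout=5))                  # None
print(drain(1000, capacity=10, timeout=5, admit_limit=50))  # 100
```

With the whole backlog admitted at once, only the first 60 leases succeed; after that every attempt has waited too long by the time it is reached, so capacity is spent entirely on retries. Capping queued work at capacity × timeout keeps every attempt inside the timeout, and the same backlog drains in 100 ticks.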

Health checks reacted to incomplete network state.

NLB brought newly launched EC2 instances into service before their network configuration had fully propagated. Health checks alternated between success and failure, which removed otherwise healthy capacity and increased connection errors.

Backlogs became recovery work in their own right.

The incident did not end when DNS records returned. EC2 leases, network propagation, Lambda event sources, container launches, and Redshift replacement workflows all had queues or throttles that had to drain without overloading the recovering systems.

lessons

What to take from this incident.

Validate freshness at the point of mutation. A plan can become stale after work begins. Distributed automation that applies multi-step state should recheck generation, ownership, and deletion safety immediately before each externally visible mutation.
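One way to express this lesson is a guarded write that compares generations at the moment of mutation. The sketch below is illustrative only: GuardedEndpoint, StalePlanError, and can_delete are hypothetical names, and the in-process lock stands in for whatever conditional-write primitive the real record store offers.

```python
import threading
from typing import List


class StalePlanError(Exception):
    """Raised when a plan is not newer than what the endpoint already serves."""


class GuardedEndpoint:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.generation = 0
        self.ips: List[str] = []

    def apply(self, generation: int, ips: List[str]) -> None:
        # Check and write happen under one lock, so no other writer can
        # slip in between the freshness check and the mutation.
        with self._lock:
            if generation <= self.generation:
                raise StalePlanError(f"plan {generation} <= active {self.generation}")
            if not ips:
                raise ValueError("refusing to publish an empty record set")
            self.generation = generation
            self.ips = list(ips)

    def can_delete(self, generation: int) -> bool:
        # Deletion safety: only plans strictly older than the active one.
        with self._lock:
            return generation < self.generation


endpoint = GuardedEndpoint()
endpoint.apply(2, ["192.0.2.20"])
try:
    endpoint.apply(1, ["192.0.2.10"])  # delayed enactor arrives late
    rejected = False
except StalePlanError:
    rejected = True

print(rejected, endpoint.ips)  # True ['192.0.2.20']
```

The late write is refused instead of silently replacing the newer plan, and cleanup can ask can_delete before removing anything the endpoint still depends on.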

Design recovery loops for fleet-scale backlog. A control plane that recovers by touching every host or lease must be tested with the backlog it will accumulate during an outage. Rate limits should respond to queue depth and retry cost, not only incoming request volume.
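As a rough illustration of a rate limit that responds to queue depth and retry cost, the function below scales the admitted rate down as the backlog grows past a target. The name admit_rate, its parameters, and the inverse-proportional formula are assumptions chosen for clarity, not a policy taken from the AWS summary.

```python
def admit_rate(base_rate: float, queue_depth: int, target_depth: int,
               retry_cost: float = 1.0) -> float:
    """Return how many new requests per second to admit.

    retry_cost is a hypothetical multiplier (>= 1) for how much extra work
    each failed attempt generates when it is retried.
    """
    if target_depth <= 0 or retry_cost < 1.0:
        raise ValueError("target_depth must be positive and retry_cost >= 1")
    if queue_depth <= target_depth:
        return base_rate  # backlog is healthy; admit at the normal rate
    overload = queue_depth / target_depth
    return base_rate / (overload * retry_cost)


print(admit_rate(100.0, queue_depth=50, target_depth=100))                   # 100.0
print(admit_rate(100.0, queue_depth=400, target_depth=100))                  # 25.0
print(admit_rate(100.0, queue_depth=400, target_depth=100, retry_cost=2.0))  # 12.5
```

A limiter keyed only to incoming request volume would keep admitting 100 requests per second in all three cases, including the two where the queue is already four times its target.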

Treat health checks as load-bearing control logic. A health check that removes capacity can amplify partial recovery. Include propagation lag, flapping state, and dependent-system backlogs in load balancer failover tests.
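One hedge against flapping checks is to require several consecutive failures before removing a target and to cap how much of the fleet a single decision can remove. The sketch below assumes both controls; targets_to_remove, fail_threshold, and max_removed_fraction are hypothetical names and defaults, not NLB settings.

```python
from typing import Dict, List


def targets_to_remove(consecutive_failures: Dict[str, int],
                      fail_threshold: int = 3,
                      max_removed_fraction: float = 0.25) -> List[str]:
    """Pick targets to take out of service, worst first, within a removal budget."""
    budget = int(len(consecutive_failures) * max_removed_fraction)
    failing = sorted(
        (t for t, n in consecutive_failures.items() if n >= fail_threshold),
        key=lambda t: -consecutive_failures[t],
    )
    # Never remove more than the budget, even if more targets look unhealthy.
    return failing[:budget]


fleet = {"a": 5, "b": 3, "c": 1, "d": 4, "e": 0, "f": 0, "g": 0, "h": 0}
print(targets_to_remove(fleet))  # ['a', 'd']
```

Target c, which failed once and may simply be waiting on network propagation, stays in service. Target b meets the threshold but is kept because the budget of two removals is already spent, so a burst of failures cannot drain the fleet in one step.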

sources

Read the sources.

Summary of the Amazon DynamoDB Service Disruption in the Northern Virginia (US-EAST-1) Region

aws.amazon.com
