DynamoDB DNS automation emptied a regional endpoint
A latent race in DynamoDB's DNS management system removed all IP addresses from the US-EAST-1 regional endpoint. The first outage lasted under three hours, but dependent AWS services kept failing as recovery backlogs formed.
DynamoDB's regional endpoint was a small string with a large job: dynamodb.us-east-1.amazonaws.com had to resolve to healthy load balancers before customers or AWS services could create new DynamoDB connections in US-EAST-1. The AWS outage did not start with a database engine failure. It started with DNS automation that managed the endpoint records for a huge fleet.
That automation split planning from execution. A planner produced DNS plans; enactors in multiple Availability Zones applied those plans to Route 53. The design aimed for availability, but it depended on one condition: whenever an enactor updated an endpoint, the plan it carried still had to be the newest one, not a version that another enactor had already superseded. During this incident, one enactor was delayed while another applied a newer plan and cleaned up older plans.
The delayed enactor then wrote an older plan to the DynamoDB regional endpoint, overwriting the newer one. The newer enactor's cleanup process then deleted that older plan as stale. The endpoint had no IP addresses left, and the automation could not repair the inconsistent state without manual intervention. Any system that needed a new DynamoDB connection through the public regional endpoint began failing.
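The sequence is easier to follow when replayed step by step. The sketch below is a toy, single-threaded model of the ordering described above, not AWS's implementation: `DnsState`, `is_fresh`, `apply_plan`, `cleanup_older_than`, and the generation numbers are all hypothetical names, and the IP addresses come from the documentation range.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DnsState:
    """Hypothetical stand-in for the plan store plus one endpoint record."""
    plans: Dict[int, List[str]] = field(default_factory=dict)  # generation -> IPs
    active: Optional[int] = None  # generation the endpoint currently points at

    def resolve(self) -> List[str]:
        # If the active plan has been deleted, there is nothing to resolve to.
        return self.plans.get(self.active, [])


def is_fresh(state: DnsState, generation: int) -> bool:
    """Freshness check: is this plan newer than what the endpoint has now?"""
    return state.active is None or generation > state.active


def apply_plan(state: DnsState, generation: int) -> None:
    """Unguarded write: point the endpoint at the given plan."""
    state.active = generation


def cleanup_older_than(state: DnsState, newest: int) -> None:
    """Delete every plan older than `newest`, without checking what is active."""
    for generation in [g for g in state.plans if g < newest]:
        del state.plans[generation]


def replay_race() -> DnsState:
    state = DnsState(plans={1: ["192.0.2.10"], 2: ["192.0.2.20"]})
    # 1. The slow enactor checks freshness once, at start. Plan 1 passes.
    slow_may_apply = is_fresh(state, 1)
    # 2. The fast enactor applies the newer plan 2 while the slow one is delayed.
    apply_plan(state, 2)
    # 3. The slow enactor finally writes plan 1, relying on its earlier check.
    if slow_may_apply:
        apply_plan(state, 1)
    # 4. The fast enactor's cleanup removes plans older than the one it applied.
    cleanup_older_than(state, 2)
    return state
```

After `replay_race()`, the endpoint still points at generation 1, but that plan no longer exists, so `resolve()` returns an empty list. Plan 2 is still in the store; nothing in this toy model notices that the endpoint should be moved back to it.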
Restoring DynamoDB DNS did not end the AWS US-EAST-1 incident. EC2's launch workflow had been losing host leases while DynamoDB was unreachable, then entered congestive collapse while trying to reestablish them. Network Load Balancer health checks reacted to instances whose network state had not finished propagating. Lambda, ECS, EKS, Fargate, STS, Redshift, Connect, and console paths saw their own versions of the same pattern: the first dependency came back, but a recovery backlog had become a second incident.
All IP addresses for the regional endpoint were immediately removed.
(AWS post-event summary, October 2025)
The incident ran 14h 32m from the first signal to the all-clear.
A newer enactor's cleanup deleted the plan a stale enactor had just applied.
The immediate cause was a race between DynamoDB DNS Enactor instances. One delayed enactor applied an older DNS plan to the regional endpoint after a newer enactor had already completed a later plan. The newer enactor's cleanup then deleted the old plan, which removed all IP addresses from the active DynamoDB regional endpoint.
The deeper failure was that the DNS automation could create an inconsistent state it could not repair. A freshness check happened only when the delayed enactor started, not when it finally applied each endpoint update, and the active plan deletion blocked subsequent enactors from applying replacements without manual intervention.
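One way to picture the missing safeguards is a record that rechecks freshness at the moment of each write and a cleanup that refuses to delete whatever is active. The sketch below illustrates that idea under stated assumptions; it is not the remediation AWS shipped, and `EndpointRecord`, its methods, and the generation numbers are hypothetical.

```python
import threading
from typing import Dict, List


class EndpointRecord:
    """Hypothetical endpoint record guarded by a lock and a generation number."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.plans: Dict[int, List[str]] = {}
        self.active_generation = 0  # 0 means no plan has been applied yet

    def add_plan(self, generation: int, ips: List[str]) -> None:
        with self._lock:
            self.plans[generation] = list(ips)

    def apply(self, generation: int) -> bool:
        """Apply a plan only if it is still newer than the active one right now."""
        with self._lock:
            if generation not in self.plans:
                return False  # the plan was cleaned up before it could be applied
            if generation <= self.active_generation:
                return False  # stale: something newer is already active
            self.active_generation = generation
            return True

    def cleanup(self, older_than: int) -> List[int]:
        """Delete old plans, but never the one the endpoint points at."""
        with self._lock:
            doomed = [
                g for g in self.plans
                if g < older_than and g != self.active_generation
            ]
            for g in doomed:
                del self.plans[g]
            return sorted(doomed)

    def resolve(self) -> List[str]:
        with self._lock:
            return list(self.plans.get(self.active_generation, []))
```

Replaying the incident ordering against this record, `apply(2)` succeeds, the delayed `apply(1)` returns `False`, and `cleanup(older_than=2)` removes only plan 1, so the endpoint keeps resolving to plan 2's addresses. The comparison and the write happen under the same lock, which is what closes the gap between checking freshness and acting on it. A real distributed system would need an equivalent conditional write in its backing store, since an in-process lock does not span hosts.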