A misrouted upgrade triggers an EBS re-mirroring storm.
How a traffic shift during a routine upgrade sent EBS replication traffic onto the lower-capacity secondary network instead of onto a peer router on the primary; why thousands of EBS nodes then searched simultaneously for new mirrors and saturated the cluster; and why an RDS Multi-AZ failover bug then blocked automatic recovery for 2.5% of the region's Multi-AZ instances.
EBS durability depended on each storage node keeping a replica it could reach. A primary network carried replication traffic, and a secondary network existed for redundancy. When a node lost its replica because the peer failed or the network path went away, it searched the cluster for free capacity and re-mirrored. That behavior was locally correct and had carried EBS through normal failures for years. Its quiet assumption was that too many nodes would not start searching at once.
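To make the hidden assumption concrete, here is a minimal sketch of that per-node logic in Python. Everything in it is hypothetical (the names, the capacity model); it is the shape of the behavior the postmortem describes, not AWS's code.

```python
import random

class Node:
    """A hypothetical storage node with some spare room for replicas."""
    def __init__(self, name: str, free_gb: int):
        self.name = name
        self.free_gb = free_gb

def remirror(volume_gb: int, cluster: list):
    """Search the cluster for free capacity and claim a new mirror.

    Locally correct: a single node that loses its replica will find a
    new home whenever the cluster has spare room. The quiet assumption
    is that only a few nodes run this search at any one moment.
    """
    candidates = [n for n in cluster if n.free_gb >= volume_gb]
    if not candidates:
        return None                      # nowhere to mirror right now
    target = random.choice(candidates)   # spread replicas across nodes
    target.free_gb -= volume_gb
    return target

cluster = [Node(f"node-{i}", free_gb=100) for i in range(4)]
print(remirror(80, cluster).name)        # one searcher at a time: fine
```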
On the morning of April 21, 2011, AWS was performing a routine upgrade on the primary EBS network in one US-East Availability Zone. The standard step was to shift traffic off one of the redundant routers on the primary network so it could be upgraded. The shift was executed incorrectly and the traffic landed on the lower-capacity secondary EBS network instead of the other primary router. For a large set of nodes, both intended primary paths were now gone at once. The local recovery logic kicked in. Each affected node lost its replication connection, treated it as a failed mirror, and started searching the cluster for free capacity to re-mirror to.
The cluster did not have enough free capacity for all of them. Within minutes the search exhausted what was available. The nodes did not back off; they kept looping. A separate low-probability race condition in the EBS node code, triggered by concurrent closes of large numbers of replication requests, began causing additional node failures, which fed back into the same search loop. The cluster spent hours unable to exit a self-sustaining degraded state: volumes could not serve I/O without a mirror, and the cluster had no free capacity from which to build one.
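"Did not back off" means the retry was immediate: every failed search was followed at once by another one. A hedged sketch of the contrast, with jittered exponential backoff offered as the conventional alternative (my assumption about the fix, not something the postmortem specifies):

```python
import random
import time

def search_no_backoff(find_mirror):
    # The failure mode described above: every miss triggers an
    # immediate retry, so thousands of stuck nodes become thousands
    # of tight loops of cluster-wide search traffic.
    while True:
        target = find_mirror()
        if target is not None:
            return target

def search_with_backoff(find_mirror, base=0.5, cap=60.0):
    # A conventional brake (an assumption, not what EBS ran in 2011):
    # capped exponential backoff with full jitter, so simultaneous
    # searchers spread out instead of hammering the cluster in lockstep.
    delay = base
    while True:
        target = find_mirror()
        if target is not None:
            return target
        time.sleep(random.uniform(0, delay))
        delay = min(cap, delay * 2)

# Demo with a stub that fails twice before a mirror appears.
attempts = iter([None, None, "node-7"])
print(search_with_backoff(lambda: next(attempts), base=0.01))
```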
The blast radius reached RDS. Single-AZ RDS instances in the affected zone hit stuck I/O at the same rate as the EBS volumes beneath them: 45% at peak. Multi-AZ RDS, the feature designed to survive exactly this kind of storage event, hit a previously unencountered bug. The network interruption and the primary's stuck I/O arrived in such rapid succession that 2.5% of Multi-AZ instances in the region landed in a state where automatic failover could not proceed without risking data loss. Those instances had to be repaired by hand.
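The postmortem does not publish RDS's failover state machine, so the following is a hypothetical sketch of the safety property it implies: automatic failover promotes the standby only when doing so provably loses no committed writes, and the rapid network-then-I/O failure left that unprovable.

```python
from dataclasses import dataclass

@dataclass
class ReplicaState:
    # Hypothetical fields; the real RDS internals are not public.
    primary_healthy: bool
    standby_confirmed_in_sync: bool   # last verified replication status

def can_fail_over(state: ReplicaState) -> bool:
    """Promote the standby only when no committed write can be lost.

    If the network drop and the primary's stuck I/O arrive in rapid
    succession, the system can lose its ability to verify the standby's
    sync status. Promoting anyway could discard committed writes, so
    the safe automatic answer is no, and a human repairs the instance.
    """
    if state.primary_healthy:
        return False                  # nothing to fail over from
    return state.standby_confirmed_in_sync

# The stuck 2.5%: primary gone, sync status unverifiable.
print(can_fail_over(ReplicaState(primary_healthy=False,
                                 standby_confirmed_in_sync=False)))  # False
```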
Containment came in pieces. AWS disabled Create Volume requests in the affected AZ at 02:40 PDT to keep new work from arriving. Latencies on the other EBS APIs recovered by 02:50. At 08:20 AWS cut the degraded cluster off from the regional control plane to stop it from continuing to pressure shared components. A cluster stabilization change at 11:30 cleared EC2 launch errors and contained the outage to a single Availability Zone by 12:04. The actual recovery, though, ran on a different clock. AWS added physical capacity to the cluster overnight on April 22 so that the stuck volumes had somewhere to mirror to. About 97.8% of volumes were back by midday April 22. The remaining tail required individual attention and ran through April 24, by which point 98.96% of affected EBS volumes had been recovered. The unrecoverable remainder, 0.07% of EBS volumes and 0.4% of RDS instances in the affected AZ, fell hardest on customers who depended only on what was in the single zone.
The mechanism worth carrying away is not "a network change was wrong". Network changes go wrong, and the redundant design exists for that. The mechanism is the shape of the failure: a locally correct self-repair behavior, multiplied by every node that needed it at the same moment, with no aggregate brake and no exit when capacity ran out. The misrouted traffic was the trigger. The cluster's inability to back off was the storm.
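One hypothetical shape for an aggregate brake: a cluster-level admission gate that caps how many re-mirror searches may run at once, so repair slows down under pressure instead of saturating the cluster. This illustrates the missing mechanism, not a description of what AWS later built.

```python
import threading

class RemirrorGate:
    """A hypothetical cluster-wide cap on concurrent re-mirror searches."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_begin_search(self) -> bool:
        # Non-blocking: a node that cannot get a slot defers its search
        # instead of adding to the cluster-wide traffic.
        return self._slots.acquire(blocking=False)

    def end_search(self) -> None:
        self._slots.release()

gate = RemirrorGate(max_concurrent=2)
print([gate.try_begin_search() for _ in range(3)])   # [True, True, False]
```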
The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network. // AWS, Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region
From first signal to all-clear: ~4 days.
A misrouted upgrade and a recovery mechanism with no brake.
AWS was performing a routine upgrade on the primary EBS network in one US-East Availability Zone. The standard step was to shift traffic off one of the redundant routers on the primary network so it could be upgraded. The shift was executed incorrectly, and the traffic landed on the lower-capacity secondary EBS network instead of the other primary router. For the affected nodes, both networks were now non-functional at once; in a typical single-router failure, the other primary router would have carried the load.
EBS volumes are mirrored across nodes for durability. When a node loses its replication connection, it searches the cluster for free capacity to establish a new mirror. The misrouted upgrade dropped replication connections for a large number of nodes at once, and each affected node began searching for mirrors simultaneously. Free capacity in the cluster was exhausted within minutes. The nodes did not back off sufficiently and continued searching in a tight loop, generating enough internal traffic to keep the cluster from settling. A separate race condition in the EBS node code, triggered by closing large numbers of replication requests concurrently, caused additional node failures that fed the same loop.
The cluster could not exit the loop on its own. Stuck volumes could not serve I/O without completing a mirror, and they could not complete a mirror because the cluster had no capacity to give them. RDS Multi-AZ instances that should have failed over to a healthy secondary hit a previously unencountered bug: the network interruption and the stuck I/O on the primary arrived in such rapid succession that automatic failover could not proceed without risk of data loss. As a result, 2.5% of Multi-AZ RDS instances in the region had to be repaired by hand rather than failed over. Full recovery required adding physical capacity to the cluster and bringing volumes back manually over four days.