FM-013 · AWS · 2011-04-21 · impact ~4 days · SEV-1

A misrouted upgrade triggers an EBS re-mirroring storm.

How a traffic shift during a routine upgrade sent EBS replication traffic onto the lower-capacity secondary network instead of a peer router on the primary, why thousands of EBS nodes then searched simultaneously for new mirrors and saturated the cluster, and why an RDS Multi-AZ failover bug prevented automatic recovery for 2.5% of Multi-AZ instances in the region.

storage · networking · cascade

EBS durability depended on each storage node keeping a replica it could reach. A primary network carried replication traffic, and a secondary network existed for redundancy. When a node lost its replica because the peer failed or the network path went away, it searched the cluster for free capacity and re-mirrored. That behaviour was locally correct and had carried EBS through normal failures for years. Its quiet assumption was that too many nodes would not start searching at once.
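In rough pseudocode terms, that per-node logic might look like the sketch below. The names (`find_free_capacity`, `remirror_to`, `has_healthy_mirror`) are hypothetical stand-ins, not the actual EBS node code; what matters is what the loop never asks, namely how many other nodes are doing the same thing at the same moment.

```python
# Hypothetical sketch of the per-node self-repair behaviour described above.
# Names and structure are illustrative, not the actual EBS node software.

def on_replica_connection_lost(volume, cluster):
    """Locally correct self-repair: lost a mirror, go find a new one."""
    while not volume.has_healthy_mirror():
        candidate = cluster.find_free_capacity(volume.size)  # search the cluster
        if candidate is not None:
            volume.remirror_to(candidate)                     # copy data to the new peer
        # Quiet assumption: a candidate will usually exist, and only a few nodes
        # will ever be in this loop at once. There is no check of how many peers
        # are searching, and no backoff when the search comes up empty.
```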

On the morning of April 21, 2011, AWS was performing a routine upgrade on the primary EBS network in one US-East Availability Zone. The standard step was to shift traffic off one of the redundant routers on the primary network so it could be upgraded. The shift was executed incorrectly and the traffic landed on the lower-capacity secondary EBS network instead of the other primary router. For a large set of nodes, both intended primary paths were now gone at once. The local recovery logic kicked in. Each affected node lost its replication connection, treated it as a failed mirror, and started searching the cluster for free capacity to re-mirror to.

The cluster did not have enough free capacity for all of them. Within minutes the search exhausted what was available. The nodes did not back off; they kept looping. A separate low-probability race condition in the EBS node code, triggered by concurrent closes of large numbers of replication requests, started causing additional node failures, which fed back into the same search loop. The cluster spent hours unable to exit a self-sustaining degraded state: volumes could not serve I/O without a mirror, and no mirrors were available to give them.

The blast radius reached RDS. Single-AZ RDS instances in the affected zone hit stuck I/O at the same rate as the EBS volumes they sat on: 45% at peak. Multi-AZ RDS, the feature designed to survive exactly this kind of storage event, hit a bug that had never been encountered before. The rapid succession of the network interruption and the primary's stuck I/O left 2.5% of Multi-AZ instances in the region in a state where automatic failover could not safely proceed without risking data loss. Those instances had to be repaired by hand.

Containment came in pieces. AWS disabled Create Volume requests in the affected AZ at 02:40 PDT to keep new work from arriving. Latencies on the other EBS APIs recovered by 02:50. At 08:20 AWS cut the degraded cluster off from the regional control plane to stop it from continuing to pressure shared components. A cluster stabilisation change at 11:30 cleared EC2 launch errors and contained the outage to a single Availability Zone by 12:04. The actual recovery, though, ran on a different clock. AWS added physical capacity to the cluster overnight on April 22 so that the stuck volumes had somewhere to mirror to. About 97.8% of volumes were back by midday April 22. The remaining tail required individual attention and ran through April 24, when 98.96% of affected EBS volumes had been recovered. The unrecoverable tail was 0.07% of EBS volumes and 0.4% of RDS instances in the affected AZ, the worst case for customers who depended only on what was in the single zone.

The mechanism worth carrying away is not "a network change was wrong". Network changes go wrong, and the redundant design exists for that. The mechanism is the shape of the failure: a locally correct self-repair behaviour, multiplied by every node that needed it at the same moment, with no aggregate brake and no exit when capacity ran out. The misrouted traffic was the trigger. The cluster's inability to back off was the storm.

"The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network."
// AWS, Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region

From the first signal to all-clear in ~4 days.

Apr 21, 00:47 PDT
Network change misroutes EBS replication traffic
AWS executes a routine upgrade on the primary EBS network in one US-East Availability Zone. The standard step is to shift traffic off one of the redundant routers so it can be upgraded. The traffic shift is executed incorrectly and sends the traffic to the lower-capacity secondary EBS network instead of the other primary router. Both networks are now non-functional for the affected nodes.
Apr 21, 00:48 PDT
Mass re-mirroring begins
EBS nodes that lose their replication connection treat it as a failed mirror and begin searching for free capacity to re-mirror their data. Thousands of nodes do this simultaneously. The free capacity in the cluster is exhausted within minutes, and nodes enter a tight loop searching for capacity that does not exist.
Apr 21, 02:40 PDT
Create Volume disabled in affected AZ
AWS disables Create Volume requests in the affected Availability Zone to stop new work from arriving at a cluster that cannot find capacity. Existing degraded volumes continue to search for mirrors.
Apr 21, 02:50 PDT
Other EBS API latencies recover
Latencies and error rates for the non-create EBS APIs recover once Create Volume requests stop arriving at the degraded cluster. The stuck volumes remain stuck.
Apr 21, 05:30 PDT
Error rates climb again across the region
Error rates rise across the broader region as the affected cluster keeps pressure on shared EBS control plane components. A race condition in the EBS node code, triggered by concurrently closing large numbers of replication requests, begins causing additional node failures.
Apr 21, 08:20 PDT
Degraded cluster isolated from the control plane
AWS cuts off communication between the degraded cluster and the regional EBS control plane to prevent the cluster's behaviour from continuing to degrade APIs and other Availability Zones.
Apr 21, 11:30 PDT
Cluster stabilisation change deployed; EC2 launches resume
AWS deploys a change that stabilises the affected cluster. EC2 launch errors clear and the outage is largely contained to the single Availability Zone.
Apr 21, 12:04 PDT
Impact contained to one Availability Zone
Customer impact outside the affected AZ has largely cleared. Inside the AZ, 13% of EBS volumes remain stuck and 45% of single-AZ RDS instances have stuck I/O at peak.
Apr 22, 02:00 PDT
New capacity being added to the cluster
AWS adds physical capacity to the affected cluster to give the stuck volumes somewhere to mirror to. Recovery now scales with how fast new mirrors can be established and validated, not with how fast software can be patched.
Apr 22, 12:30 PDT
About 97.8% of volumes restored
Roughly 97.8% of the affected EBS volumes have been recovered. The remaining volumes need manual intervention.
Apr 23, 11:30 PDT
Backlog processing begins
AWS begins processing the backlog of operations that built up during the incident, including delayed EBS snapshots and API calls.
Apr 23, 15:35 PDT
EBS control plane access restored
Access to the EBS control plane is restored for the affected Availability Zone. Customer-initiated API calls can resume.
Apr 23, 18:15 PDT
API access restored to affected AZ
API access in the affected AZ returns to normal.
Apr 24, 12:30 PDT
98.96% of volumes recovered
Roughly 98.96% of affected EBS volumes are recovered. About 0.07% of volumes and 0.4% of RDS instances in the affected AZ are ultimately unrecoverable. The incident closes after four days.

A misrouted upgrade and a recovery mechanism with no brake.

AWS was performing a routine upgrade on the primary EBS network in one US-East Availability Zone. The standard step was to shift traffic off one of the redundant routers on the primary network so it could be upgraded. The shift was executed incorrectly, and the traffic landed on the lower-capacity secondary EBS network instead of the other primary router. For the affected nodes, both networks were now non-functional — unlike a typical single-router failure, where the other primary router would have carried the load.

EBS volumes are mirrored across nodes for durability. When a node loses its replication connection, it searches the cluster for free capacity to establish a new mirror. The misrouted upgrade dropped replication connections for a large number of nodes at once, and each affected node began searching for mirrors simultaneously. Free capacity in the cluster was exhausted within minutes. The nodes lacked aggressive backoff and continued searching in a tight loop, generating enough internal traffic to keep the cluster from settling. A separate race condition in the EBS node code — triggered by closing large numbers of replication requests concurrently — caused additional node failures that fed the same loop.

The cluster could not exit the loop on its own. Stuck volumes could not serve I/O without completing a mirror, and they could not complete a mirror because the cluster had no capacity to give them. RDS Multi-AZ instances that should have failed over to a healthy secondary hit a bug that had never been encountered before: the rapid succession of the network interruption and the stuck I/O on the primary triggered a state in which automatic failover could not safely proceed without risk of data loss. In all, 2.5% of Multi-AZ RDS instances in the region had to be repaired by hand rather than failed over. Full recovery required adding physical capacity to the cluster and bringing volumes back manually over four days.

What turned a single-AZ network mistake into a four-day storage incident.

01
Traffic shift sent replication to the lower-capacity network
The upgrade procedure expected traffic to move from one primary EBS router to its peer. Instead it landed on the redundant secondary network, which was sized for a different role and could not carry the replication load. Both intended primary paths went away for the affected nodes at the same time, a failure mode the cluster's mirroring logic assumed could not happen.
02
Mass simultaneous re-mirroring with no rate limit
EBS's per-node mirror-recovery behaviour was locally correct: lose a replica, find another. There was no cluster-level rate limit on how many nodes could re-mirror at the same time, so when a network event affected a large fraction of nodes, the recovery mechanism saturated the cluster instead of healing it.
03
Search loop had no aggressive backoff
When free capacity was exhausted, nodes continued searching in tight loops. Without exponential backoff or a circuit breaker that recognised 'no capacity is available right now', the cluster generated enough internal traffic to keep itself in the degraded state and to put pressure on shared control-plane components.
04
Race condition in concurrent replication-request closes
A separate, low-probability race condition in the EBS node code was triggered when nodes concurrently closed large numbers of replication requests. The race caused additional node failures, which fed back into the same re-mirroring loop and prolonged the incident.
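The report does not describe the defect beyond this, so the snippet below is only a generic illustration of the kind of check-then-act hazard that concurrent closes can expose, not the actual EBS bug: two threads both pass the membership check before either removes the entry, and the loser then tears down state that is already gone.

```python
# Illustrative only: a generic check-then-act race on concurrent closes.
# This is not the actual EBS node defect, which the report does not detail.
import threading

open_requests = {}         # request_id -> connection object (hypothetical)
lock = threading.Lock()    # used by the safe variant to serialise check-and-remove

def close_request_racy(request_id):
    # Two threads can both pass this check before either removes the entry;
    # the second close then operates on an already-torn-down object.
    if request_id in open_requests:
        conn = open_requests[request_id]
        conn.close()
        del open_requests[request_id]   # KeyError / double-close under concurrency

def close_request_safe(request_id):
    # Atomically claim the entry, then close it outside the shared state.
    with lock:
        conn = open_requests.pop(request_id, None)
    if conn is not None:
        conn.close()
```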
05
RDS Multi-AZ failover hit a previously unseen bug
Multi-AZ RDS instances were supposed to fail over automatically when the primary's I/O stalled. The rapid succession of network interruption and stuck I/O triggered a state the failover code had not seen before, and 2.5% of Multi-AZ instances in the region required manual intervention. The redundancy mechanism shared an unanticipated failure mode with the thing it was supposed to protect against.

What to take from this incident.

01
Rate-limit aggregate self-repair, not just per-node retries.
Durability mechanisms that trigger collectively during a network event can amplify the very failure they are trying to protect against. A cluster-level limit on how many nodes can re-mirror at the same time turns a region-wide event from a saturating storm into a controlled recovery.
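One way to sketch that brake, assuming a coordinator that nodes consult before re-mirroring; the names and the limit are illustrative, and a real cluster would hold the slot count in a coordination service rather than a process-local semaphore.

```python
# Sketch of a cluster-level brake on simultaneous re-mirroring.
# Hypothetical design, not AWS's implementation.
import threading

class RemirrorAdmission:
    """Caps how many volumes may re-mirror concurrently across the cluster."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_acquire(self) -> bool:
        # Non-blocking: a node that cannot get a slot should wait and retry
        # with backoff instead of searching in a tight loop.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()

admission = RemirrorAdmission(max_concurrent=50)  # illustrative limit

def maybe_remirror(volume, cluster):
    if not admission.try_acquire():
        return False        # cluster already repairing at its limit; back off
    try:
        target = cluster.find_free_capacity(volume.size)
        if target is None:
            return False    # no capacity: also a back-off case, not a retry loop
        volume.remirror_to(target)
        return True
    finally:
        admission.release()
```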
02
Treat 'no capacity right now' as a state the system has to handle.
A node that cannot find a mirror needs to back off, not loop. Aggressive backoff, circuit breakers that recognise cluster-wide capacity exhaustion, and an explicit waiting state prevent the recovery mechanism from being its own load source.
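A minimal sketch of that behaviour, with `find_free_capacity`, `remirror_to`, and `mark_waiting_for_capacity` as hypothetical helpers: failed searches wait exponentially longer, with jitter so the whole cluster does not retry in lockstep, and repeated failures land the volume in an explicit, observable waiting state instead of another lap of the loop.

```python
import random
import time

# Sketch only: exponential backoff with jitter plus an explicit
# "waiting for capacity" state. The helper methods are hypothetical.

def remirror_with_backoff(volume, cluster, base=1.0, cap=300.0, give_up_after=8):
    delay = base
    for _ in range(give_up_after):
        target = cluster.find_free_capacity(volume.size)
        if target is not None:
            volume.remirror_to(target)
            return "mirrored"
        # No capacity right now: sleep an exponentially growing, jittered
        # interval instead of immediately searching again.
        time.sleep(random.uniform(0, delay))
        delay = min(delay * 2, cap)
    # Explicit terminal state the control plane can observe and act on,
    # rather than an endless tight loop generating its own load.
    volume.mark_waiting_for_capacity()
    return "waiting_for_capacity"
```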
03
Validate the destination of a traffic shift, not just the syntax of the command.
A change that moves traffic between paths should check that the destination has the capacity for what is about to land on it. A traffic-engineering change without that check is one typo away from sending production load somewhere that cannot carry it.
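A hedged sketch of such a pre-flight check; the router objects, their methods, and the headroom figure are assumptions made for the example, not AWS tooling.

```python
# Hypothetical pre-flight check for a traffic shift; not AWS's change tooling.

def validate_traffic_shift(source, destination, headroom=0.8):
    """Refuse a shift whose destination cannot absorb the moving load."""
    moving_gbps = source.current_load_gbps()
    spare_gbps = destination.capacity_gbps() * headroom - destination.current_load_gbps()
    if moving_gbps > spare_gbps:
        raise ValueError(
            f"refusing shift: {moving_gbps:.1f} Gbps would land on "
            f"{destination.name}, which has only {spare_gbps:.1f} Gbps of headroom"
        )

# In the 2011 event this kind of check would have fired: the secondary EBS
# network was sized for a different role and could not carry the primary's
# replication load.
```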
04
Make availability features fail to safety, not to silent stuck states.
RDS Multi-AZ was designed to protect against exactly this kind of storage event and hit a state where it could not act without risking data loss. When the safe-failover path is unavailable, the system should fail to a clearly observable state with an explicit human-intervention contract, not to a silent inaction that customers discover by experiencing downtime.
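One way to sketch that contract; the state names and the paging hook are assumptions for the example, not RDS internals.

```python
from enum import Enum

# Illustrative failover gate. State names and page_oncall are assumptions,
# not the RDS implementation.

class FailoverState(Enum):
    HEALTHY = "healthy"
    FAILED_OVER = "failed_over"
    NEEDS_OPERATOR = "needs_operator"   # explicit, observable, alarmed

def attempt_failover(primary, standby, page_oncall):
    if standby.is_caught_up() and not primary.might_have_unreplicated_writes():
        standby.promote()
        return FailoverState.FAILED_OVER
    # Cannot prove the failover is safe: do not guess, and do not stall silently.
    page_oncall(f"failover blocked for {primary.name}: possible data loss; "
                "manual decision required")
    return FailoverState.NEEDS_OPERATOR
```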
05
Plan for recovery that scales with capacity, not with code.
Once stuck volumes existed at scale, recovery time was bounded by how fast AWS could add physical capacity and bring volumes back manually. The lesson is that some failure modes have a hardware-shaped recovery floor; planning for incidents at this scale means owning the procurement, staging, and manual procedures that floor implies.
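As a purely illustrative back-of-envelope, with every number invented for the sketch and none taken from the report: once recovery is capacity-bound, its floor is roughly the amount of stuck data divided by how fast newly added nodes can absorb new mirrors.

```python
# Purely illustrative arithmetic; every figure below is hypothetical,
# none comes from the AWS report.
stuck_data_tb = 500           # data waiting for new mirrors
nodes_added = 40              # replacement storage nodes brought online
ingest_per_node_mbps = 400    # sustained re-mirroring rate per new node

# Mbps -> TB/hour: megabits * 3600 s, / 8 bits per byte, / 1e6 MB per TB
total_ingest_tb_per_hour = nodes_added * ingest_per_node_mbps * 3600 / 8 / 1e6
recovery_floor_hours = stuck_data_tb / total_ingest_tb_per_hour
print(f"recovery floor ~ {recovery_floor_hours:.0f} hours")   # ~69 hours here
```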

Read the original.

Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region
aws.amazon.com
← previous
FM-012 · Heroku's entire platform rides one AWS region down
next →
FM-014 · An automation bug deschedules Google's network control plane