FM-003 · Amazon Web Services · 2017-02-28 · impact 4h 17m · SEV-1

The four-hour S3 typo.

How a maintenance command with one wrong input removed too much capacity from two foundational S3 subsystems in us-east-1, and why recovery was gated by cold-start paths that had not run at large-region scale in years.

storage · tooling · blast-radius

Before S3 could return or place an object, it had to know where that object belonged. The index subsystem tracked object metadata and location; the placement subsystem decided where new objects should go. Both needed enough live capacity to absorb routine maintenance. Remove too much capacity too quickly, and a regional service that normally hides server churn no longer has a safe path for reads, writes, deletes, lists, or new object placement.

On the afternoon of February 28, 2017, an engineer was investigating a billing issue in us-east-1. The debugging step was familiar: take a small set of servers out of the billing subsystem to isolate the problem, then put them back. The team had done it before. The command took a scope argument — a number specifying how many servers to target. The engineer entered a value much larger than they intended.

The tooling did not pause. It did not check whether the removal would push a subsystem below its minimum required capacity. It did not slow the operation down. It accepted the command and ran it. The engineer had meant to remove a handful of servers from a billing-related subsystem. The input removed a much larger set of servers supporting the index and placement subsystems.

Within minutes, S3 could no longer service GET, LIST, PUT, or DELETE requests in us-east-1. The failure also reached AWS services that depended on regional S3 storage: new EC2 instance launches, EBS volumes that needed data from S3 snapshots, Lambda, and the S3 console. AWS also hit a communication problem: the Service Health Dashboard remained visible, but the administration console used to update individual service status depended on S3, so AWS had to rely on the AWS Twitter feed and dashboard banner text until that path came back.

Restoring service meant fully restarting both affected subsystems. The procedures existed and worked in development, but AWS had not completely restarted the index or placement subsystems in its larger regions for years. S3 had grown, and the safety checks required to validate metadata integrity took longer than expected at us-east-1 scale. The recovery was not just "turn the servers back on"; it was a staged return of capacity through systems that had to prove their metadata was still safe to serve.

The index activated enough capacity to serve GET, LIST, and DELETE at 20:26 UTC and fully recovered at 21:18. PUT still needed placement. Placement finished at 21:54, four hours and seventeen minutes after the command ran, and dependent AWS services then began clearing their backlogs. One wrong input triggered the outage; the load-bearing design flaw was a maintenance tool that could remove capacity faster than the subsystem could safely lose it.

"We have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years."
AWS service disruption summary, February 2017

From the first signal to all-clear in 4h 17m.

17:37 UTC
Maintenance command issued with wrong scope
An S3 team member runs an established playbook command intended to remove a small number of servers from a billing-related subsystem. One input is entered incorrectly, and the command removes a much larger set of servers than intended.
17:39 UTC
Index and placement capacity removed
The removed servers support the index subsystem, which tracks object metadata and location, and the placement subsystem, which allocates storage for new writes. Both subsystems require full restarts before S3 can serve requests normally.
17:44 UTC
S3 APIs and dependent services fail
S3 cannot service GET, LIST, PUT, and DELETE requests in us-east-1 while the affected subsystems restart. AWS services that rely on S3 in the region, including EC2 launches, EBS snapshot-backed volumes, Lambda, and the S3 console, are also impacted.
18:30 UTC
Full subsystem restarts required
Engineers determine that the index and placement subsystems must be restarted. The paths have been exercised in development, but AWS has not fully restarted these subsystems in its larger regions for years.
19:18 UTC
Index restart begins
The restart proceeds more slowly than expected because metadata safety checks and capacity activation take longer at us-east-1 scale. Engineers bring capacity back carefully rather than forcing the subsystem online all at once.
19:37 UTC
Service Health Dashboard updates resume
AWS can update individual service status again after relying on the AWS Twitter feed and banner text. The Service Health Dashboard administration console had depended on S3, delaying normal status updates.
20:26 UTC
Index serves reads and deletes
The index subsystem activates enough capacity to begin serving GET, LIST, and DELETE requests. PUT requests still require the placement subsystem to recover.
21:18 UTC
Index fully recovered
The index subsystem fully recovers. The placement subsystem continues restoring the capacity needed for new object placement and PUT requests.
21:54 UTC
Placement recovered
The placement subsystem finishes recovery and S3 returns to normal operation in us-east-1. Other AWS services begin draining backlogs accumulated while S3 APIs were unavailable.

A blunt instrument on a tight system.

An engineer entered the wrong scope argument into a maintenance command, removing a much larger set of servers than intended from two S3 subsystems in us-east-1. The command was part of an established billing-debugging playbook, but the tool allowed too much capacity to be removed too quickly. A mistaken input could cross from routine maintenance into regional subsystem failure without a safeguard stopping it.

Once the capacity was removed, S3 could not simply reverse the command and continue. The index and placement subsystems each required a full restart. AWS had tested those paths in development, but had not completely restarted the subsystems in its larger regions for years. Growth in S3 meant metadata integrity checks and capacity activation took longer than expected.

The two affected subsystems shared fate during recovery. The index managed object metadata and location information needed for GET, LIST, PUT, and DELETE requests. The placement subsystem allocated storage for new objects and required a functioning index to operate correctly. PUT recovery could not complete until both paths were healthy.

What turned a small mistake into a long outage.

01
Cold-start drift
AWS had not completely restarted the index or placement subsystems in its larger regions for years. The restart paths worked in development, but production scale exposed longer metadata safety checks and capacity activation delays. The team learned the real restart time during the incident.
02
Index and placement had shared recovery fate
The placement subsystem depended on the index to allocate storage for new objects. Even after enough index capacity returned for GET, LIST, and DELETE, PUT recovery still required placement to finish. The dependency made the outage last until the slower combined recovery path completed.
03
Status dashboard administration depended on S3
AWS could not update individual Service Health Dashboard entries until 19:37 UTC because the dashboard administration console depended on S3. AWS used the AWS Twitter feed and dashboard banner text as a temporary response channel. The customer-facing communication path still had a dependency on the service in outage.
04
Unbounded blast radius in the tooling
The maintenance tool allowed too much capacity to be removed too quickly. It did not prevent an incorrect input from taking a subsystem below the capacity needed to remain healthy. The missing safety check turned a routine playbook step into a regional control action.
05
Large subsystems needed smaller cells
AWS had already been refactoring S3 into smaller partitions called cells, but the index subsystem still had more partitioning work scheduled for later in the year. Smaller cells make it possible to test recovery paths completely and limit how much of a large region a single operational mistake can affect.

What to take from this incident.

01
Cap blast radius at the tool, not at the operator.
Dangerous maintenance tools should know the minimum safe capacity for the subsystem they modify. If a command would push a service below that threshold, the tool should slow down, stage the change, or refuse to run. Operator care is not a substitute for a hard guardrail.
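
A minimal sketch of that guardrail, assuming a simple server-count model: the subsystem names, thresholds, batch size, and the plan_removal helper below are hypothetical, not AWS's actual tooling.

# Hypothetical guardrail for a capacity-removal command. Names and numbers
# are illustrative; AWS has not published its internal tooling.
from dataclasses import dataclass

@dataclass
class Subsystem:
    name: str
    live_servers: int
    min_safe_servers: int  # floor below which the subsystem cannot serve safely

MAX_BATCH = 5  # apply removals in small increments rather than all at once

def plan_removal(subsystem: Subsystem, requested: int) -> list[int]:
    """Return a staged removal plan, or raise if the request is unsafe."""
    remaining = subsystem.live_servers - requested
    if remaining < subsystem.min_safe_servers:
        raise ValueError(
            f"refusing: removing {requested} servers would leave {subsystem.name} "
            f"at {remaining}, below its safe floor of {subsystem.min_safe_servers}"
        )
    # Even a safe total is applied in small batches so health checks can run
    # between steps and an operator can abort early.
    batches, left = [], requested
    while left > 0:
        step = min(MAX_BATCH, left)
        batches.append(step)
        left -= step
    return batches

index = Subsystem("index", live_servers=100, min_safe_servers=80)
plan_removal(index, 3)      # fine: staged as [3]
# plan_removal(index, 300)  # raises before any capacity is touched

The design choice that matters is that the check runs before any capacity moves, so a wrong scope argument fails loudly instead of executing.
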
02
Partition large services so recovery can be tested completely.
A recovery path that only works in development is not proven for a large production region. Smaller cells let teams restart, validate, and time recovery paths on production-shaped slices without risking the whole service. Partitioning is not just blast-radius control; it is how recovery drills become real.
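
A sketch of how partitioning makes those drills routine: the cell names, stub restart and validation functions, and the time budget below are invented for illustration.

# Hypothetical drill runner over a cell-partitioned service. The cells, helper
# functions, and restart budget are illustrative stand-ins.
import time

CELL_RESTART_BUDGET_S = 600  # what the team believes a cold start should take

def restart_cell(cell: str) -> None:
    """Placeholder for the real cold-start path of a single cell."""
    time.sleep(0.01)

def validate_metadata(cell: str) -> bool:
    """Placeholder for the metadata integrity checks that gate serving traffic."""
    return True

def drill(cells: list[str]) -> dict[str, float]:
    """Restart each cell in isolation and record how long recovery really takes."""
    timings = {}
    for cell in cells:
        start = time.monotonic()
        restart_cell(cell)
        assert validate_metadata(cell), f"{cell}: metadata checks failed"
        timings[cell] = time.monotonic() - start
        if timings[cell] > CELL_RESTART_BUDGET_S:
            print(f"drift: {cell} cold start took {timings[cell]:.0f}s, "
                  f"budget is {CELL_RESTART_BUDGET_S}s")
    return timings

drill(["index-cell-1", "index-cell-2"])

Because each drill touches one cell, the measured restart time stays current without betting the whole region on the exercise.
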
03
Model recovery dependencies, not just runtime dependencies.
Placement depended on the index during normal operation and during recovery. Incident plans should make those dependencies explicit: which subsystem must recover first, which APIs return at each stage, and which backlogs remain after the main service is healthy.
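
One lightweight way to make that explicit, as a sketch: only the two subsystem names come from the incident summary, while the graph shape and stage labels are assumptions.

# Illustrative recovery-dependency graph; derive restart order ahead of time
# instead of rediscovering it mid-incident.
from graphlib import TopologicalSorter

# Each subsystem maps to what must already be healthy before it can recover.
RECOVERY_DEPS = {
    "index": set(),          # the index can restart on its own
    "placement": {"index"},  # placement needs a functioning index
}

# What customers get back once each subsystem is healthy.
UNLOCKS = {
    "index": ["GET", "LIST", "DELETE"],
    "placement": ["PUT"],
}

order = list(TopologicalSorter(RECOVERY_DEPS).static_order())
print("restart order:", order)  # ['index', 'placement']
for subsystem in order:
    print(f"after {subsystem}: customers regain {UNLOCKS[subsystem]}")
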
04
Incident communication tools must not depend on the service in outage.
AWS could not update individual Service Health Dashboard entries because the administration console depended on S3. Status systems need an out-of-band update path that remains writable when the affected service is down. Banner text and social channels are useful fallbacks, but they should not be the primary control surface.
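
A sketch of that separation, where both publisher classes and their failure mode are hypothetical: the point is that the fallback's storage sits outside the service being reported on.

# Hypothetical status publisher with an out-of-band fallback. Neither class
# corresponds to a real AWS API.
class PrimaryStatusConsole:
    """Normal path; imagined here as storing status pages in the same object store."""
    def publish(self, service: str, message: str) -> None:
        raise ConnectionError("object store unavailable")  # simulate the outage

class OutOfBandChannel:
    """Fallback path whose storage is independent of the affected service."""
    def publish(self, service: str, message: str) -> None:
        print(f"[banner/social] {service}: {message}")

def update_status(service: str, message: str) -> None:
    try:
        PrimaryStatusConsole().publish(service, message)
    except ConnectionError:
        # The fallback must stay writable precisely when the primary path is not.
        OutOfBandChannel().publish(service, message)

update_status("S3 us-east-1", "elevated error rates; recovery in progress")
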
05
Track recovery progress by customer-visible capability.
The index had enough capacity for GET, LIST, and DELETE before PUT fully recovered through placement. During recovery, report which operations work, which are still blocked, and which dependent services are draining backlog. A single green subsystem status can hide the work customers still experience as downtime.
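
A sketch of capability-based reporting: the operation-to-subsystem mapping follows the incident narrative above, while the report format and the capability_report helper are assumptions.

# Illustrative capability report derived from subsystem health rather than a
# single service-level status light.
REQUIRES = {
    "GET": {"index"},
    "LIST": {"index"},
    "DELETE": {"index"},
    "PUT": {"index", "placement"},
}

def capability_report(healthy: set[str]) -> dict[str, bool]:
    """Map each customer-visible operation to whether it can currently be served."""
    return {op: deps <= healthy for op, deps in REQUIRES.items()}

print(capability_report(set()))                   # 17:44 UTC: nothing serves
print(capability_report({"index"}))               # 20:26 UTC: reads and deletes return
print(capability_report({"index", "placement"}))  # 21:54 UTC: PUT restored
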

Read the original.

Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region
aws.amazon.com