The four-hour S3 typo.
How a maintenance command with one wrong input removed too much capacity from two foundational S3 subsystems in us-east-1, and why recovery was gated by cold-start paths that had not run at large-region scale in years.
Before S3 could return or place an object, it had to know where that object belonged. The index subsystem tracked object metadata and location; the placement subsystem decided where new objects should go. Both needed enough live capacity to absorb routine maintenance. Remove too much capacity too quickly, and a regional service that normally hides server churn no longer has a safe path for reads, writes, deletes, lists, or new object placement.
On the afternoon of February 28, 2017, an engineer was investigating a billing issue in us-east-1. The debugging step was familiar: take a small set of servers out of the billing subsystem to isolate the problem, then put them back. The team had done it before. The command took a scope argument — a number specifying how many servers to target. The engineer entered a value much larger than they intended.
The tooling did not pause. It did not check whether the removal would push a subsystem below its minimum required capacity. It did not slow the operation down. It accepted the command and ran it. The engineer had meant to remove a handful of servers from a billing-related subsystem. The input removed a much larger set of servers supporting the index and placement subsystems.
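If the tool had enforced even a simple floor, the command would have failed loudly instead of executing. Here is a minimal sketch in Python of that kind of guardrail; every name in it (Subsystem, remove_servers, the capacity fields) is hypothetical, not AWS's actual tooling:

```python
from dataclasses import dataclass


@dataclass
class Subsystem:
    """Hypothetical model of a capacity pool; not AWS's actual tooling."""
    name: str
    live_servers: int
    min_capacity: int       # floor below which the subsystem cannot serve safely
    max_removal_batch: int  # largest removal a single command may perform


def remove_servers(sub: Subsystem, count: int) -> None:
    """Refuse removals that are too large or too deep, then apply the rest."""
    # Guardrail 1: cap the batch size, so capacity can only leave slowly.
    if count > sub.max_removal_batch:
        raise ValueError(
            f"{count} servers exceeds the max batch of {sub.max_removal_batch} "
            f"for {sub.name}; split the removal into smaller steps"
        )
    # Guardrail 2: never let live capacity fall below the minimum floor.
    remaining = sub.live_servers - count
    if remaining < sub.min_capacity:
        raise ValueError(
            f"removing {count} would leave {sub.name} at {remaining} servers, "
            f"below its minimum of {sub.min_capacity}"
        )
    sub.live_servers = remaining
```

AWS's post-incident summary describes exactly these two fixes: the tool was modified to remove capacity more slowly, and safeguards were added to block any removal that would take a subsystem below its minimum required capacity.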
Within minutes, S3 could no longer service GET, LIST, PUT, or DELETE requests in us-east-1. The failure also reached AWS services that depended on regional S3 storage: new EC2 instance launches, EBS volumes that needed data from S3 snapshots, Lambda, and the S3 console. AWS also hit a communication problem. The Service Health Dashboard remained visible, but the administration console used to update individual service status depended on S3, so AWS had to rely on the AWS Twitter feed and dashboard banner text until that path came back.
Restoring service meant fully restarting both affected subsystems. The procedures existed and worked in development, but AWS had not completely restarted the index or placement subsystems in its larger regions for years. S3 had grown, and the safety checks required to validate metadata integrity took longer than expected at us-east-1 scale. The recovery was not just "turn the servers back on"; it was a staged return of capacity through systems that had to prove their metadata was still safe to serve.
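The shape of that recovery can be sketched abstractly. The loop below is hypothetical; validate_metadata and enable_traffic are stand-ins, not S3 internals. But the gating is the point: no capacity serves requests until its metadata checks pass, so recovery time scales with the amount of metadata being validated, not with the size of the command that caused the outage.

```python
# Hypothetical staged restart; validate_metadata and enable_traffic are
# stand-ins for whatever S3's real restart procedures do.
def staged_restart(shards, validate_metadata, enable_traffic, batch_size=100):
    """Return capacity to service in batches, gating each batch on checks."""
    activated = []
    for start in range(0, len(shards), batch_size):
        batch = shards[start:start + batch_size]
        for shard in batch:
            # The integrity check is the gate: capacity that cannot prove
            # its metadata is consistent never rejoins the serving fleet.
            if not validate_metadata(shard):
                raise RuntimeError(f"shard {shard!r} failed metadata validation")
        # Only validated capacity begins serving, which is why the restart
        # takes longer as a region's metadata grows.
        enable_traffic(batch)
        activated.extend(batch)
    return activated
```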
The index activated enough capacity to serve GET, LIST, and DELETE at 20:26 UTC and fully recovered at 21:18. PUT still needed placement. Placement finished at 21:54, four hours and seventeen minutes after the command ran, and dependent AWS services then began clearing their backlogs. One wrong input triggered the outage; the load-bearing design flaw was a maintenance tool that could remove capacity faster than the subsystem could safely lose it.
We have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years.
(AWS service disruption summary, February 2017)
From the errant command to all-clear: 4h 17m.
A blunt instrument on a tight system.
An engineer entered the wrong scope argument into a maintenance command, removing a much larger set of servers than intended from two S3 subsystems in us-east-1. The command was part of an established billing-debugging playbook, but the tool allowed too much capacity to be removed too quickly. A mistaken input could cross from routine maintenance into regional subsystem failure without a safeguard stopping it.
Once the capacity was removed, S3 could not simply reverse the command and continue. The index and placement subsystems each required a full restart. AWS had tested those paths in development, but had not completely restarted the subsystems in its larger regions for years. Growth in S3 meant metadata integrity checks and capacity activation took longer than expected.
The two affected subsystems had shared fate during recovery. The index managed object metadata and location information needed for GET, LIST, PUT, and DELETE requests. The placement subsystem allocated storage for new objects and required a functioning index to operate correctly. PUT recovery could not complete until both paths were healthy.
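A small sketch makes that dependency chain concrete. The function below is hypothetical, but it encodes the ordering the timeline shows: GET, LIST, and DELETE return once the index is healthy; PUT waits for both subsystems.

```python
# Hypothetical model of the recovery ordering described above; not AWS code.
def available_operations(index_healthy: bool, placement_healthy: bool) -> set[str]:
    ops: set[str] = set()
    if index_healthy:
        # GET, LIST, and DELETE depend only on the index, which is why
        # they returned at 20:26 UTC while PUT stayed down.
        ops |= {"GET", "LIST", "DELETE"}
    if index_healthy and placement_healthy:
        # PUT needs placement to allocate storage for new objects, and
        # placement itself requires a functioning index, so PUT recovers last.
        ops.add("PUT")
    return ops
```

Called with index_healthy=True and placement_healthy=False, the model reproduces the window between 20:26 and 21:54 UTC: reads, lists, and deletes flowing while new writes still fail.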