Heroku's entire platform rides one AWS region down.
How a single AWS region's EBS failure took every layer of Heroku down with it — dynos, databases, and the management API — and why Heroku's recovery had to wait for AWS: the management plane it would have used to move workloads was running on the same failing storage.
In 2011, Heroku's platform had one dominant failure domain: AWS US-East EBS. Dynos — the lightweight containers that held customer applications — lived on EC2 instances backed by EBS volumes in that region. Heroku Postgres databases, shared and dedicated, used EBS as their storage. The management API customers called to deploy, scale, and restart applications depended on the same EBS-backed infrastructure, as did the internal tooling Heroku engineers used to triage and operate the platform. As long as US-East EBS served I/O quickly and reliably, every layer worked. If that storage layer failed broadly enough, Heroku had no independent place from which to run the platform or manage its recovery.
On the morning of April 21, 2011, AWS's US-East EBS entered a re-mirroring storm. The trigger was on AWS's side — a network configuration change during a scaling operation that sent traffic to a lower-capacity path, causing a wave of EBS volumes to seek new mirrors and saturate the network they needed to do so. Heroku felt it as I/O slowing down everywhere at once. Within an hour and a half, dynos were exiting and could not restart. Within two hours, Postgres databases were unreachable. Within three, Heroku's own management API was unreliable enough that operators could not use standard tooling to do anything about it.
There was nowhere to fail over to. Heroku had no capacity outside US-East. Customer databases had no region-independent recovery path: backups existed, but they were backups of the same storage that was now broken, and there was no warm standby in another region that could take over while EBS recovered. The management plane that operators would have used to migrate work or coordinate a partial recovery was itself on the failing storage. Engineers could communicate, prioritize, and stage what they could, but the recovery clock was set by AWS.
As AWS restored portions of the EBS fleet through the morning, Heroku's management tooling came back enough to begin triaging. Shared Postgres databases and the largest dyno groups were prioritized. About 12 hours after the initial impact, Heroku's primary services were substantially back as AWS completed EBS recovery in the main impact zone. The long tail lasted for days. Volumes that AWS could not bring back had to be replaced with snapshot restores. A small number of customer volumes were permanently lost on the AWS side and rebuilt from Heroku's own backups.
The triggering event was not Heroku's. The scope of Heroku's impact was. Running every layer of a platform inside a single region's storage system makes the platform's worst-case recovery time equal to that storage system's worst-case recovery time. The mechanism worth carrying away is not "AWS had a bad day". It is that a platform's recovery options are decided long before the incident — by how many failure domains the platform spans, by whether its management plane is independent of its data plane, and by whether customer state has a path back that does not depend on the system that is currently broken. Heroku had none of those options on April 21, and the recovery looked like it.
"Our service is heavily dependent on Amazon Web Services, particularly Elastic Block Storage. When EBS suffered a severe outage, many parts of our service ground to a halt."
Heroku, Post-mortem on April 21 outage
An AWS network configuration change in US-East triggers a mass EBS re-mirroring event. EBS performance degrades sharply across the region. Heroku's dynos and Heroku Postgres databases are EBS-backed and start to feel it within minutes.
Apr 21, ~01:30 PDT
Dynos begin failing
Heroku's dynos run on EC2 instances backed by EBS volumes. As EBS I/O degrades, application containers exit, restarts fail, and the scheduler cannot place new dynos. Customers see their apps stop responding.
Apr 21, ~02:00 PDT
Heroku Postgres databases unavailable
Heroku's shared and dedicated Postgres databases are also EBS-backed. Database I/O stalls. Connections time out and queries fail. Customers depending on Heroku Postgres lose access to their data.
Apr 21, ~03:00 PDT
Management plane loses ability to remediate
Heroku's own platform API — the surface customers use to deploy, scale, and restart applications, and the surface Heroku operators use to triage — runs on the same EBS-backed infrastructure. Standard tooling becomes unreliable. There is no other region to fail over to.
Apr 21, ~06:00 PDT
AWS partial recovery; triage begins
As AWS recovers parts of the EBS fleet, Heroku regains limited use of its management tooling and begins prioritizing recovery for shared Postgres databases and high-traffic dyno groups.
Apr 21, ~12:45 PDT
Primary services substantially restored
Heroku's dyno grid and most Postgres databases are back as AWS completes EBS recovery in the primary impact zone, ending roughly 12 hours of primary impact. Some customers with severely degraded volumes still require additional time.
Apr 22 – Apr 24
Long tail of individual customer recovery
AWS continues to manually recover EBS volumes that remained stuck in the re-mirroring loop. Heroku works through affected customers individually, restoring databases from snapshots where AWS could not bring the volume back.
Apr 24
Most volumes recovered across the platform
AWS reports the majority of affected EBS volumes recovered. A small number of customers with unrecoverable volumes are restored from Heroku's own snapshot backups. Heroku publishes its post-mortem.
A platform that ran entirely on one region's storage.
The triggering event was not Heroku's. AWS's US-East EBS storage entered a re-mirroring storm on the morning of April 21, 2011, and storage I/O across the region degraded sharply. The scope of Heroku's impact was its own decision: every layer of the platform — dynos, Heroku Postgres databases, and the management API — ran exclusively in AWS US-East, on EBS-backed infrastructure. When the region failed, the platform failed with it.
Heroku's own operational tooling lived on the same EBS-backed infrastructure it was supposed to manage. The platform API that customers used to deploy and scale, and the internal tooling Heroku operators used to triage and recover customer workloads, ran on the storage layer that was failing. With the management plane degraded, engineers could not use standard procedures to move work elsewhere even if other capacity had been available — and it was not.
Because Heroku had no working capacity in another region and no independent management plane, recovery was not bottlenecked by diagnosis, planning, or engineering work. It was bottlenecked by AWS restoring its own storage. For about 12 hours of primary impact and days of long-tail customer recovery, Heroku's recovery timeline equaled AWS's recovery timeline.
01
The entire platform ran in one AWS region
Every component of the Heroku platform — dyno scheduling, Heroku Postgres, the management API — ran exclusively in AWS US-East. There was no capacity in another region or on another provider to absorb the failure or even to host an emergency control plane while US-East recovered.
02
Management plane co-located with customer workloads
Heroku's operational tooling ran on the same EBS-backed infrastructure it was meant to manage. When the data plane degraded, the control plane degraded with it. Engineers had to do work that depended on the very system that was failing.
03
Heroku Postgres had no region-independent recovery path
Customer databases were backed by EBS in the same region as the application. There was no cross-region replication, no continuous archival to independent storage, and no warm standby that could take over when EBS could not serve I/O. Recovery for a stuck volume meant either AWS restoring it or restoring from a snapshot, neither of which was fast.
04
Recovery time depended on a third party
Without a path to migrate workloads or to restore from independent storage, Heroku's recovery clock was AWS's recovery clock. Heroku could communicate, prioritize, and stage the long-tail customer work, but the floor on duration came from outside the platform.
05
Vendor SLA was not the same as Heroku's SLA
AWS's storage SLA covered AWS's storage. It did not cover the experience of Heroku's customers, who had bought platform reliability from Heroku. The mismatch between a provider's SLA and a platform's SLA only becomes visible during a long upstream incident, and at that point it is too late to design around.
01
A platform must keep working capacity outside any single failure domain.
Running a PaaS entirely inside one cloud provider region inherits all of that region's failure modes. Maintaining warm capacity in another region — even at reduced scale — gives the platform a place to move work and a place to operate from while the primary recovers. A sketch of DNS-level failover to a warm standby appears after this list.
02
Keep the management plane out of the data plane it manages.
If the tools used to recover a platform run on the infrastructure that is failing, the recovery cannot start until the infrastructure starts coming back. The control plane needs independent hosting, independent storage, and an independent failure profile so engineers can act when the customer-facing layer cannot. A sketch of an external probe that detects this kind of co-location appears after this list.
03
Customer databases need a recovery path that does not depend on the failing storage.
Cross-region replication, continuous archival of write-ahead logs to independent storage, or warm standbys give a database a way back when its primary storage cannot serve I/O. Backups in the same failure domain as the live database are only useful if that domain comes back. A sketch of shipping WAL segments to out-of-region storage appears after this list.
04
Plan the recovery you control, not the SLA you bought.
A provider's SLA is a billing arrangement, not a recovery plan. The recovery time customers actually experience is the recovery time the platform can achieve with the tools it owns. Design and rehearse the recovery path against worst-case provider outages, not against the provider's expected reliability.
05
Communicate honestly about which timeline you are on.
When a platform's recovery is gated entirely on a third party, the cleanest customer message is the truthful one: "our recovery is waiting on the upstream provider; here is what we are doing in the meantime." Pretending the platform is independently restoring service when it is not damages the trust the long tail of recovery will need.
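To make the first lesson concrete, here is a minimal sketch of DNS-level failover between a primary region and a warm standby, using boto3 and Route 53 failover records. The hosted zone ID, health check ID, hostname, and IP addresses are placeholders, and this illustrates the pattern rather than anything Heroku ran; the essential property is that the standby, and the records that promote it, live outside the primary region's failure domain.

```python
# Hypothetical sketch: DNS failover between a primary region and a warm standby.
# Assumes boto3 credentials, an existing Route 53 hosted zone, and a health
# check watching the primary region's entry point. All names, IDs, and
# addresses are placeholders, not Heroku's actual configuration.
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000EXAMPLE"           # placeholder hosted zone ID
PRIMARY_HEALTH_CHECK_ID = "hc-primary-0000"  # health check on the primary region


def upsert_failover_records():
    """Point api.example.com at the primary region, with a warm standby in
    another region that Route 53 promotes automatically when the health
    check on the primary fails."""
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": "Primary in us-east-1, warm standby in us-west-2",
            "Changes": [
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "api.example.com",
                        "Type": "A",
                        "SetIdentifier": "primary-us-east-1",
                        "Failover": "PRIMARY",
                        "TTL": 60,
                        "ResourceRecords": [{"Value": "198.51.100.10"}],
                        "HealthCheckId": PRIMARY_HEALTH_CHECK_ID,
                    },
                },
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "api.example.com",
                        "Type": "A",
                        "SetIdentifier": "standby-us-west-2",
                        "Failover": "SECONDARY",
                        "TTL": 60,
                        "ResourceRecords": [{"Value": "203.0.113.20"}],
                    },
                },
            ],
        },
    )


if __name__ == "__main__":
    upsert_failover_records()
```

The low TTL matters: once the health check on the primary fails, clients pick up the standby address within about a minute instead of waiting out a long cache.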
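For the second lesson, a simple way to find out whether the control plane really has an independent failure profile is to watch both planes from somewhere else. The sketch below is a hypothetical external probe, assuming the requests library and placeholder URLs for a canary application (data plane) and a management API (control plane); both failing together is the signature of a control plane hosted on the infrastructure it is supposed to recover.

```python
# Hypothetical correlation probe, run from infrastructure outside the primary
# region. It checks a canary app (data plane) and the management API (control
# plane) and flags windows where both fail at once. The URLs are placeholders.
import time
import requests

DATA_PLANE_CANARY = "https://canary-app.example.com/health"
CONTROL_PLANE_API = "https://api.platform.example.com/health"


def check(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with a 2xx status within the timeout."""
    try:
        return 200 <= requests.get(url, timeout=timeout).status_code < 300
    except requests.RequestException:
        return False


def probe_forever(interval_seconds: int = 60):
    while True:
        data_ok = check(DATA_PLANE_CANARY)
        control_ok = check(CONTROL_PLANE_API)
        if not data_ok and not control_ok:
            # Both planes down together: the management plane shares the
            # failure domain of the workloads it is supposed to recover.
            print("ALERT: correlated failure - control plane is not independent")
        elif not data_ok:
            print("Data plane degraded; control plane still reachable")
        elif not control_ok:
            print("Control plane degraded; data plane still serving")
        time.sleep(interval_seconds)


if __name__ == "__main__":
    probe_forever()
```

Run from outside the primary region, the probe keeps reporting even when the region it watches cannot, which is exactly the visibility Heroku's operators lacked on April 21.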
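For the third lesson, continuous WAL archival is the simplest region-independent recovery path for Postgres. The sketch below is a hypothetical archive_command wrapper, assuming boto3 and a placeholder bucket created in a different region; Postgres invokes it once per completed WAL segment, and a non-zero exit code makes Postgres retry rather than discard the segment.

```python
# Hypothetical WAL archiver for Postgres, invoked from postgresql.conf as:
#   archive_command = 'python /opt/wal_archive.py %p %f'
# Postgres runs archive_command with the data directory as the working
# directory, so the relative %p path resolves. Each completed segment is
# copied to object storage in a *different* region, so database recovery does
# not depend on the storage that just failed. Bucket and region are placeholders.
import sys

import boto3
from botocore.exceptions import BotoCoreError, ClientError

ARCHIVE_BUCKET = "example-wal-archive"  # bucket created in a different region
ARCHIVE_REGION = "us-west-2"            # independent of the primary (us-east-1)


def archive_wal(segment_path: str, segment_name: str) -> int:
    """Upload one WAL segment; return 0 on success so Postgres can recycle it,
    non-zero on failure so Postgres retries instead of losing the segment."""
    s3 = boto3.client("s3", region_name=ARCHIVE_REGION)
    try:
        s3.upload_file(segment_path, ARCHIVE_BUCKET, f"wal/{segment_name}")
        return 0
    except (BotoCoreError, ClientError) as exc:
        print(f"WAL archive failed for {segment_name}: {exc}", file=sys.stderr)
        return 1


if __name__ == "__main__":
    path, name = sys.argv[1], sys.argv[2]
    sys.exit(archive_wal(path, name))
```

Paired with periodic base backups shipped to the same out-of-region bucket, the archive allows point-in-time recovery onto fresh hardware even while the primary region's storage is down.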