The library. Every incident, structured.
A growing archive of public postmortems, broken down into a consistent shape: what broke, why it cascaded, and what to take from it. New incidents added regularly.
28+
incidents
11+
years
13
organizations
/
sort
3 results · filtered
topic: cascade
id
incident
org
date
duration
severity
tags
FM-009
The Telemetry Rollout That Took Down ChatGPTA new telemetry service deployed across OpenAI's Kubernetes clusters generated API operations whose cost scaled with cluster size. The control plane saturated, DNS-based service discovery broke, and the same overload kept the team from rolling the change back.
OpenAI
2024-12-11
4h 22m
SEV-1
chatgptapikubernetes
FM-012
When Heroku's Whole Platform Shared One AWS Failure DomainWhen AWS's US-East EBS storage entered a re-mirroring storm, Heroku's dynos, Heroku Postgres databases, and management API all failed together. The platform had no other region to run in and no path to recover without AWS.
Heroku
2011-04-21
~3 days
SEV-1
awsebsus-east
FM-013
The EBS Self-Repair Storm That Couldn't Stop ItselfA network change in US-East shifted traffic to a low-capacity path. EBS nodes that lost their replication connection began re-mirroring at the same time, exhausting free capacity and stranding volumes in a loop the cluster could not exit on its own.
AWS
2011-04-21
~4 days
SEV-1
ebsrdsus-east