Library — Failure Modes — Failure Modes

/

sort

newsletter

Get the next incident.

One production failure case study every week. Read the pattern before it shows up in your own system.

3 results · filtered

topic: cascade

id

incident

org

date

duration

severity

tags

The Telemetry Rollout That Took Down ChatGPTA new telemetry service deployed across OpenAI's Kubernetes clusters generated API operations whose cost scaled with cluster size. The control plane saturated, DNS-based service discovery broke, and the same overload kept the team from rolling the change back.

chatgptapikubernetes

When Heroku's Whole Platform Shared One AWS Failure DomainWhen AWS's US-East EBS storage entered a re-mirroring storm, Heroku's dynos, Heroku Postgres databases, and management API all failed together. The platform had no other region to run in and no path to recover without AWS.

The EBS Self-Repair Storm That Couldn't Stop ItselfA network change in US-East shifted traffic to a low-capacity path. EBS nodes that lost their replication connection began re-mirroring at the same time, exhausting free capacity and stranding volumes in a loop the cluster could not exit on its own.