The library. Every incident, structured.
A growing archive of public postmortems, broken down into a consistent shape: what broke, why it cascaded, and what to take from it. New incidents added regularly.
28+
incidents
11+
years
13
organizations
/
sort
1 result · filtered
OpenAI
id
incident
org
date
duration
severity
tags
FM-009
The Telemetry Rollout That Took Down ChatGPTA new telemetry service deployed across OpenAI's Kubernetes clusters generated API operations whose cost scaled with cluster size. The control plane saturated, DNS-based service discovery broke, and the same overload kept the team from rolling the change back.
OpenAI
2024-12-11
4h 22m
SEV-1
chatgptapikubernetes