The library. Every incident, structured.
A growing archive of public postmortems, broken down into a consistent shape: what broke, why it cascaded, and what to take from it. New incidents added regularly.
28+ incidents
11+ years
13 organizations
2 results · filtered
topic: service-discovery
id
incident
org
date
duration
severity
tags
FM-009
The Telemetry Rollout That Took Down ChatGPT
A new telemetry service deployed across OpenAI's Kubernetes clusters generated API operations whose cost scaled with cluster size. The control plane saturated, DNS-based service discovery broke, and the same overload kept the team from rolling the change back.
OpenAI
2024-12-11
4h 22m
SEV-1
chatgpt, api, kubernetes
FM-011
The Consul Restart That Turned Slack's Cache Cold
An incremental Consul agent upgrade caused memcached nodes to be deregistered and replaced. The replacements came up empty, cache hit rates collapsed, and scatter queries from the cold cache overloaded the database.
Slack
2022-02-22
~5h
SEV-2
consul, memcached, cache