The library. Every incident, structured.

A growing archive of public postmortems, broken down into a consistent shape: what broke, why it cascaded, and what to take from it. New incident every Tuesday.

15+ incidents · 9+ years · 1.3k+ subscribers
Tags: all (15) · automation (1) · backup (1) · bgp (1) · blast-radius (1) · cache (1) · cascade (4) · cdn (1) · cloud (5) · config (2) · control-plane (1) · database (3) · datacenter (1) · deploy (5) · dns (1) · failover (1) · ha (1) · kubernetes (1) · networking (4) · operator-error (2) · replication (1) · scaling (1) · service-discovery (1) · storage (3) · tooling (1) · waf (1)
Organizations: all (15) · AWS (1) · Amazon Web Services (1) · Atlassian (1) · Cloudflare (2) · Facebook (1) · Fastly (1) · GitHub (1) · GitLab (1) · Google Cloud (1) · Heroku (1) · Microsoft Azure (1) · OpenAI (1) · Slack (2)
15 results
FM-009 · OpenAI · 2024-12-11 · 4h 22m · SEV-1
A telemetry rollout takes down ChatGPT for four hours
A new telemetry service deployed across OpenAI's Kubernetes clusters generated API operations whose cost scaled with cluster size. The control plane saturated, DNS-based service discovery broke, and the same overload kept the team from rolling the change back.
Tags: deploy, kubernetes, cascade
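The scaling trap in this entry can be sketched with a toy model (hypothetical numbers and function names, not OpenAI's actual telemetry service): if every node runs an agent whose poll returns every pod in the cluster, total control-plane load grows with the square of cluster size, so the largest clusters saturate first.

```python
# Hypothetical model of per-node agents whose API cost scales with
# cluster size (illustrative only, not the real telemetry service).
def control_plane_load(nodes: int, pods_per_node: int, polls_per_min: int = 1) -> int:
    """Objects the API server must serve per minute."""
    cluster_pods = nodes * pods_per_node          # every poll returns the whole cluster
    return nodes * polls_per_min * cluster_pods   # one poll per node per interval

# 10x the nodes means 100x the load: O(nodes^2) in disguise.
print(control_plane_load(nodes=100, pods_per_node=10))    # -> 100000
print(control_plane_load(nodes=1_000, pods_per_node=10))  # -> 10000000
```

A per-node agent that queried only its own node's pods would keep the total cost linear in cluster size.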
FM-008 · Cloudflare · 2023-11-02 · ~36h · SEV-1
Cloudflare's control plane loses its primary facility
A cascading power failure took out Cloudflare's primary control plane facility. The high-availability cluster did not survive the loss of one of its three sites, and the dashboard, API, and analytics went down while the data plane kept serving customer traffic.
Tags: datacenter, ha, control-plane
FM-007 · Atlassian · 2022-04-05 · 14d · SEV-1
A maintenance script deletes 883 customer sites
A maintenance script meant to deactivate a deprecated standalone app instead permanently deleted full customer sites. 775 customers lost access to their Jira and Confluence data, and bringing them back took up to two weeks.
Tags: cloud, database, operator-error
FM-011 · Slack · 2022-02-22 · ~5h · SEV-2
A Consul agent restart empties Slack's cache
An incremental Consul agent upgrade caused memcached nodes to be deregistered and replaced. The replacements came up empty, cache hit rates collapsed, and scatter queries from the cold cache overloaded the database.
Tags: cache, service-discovery, deploy
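A minimal cache-aside sketch (hypothetical code, not Slack's memcached deployment) shows why a cache node that is replaced by an empty one turns every request into a database read until the keys repopulate:

```python
# Toy cache-aside pattern: misses fall through to the database.
class Cache:
    def __init__(self):
        self.store = {}
    def get(self, key):
        return self.store.get(key)
    def set(self, key, value):
        self.store[key] = value

db_reads = 0

def fetch(cache, key):
    global db_reads
    value = cache.get(key)
    if value is None:            # miss: scatter to the database
        db_reads += 1
        value = f"row-{key}"     # stand-in for a real DB query
        cache.set(key, value)
    return value

warm = Cache()
for k in range(100):
    fetch(warm, k)               # populate: 100 DB reads
baseline = db_reads
for k in range(100):
    fetch(warm, k)               # warm traffic: zero new DB reads
assert db_reads == baseline

cold = Cache()                   # "the replacement came up empty"
for k in range(100):
    fetch(cold, k)               # every request now hits the database
```

The database sees its normal miss trickle replaced by the full request rate at once, which is the overload described above.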
FM-004 · Facebook · 2021-10-04 · ~6h · SEV-1
Facebook withdraws its own DNS from the internet
A backbone command issued to assess global capacity unintentionally took down all of Facebook's backbone. The audit tool that was supposed to block such a command had a bug, and the DNS that announced Facebook to the world withdrew itself in response.
Tags: networking, bgp, dns
FM-005 · Fastly · 2021-06-08 · ~1h · SEV-1
A latent CDN bug, woken by a valid config change
A software release shipped 27 days earlier left a latent bug in Fastly's edge platform. A routine, valid customer configuration change triggered it, and 85% of Fastly's network began returning errors within seconds.
Tags: cdn, deploy, config
FM-010 · Slack · 2021-01-04 · ~3h 40m · SEV-2
Slack's first day back: a Transit Gateway runs out of room
On the first Monday after the holiday break, an AWS Transit Gateway saturated under Slack's return-to-work traffic. Packet loss hit the web tier just as autoscaling tried to add 1,200 instances, and the provisioning service collapsed under its own quota and file-descriptor limits.
Tags: cloud, scaling, cascade
FM-001 · Cloudflare · 2019-07-02 · 27m · SEV-1
A WAF rule pegs every Cloudflare CPU at once
A new managed WAF rule contained a regex that backtracked exponentially on live HTTP traffic, spiking CPU to nearly 100% across every edge server worldwide within seconds of deployment.
Tags: networking, deploy, waf
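Catastrophic backtracking is easy to reproduce. A hypothetical pattern (not Cloudflare's actual WAF rule) with nested quantifiers forces the engine to try exponentially many ways to split the input before it can report failure:

```python
import re
import time

# (a+)+ nested inside anchors is a classic catastrophic-backtracking
# pattern: on a near-match, every partition of the 'a's must be tried.
EVIL = re.compile(r"^(a+)+$")

def time_failure(n: int) -> float:
    subject = "a" * n + "b"            # trailing 'b' guarantees the match fails
    start = time.perf_counter()
    assert EVIL.match(subject) is None
    return time.perf_counter() - start

# Adding ten characters multiplies the runtime rather than adding to it,
# which is how a single rule can peg a CPU within seconds of deployment.
fast = time_failure(10)
slow = time_failure(20)
```

Engines with linear-time guarantees (RE2, Rust's regex crate) refuse nested-quantifier blowups by construction, which is one of the mitigations Cloudflare adopted after this incident.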
FM-014 · Google Cloud · 2019-06-02 · 4h 25m · SEV-1
An automation bug deschedules Google's network control plane
A bug in Google's datacenter maintenance automation descheduled the network control plane in multiple physical locations at once. BGP withdrew within minutes, and traffic flowed onto an oversubscribed fail-static path until engineers could rebuild the configuration.
Tags: networking, automation, cloud
FM-002 · GitHub · 2018-10-21 · 24h 11m · SEV-1
A 43-second partition splits GitHub's database for a day
A 43-second network partition between GitHub's East and West Coast sites tripped automatic failover. By the time the partition healed, both coasts had taken writes, and reconciling the split took most of a day.
Tags: database, replication, failover
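The danger of automated failover during a brief partition can be sketched as a toy model (hypothetical data, not GitHub's MySQL topology): once both sides have accepted a write, there is no fast-forward path back, and every divergent row must be reconciled by hand or by policy.

```python
# Toy split-brain model: two replicas diverge during a partition.
east = {"seq": 100}                  # primary before the partition
west = dict(east)                    # in-sync replica

# 43-second partition: orchestration promotes west, but east is
# still up and still reachable by some clients.
east["seq"] += 1
east["order-A"] = "paid"             # write lands on the old primary
west["seq"] += 1
west["order-B"] = "shipped"          # write lands on the new primary

# Partition heals: identical sequence numbers, different contents,
# so neither side can simply replay the other's log.
assert east["seq"] == west["seq"]
assert east != west                  # this divergence took most of a day to untangle
```

Requiring the old primary to fence itself (or demanding a quorum ack before any post-failover write) is the standard way to keep one side of the partition read-only.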
Page 1 of 2