The library. Every incident, structured.

A growing archive of public postmortems, broken down into a consistent shape: what broke, why it cascaded, and what to take from it. New incidents added regularly.

28+ incidents
11+ years
13 organizations

28 results
id
incident
org
date
duration
severity
tags
FM-018
The Overheated AWS Zone
A thermal event in one US-EAST-1 data center impaired EC2 instances and EBS volumes in use1-az4, disrupting workloads that depended on resources pinned to the affected Availability Zone.
AWS
2026-05-08
20h 30m
SEV-2
us-east-1, use1-az4, thermal-event
FM-019
The Encryption Path Under Slack Messages
Slack EKM customers experienced message sending, channel loading, workflow, notification, DM, and file-operation issues after elevated encryption-key request load turned a security dependency into an availability bottleneck.
Slack
2026-05-08
3h 21m
SEV-2
enterprise-key-management, kms, caching
FM-017
The DNSSEC Failure That Made .de Look Fake
Incorrect DNSSEC signatures for Germany's .de top-level domain caused validating resolvers to reject .de answers, leading Cloudflare to temporarily bypass DNSSEC validation for the zone.
Cloudflare
2026-05-05
2h 47m
SEV-2
denic, de-dnssec, dns
FM-016
The Search Layer That Slowed GitHub
A concentrated wave of anonymous scraping traffic saturated the load-balancing tier in front of GitHub Search, causing timeouts across issues, pull requests, repositories, Actions, packages, and Dependabot alerts.
GitHub
2026-04-27
6h 31m
SEV-2
search, scraping, load-balancing
FM-029
The Silent Merge Queue Corruption That Hit 658 GitHub Repos
A half-gated feature flag let an unreleased merge-base path escape into squash merge groups. Over a 4h 38m impact window, GitHub's merge queue produced valid-looking commits that silently reverted prior work across 658 repositories and 2,092 pull requests.
GitHub
2026-04-23
4h 38m
SEV-1
merge-queue, feature-flag, squash-merge
FM-026
The WAF Killswitch That Crashed the Older Proxy
A global WAF testing-tool killswitch exposed an FL1 proxy bug, returning HTTP 500s for sites using the older proxy and Managed Ruleset.
Cloudflare
2025-12-05
25m
SEV-2
waf, config, proxy
FM-022
The Bot File That Crashed Cloudflare's Proxy
A ClickHouse permissions change duplicated Bot Management feature rows, producing an oversized file that crashed Cloudflare proxy traffic paths.
Cloudflare
2025-11-18
5h 38m
SEV-1
config, waf, bot-management
FM-024
Azure Front Door config crashes edge sites
Incompatible Azure Front Door customer metadata exposed a data-plane bug, crashing edge sites and causing connection timeouts and DNS errors globally.
Microsoft Azure
2025-10-29
8h 24m
SEV-1
config, edge, cdn
FM-020
The DynamoDB DNS Race That Emptied US-EAST-1
A DynamoDB DNS automation race emptied the US-EAST-1 regional endpoint, then cascaded into EC2 launch, NLB, Lambda, and console failures.
AWS
2025-10-19
14h 32m
SEV-1
dns, automation, dynamodb
FM-023
Service Control quota update rejects APIs
A quota policy update with blank fields crashed Google Service Control globally, causing 503s across Google Cloud, Workspace, and Security Operations APIs.
Google Cloud
2025-06-12
3h
SEV-1
config, quota, spanner