The library. Every incident, structured.
A growing archive of public postmortems, broken down into a consistent shape: what broke, why it cascaded, and what to take from it. New incidents added regularly.
29+ incidents · 11+ years · 13 organizations
4 results · filtered
topic: cloud
id
incident
org
date
duration
severity
tags
FM-007
The Cleanup Script That Deleted 883 Atlassian Sites
A maintenance script meant to deactivate a deprecated standalone app instead permanently deleted full customer sites. 775 customers lost access to their Jira and Confluence data, and bringing them back took up to two weeks.
Atlassian
2022-04-05
14d
SEV-1
jira, confluence, opsgenie
FM-014
The Automation Bug That Took Google's Network Control Plane Offline
A bug in Google's datacenter maintenance automation descheduled the network control plane in multiple physical locations at once. BGP routes were withdrawn within minutes, and traffic flowed onto an oversubscribed fail-static path until engineers could rebuild the configuration.
Google Cloud
2019-06-02
4h 25m
SEV-1
networking, automation, bgp
FM-015
The Impossible Date That Broke Azure VM Startup
A leap-day bug stopped new Azure VMs from joining the control plane globally, then a rushed recovery update disconnected VMs in seven clusters.
Microsoft Azure
2012-02-29
34h 15m
SEV-1
leap-year, vm-startup, certificate
FM-012
When Heroku's Whole Platform Shared One AWS Failure Domain
When AWS's US-East EBS storage entered a re-mirroring storm, Heroku's dynos, Heroku Postgres databases, and management API all failed together. The platform had no other region to run in and no path to recover without AWS.
Heroku
2011-04-21
~3 days
SEV-1
aws, ebs, us-east