The library. Every incident, structured.
A growing archive of public postmortems, broken down into a consistent shape: what broke, why it cascaded, and what to take from it. New incidents added regularly.
29+ incidents · 11+ years · 13 organizations
4 results · filtered
topic: database
id
incident
org
date
duration
severity
tags
FM-007
The Cleanup Script That Deleted 883 Atlassian Sites
A maintenance script meant to deactivate a deprecated standalone app instead permanently deleted full customer sites. 775 customers lost access to their Jira and Confluence data, and bringing them back took up to two weeks.
Atlassian
2022-04-05
14d
SEV-1
jira, confluence, opsgenie
FM-011
The Consul Restart That Turned Slack's Cache Cold
An incremental Consul agent upgrade caused memcached nodes to be deregistered and replaced. The replacements came up empty, cache hit rates collapsed, and scatter queries from the cold cache overloaded the database.
Slack
2022-02-22
~5h
SEV-2
consul, memcached, cache
FM-002
43 Seconds of Split-Brain at GitHub
A 43-second network partition between GitHub's East and West Coast sites tripped automatic failover. By the time the partition healed, both coasts had taken writes and reconciling the split took most of a day.
GitHub
2018-10-21
24h 11m
SEV-1
database, mysql, replication
FM-006
The `rm -rf` That Erased GitLab's Production Database
A sysadmin accidentally deleted GitLab.com's production PostgreSQL database. The normal backups were broken or unsuitable, so GitLab restored from a six-hour-old LVM snapshot.
GitLab
2017-01-31
18h 30m
SEV-1
database, postgresql, backup