Library: Failure Modes

4 results, filtered by topic: database

The Cleanup Script That Deleted 883 Atlassian Sites

A maintenance script meant to deactivate a deprecated standalone app instead permanently deleted full customer sites. 775 customers lost access to their Jira and Confluence data, and bringing them back took up to two weeks.

tags: jira, confluence, opsgenie

The Consul Restart That Turned Slack's Cache Cold

An incremental Consul agent upgrade caused memcached nodes to be deregistered and replaced. The replacements came up empty, cache hit rates collapsed, and scatter queries from the cold cache overloaded the database.

tags: consul, memcached, cache

43 Seconds of Split-Brain at GitHub

A 43-second network partition between GitHub's East and West Coast sites tripped automatic failover. By the time the partition healed, both coasts had taken writes, and reconciling the split took most of a day.

tags: database, mysql, replication

The `rm -rf` That Erased GitLab's Production Database

A sysadmin accidentally deleted GitLab.com's production PostgreSQL database. The normal backups were broken or unsuitable, so GitLab restored from a six-hour-old LVM snapshot.

tags: database, postgresql, backup