Library — Failure Modes — Failure Modes

/

sort

newsletter

Get the next incident.

One production failure case study every week. Read the pattern before it shows up in your own system.

4 results · filtered

topic: cloud

id

incident

org

date

duration

severity

tags

The Cleanup Script That Deleted 883 Atlassian SitesA maintenance script meant to deactivate a deprecated standalone app instead permanently deleted full customer sites. 775 customers lost access to their Jira and Confluence data, and bringing them back took up to two weeks.

jiraconfluenceopsgenie

The Automation Bug That Took Google's Network Control Plane OfflineA bug in Google's datacenter maintenance automation descheduled the network control plane in multiple physical locations at once. BGP withdrew within minutes, and traffic flowed onto an oversubscribed fail-static path until engineers could rebuild the configuration.

networkingautomationbgp

The Impossible Date That Broke Azure VM StartupA leap-day bug stopped new Azure VMs from joining the control plane globally, then a rushed recovery update disconnected VMs in seven clusters.

Microsoft Azure

leap-yearvm-startupcertificate

When Heroku's Whole Platform Shared One AWS Failure DomainWhen AWS's US-East EBS storage entered a re-mirroring storm, Heroku's dynos, Heroku Postgres databases, and management API all failed together. The platform had no other region to run in and no path to recover without AWS.