The library. Every incident, structured.
A growing archive of public postmortems, broken down into a consistent shape: what broke, why it cascaded, and what to take from it. New incidents added regularly.
28+
incidents
11+
years
13
organizations
/
sort
3 results · filtered
topic: networking
id
incident
org
date
duration
severity
tags
FM-004
The Day Facebook Deleted Its Own Route to the InternetA backbone command issued to assess global capacity unintentionally took down all of Facebook's backbone. The audit tool that was supposed to block such a command had a bug, and the DNS that announced Facebook to the world withdrew itself in response.
Facebook
2021-10-04
~6h
SEV-1
bgpdnsnetworking
FM-014
The Automation Bug That Took Google's Network Control Plane OfflineA bug in Google's datacenter maintenance automation descheduled the network control plane in multiple physical locations at once. BGP withdrew within minutes, and traffic flowed onto an oversubscribed fail-static path until engineers could rebuild the configuration.
Google Cloud
2019-06-02
4h 25m
SEV-1
networkingautomationbgp
FM-013
The EBS Self-Repair Storm That Couldn't Stop ItselfA network change in US-East shifted traffic to a low-capacity path. EBS nodes that lost their replication connection began re-mirroring at the same time, exhausting free capacity and stranding volumes in a loop the cluster could not exit on its own.
AWS
2011-04-21
~4 days
SEV-1
ebsrdsus-east