The library. Every incident, structured.

A growing archive of public postmortems, broken down into a consistent shape: what broke, why it cascaded, and what to take from it. New incident every Tuesday.

15+ incidents · 9+ years · 1.3k+ subscribers
Tags: all (15) · automation (1) · backup (1) · bgp (1) · blast-radius (1) · cache (1) · cascade (4) · cdn (1) · cloud (5) · config (2) · control-plane (1) · database (3) · datacenter (1) · deploy (5) · dns (1) · failover (1) · ha (1) · kubernetes (1) · networking (4) · operator-error (2) · replication (1) · scaling (1) · service-discovery (1) · storage (3) · tooling (1) · waf (1)
Organizations: all (15) · AWS (1) · Amazon Web Services (1) · Atlassian (1) · Cloudflare (2) · Facebook (1) · Fastly (1) · GitHub (1) · GitLab (1) · Google Cloud (1) · Heroku (1) · Microsoft Azure (1) · OpenAI (1) · Slack (2)
15 results
FM-009 · OpenAI · 2024-12-11 · 4h 22m · SEV-1
A telemetry rollout takes down ChatGPT for four hours
A new telemetry service deployed across OpenAI's Kubernetes clusters generated API operations whose cost scaled with cluster size. The control plane saturated, DNS-based service discovery broke, and the same overload kept the team from rolling the change back.
Tags: deploy, kubernetes, cascade
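The scaling trap in this entry can be sketched with a toy model (hypothetical numbers and function names, not OpenAI's actual telemetry service): if every node runs an agent whose poll returns every pod in the cluster, total control-plane load grows with the square of cluster size, so the largest clusters saturate first.

```python
# Hypothetical model of per-node agents whose API cost scales with
# cluster size (illustrative only, not the real telemetry service).
def control_plane_load(nodes: int, pods_per_node: int, polls_per_min: int = 1) -> int:
    """Objects the API server must serve per minute."""
    cluster_pods = nodes * pods_per_node          # every poll returns the whole cluster
    return nodes * polls_per_min * cluster_pods   # one poll per node per interval

# 10x the nodes means 100x the load: O(nodes^2) in disguise.
print(control_plane_load(nodes=100, pods_per_node=10))    # -> 100000
print(control_plane_load(nodes=1_000, pods_per_node=10))  # -> 10000000
```

A per-node agent that queried only its own node's pods would keep the total cost linear in cluster size.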
FM-008 · Cloudflare · 2023-11-02 · ~36h · SEV-1
Cloudflare's control plane loses its primary facility
A cascading power failure took out Cloudflare's primary control plane facility. The high-availability cluster did not survive the loss of one of its three sites, and the dashboard, API, and analytics went down while the data plane kept serving customer traffic.
Tags: datacenter, ha, control-plane
FM-007 · Atlassian · 2022-04-05 · 14d · SEV-1
A maintenance script deletes 883 customer sites
A maintenance script meant to deactivate a deprecated standalone app instead permanently deleted full customer sites. 775 customers lost access to their Jira and Confluence data, and bringing them back took up to two weeks.
Tags: cloud, database, operator-error
FM-011 · Slack · 2022-02-22 · ~5h · SEV-2
A Consul agent restart empties Slack's cache
An incremental Consul agent upgrade caused memcached nodes to be deregistered and replaced. The replacements came up empty, cache hit rates collapsed, and scatter queries from the cold cache overloaded the database.
Tags: cache, service-discovery, deploy
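A minimal cache-aside sketch (hypothetical code, not Slack's memcached deployment) shows why a cache node that is replaced by an empty one turns every request into a database read until the keys repopulate:

```python
# Toy cache-aside pattern: misses fall through to the database.
class Cache:
    def __init__(self):
        self.store = {}
    def get(self, key):
        return self.store.get(key)
    def set(self, key, value):
        self.store[key] = value

db_reads = 0

def fetch(cache, key):
    global db_reads
    value = cache.get(key)
    if value is None:            # miss: scatter to the database
        db_reads += 1
        value = f"row-{key}"     # stand-in for a real DB query
        cache.set(key, value)
    return value

warm = Cache()
for k in range(100):
    fetch(warm, k)               # populate: 100 DB reads
baseline = db_reads
for k in range(100):
    fetch(warm, k)               # warm traffic: zero new DB reads
assert db_reads == baseline

cold = Cache()                   # "the replacement came up empty"
for k in range(100):
    fetch(cold, k)               # every request now hits the database
```

The database sees its normal miss trickle replaced by the full request rate at once, which is the overload described above.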
FM-004 · Facebook · 2021-10-04 · ~6h · SEV-1
Facebook withdraws its own DNS from the internet
A backbone command issued to assess global capacity unintentionally took down all of Facebook's backbone. The audit tool that was supposed to block such a command had a bug, and the DNS that announced Facebook to the world withdrew itself in response.
Tags: networking, bgp, dns
FM-005 · Fastly · 2021-06-08 · ~1h · SEV-1
A latent CDN bug, woken by a valid config change
A software release shipped 27 days earlier left a latent bug in Fastly's edge platform. A routine, valid customer configuration change triggered it, and 85% of Fastly's network began returning errors within seconds.
Tags: cdn, deploy, config
FM-010 · Slack · 2021-01-04 · ~3h 40m · SEV-2
Slack's first day back: a Transit Gateway runs out of room
On the first Monday after the holiday break, an AWS Transit Gateway saturated under Slack's return-to-work traffic. Packet loss hit the web tier just as autoscaling tried to add 1,200 instances, and the provisioning service collapsed under its own quota and file-descriptor limits.
Tags: cloud, scaling, cascade
FM-001 · Cloudflare · 2019-07-02 · 27m · SEV-1
A WAF rule pegs every Cloudflare CPU at once
A new managed WAF rule contained a regex that backtracked exponentially on live HTTP traffic, spiking CPU to nearly 100% across every edge server worldwide within seconds of deployment.
Tags: networking, deploy, waf
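Catastrophic backtracking is easy to reproduce. A hypothetical pattern (not Cloudflare's actual WAF rule) with nested quantifiers forces the engine to try exponentially many ways to split the input before it can report failure:

```python
import re
import time

# (a+)+ nested inside anchors is a classic catastrophic-backtracking
# pattern: on a near-match, every partition of the 'a's must be tried.
EVIL = re.compile(r"^(a+)+$")

def time_failure(n: int) -> float:
    subject = "a" * n + "b"            # trailing 'b' guarantees the match fails
    start = time.perf_counter()
    assert EVIL.match(subject) is None
    return time.perf_counter() - start

# Adding ten characters multiplies the runtime rather than adding to it,
# which is how a single rule can peg a CPU within seconds of deployment.
fast = time_failure(10)
slow = time_failure(20)
```

Engines with linear-time guarantees (RE2, Rust's regex crate) refuse nested-quantifier blowups by construction, which is one of the mitigations Cloudflare adopted after this incident.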
FM-014 · Google Cloud · 2019-06-02 · 4h 25m · SEV-1
An automation bug deschedules Google's network control plane
A bug in Google's datacenter maintenance automation descheduled the network control plane in multiple physical locations at once. BGP withdrew within minutes, and traffic flowed onto an oversubscribed fail-static path until engineers could rebuild the configuration.
Tags: networking, automation, cloud
FM-002 · GitHub · 2018-10-21 · 24h 11m · SEV-1
A 43-second partition splits GitHub's database for a day
A 43-second network partition between GitHub's East and West Coast sites tripped automatic failover. By the time the partition healed, both coasts had taken writes, and reconciling the split took most of a day.
Tags: database, replication, failover
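The danger of automated failover during a brief partition can be sketched as a toy model (hypothetical data, not GitHub's MySQL topology): once both sides have accepted a write, there is no fast-forward path back, and every divergent row must be reconciled by hand or by policy.

```python
# Toy split-brain model: two replicas diverge during a partition.
east = {"seq": 100}                  # primary before the partition
west = dict(east)                    # in-sync replica

# 43-second partition: orchestration promotes west, but east is
# still up and still reachable by some clients.
east["seq"] += 1
east["order-A"] = "paid"             # write lands on the old primary
west["seq"] += 1
west["order-B"] = "shipped"          # write lands on the new primary

# Partition heals: identical sequence numbers, different contents,
# so neither side can simply replay the other's log.
assert east["seq"] == west["seq"]
assert east != west                  # this divergence took most of a day to untangle
```

Requiring the old primary to fence itself (or demanding a quorum ack before any post-failover write) is the standard way to keep one side of the partition read-only.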
Page 1 of 2