Facebook withdraws its own DNS from the internet.
How a routine command intended to assess backbone capacity instead disconnected Facebook's entire backbone, why the safety tool meant to stop it did not, and why DNS for Facebook, Instagram, WhatsApp, and Oculus then withdrew itself from BGP and erased the company from the internet for hours.
Before anyone could reach Facebook, the internet had to find Facebook's nameservers. Those authoritative DNS servers lived at edge facilities connected back to Facebook data centers by the company's private backbone. Their safety rule was sensible in isolation: if a DNS server could not reach the network it served, it should withdraw its BGP routes so traffic did not land on a server that could not answer. As long as the backbone stayed up, the rule protected users from bad answers. If the backbone disappeared everywhere, every DNS server would conclude it was unsafe to announce itself, and the public internet would lose its path to Facebook, Instagram, WhatsApp, Messenger, and Oculus at once.
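In rough terms, the per-server policy looks like the loop below. This is a minimal Python sketch of the general pattern, not Facebook's implementation: the probe targets, check interval, and the announce/withdraw helpers are all invented stand-ins for whatever health check and BGP-speaker integration the real servers used.

```python
import socket
import time

# Hypothetical internal addresses; in reality the check would probe the data
# centers this edge site serves, over the private backbone.
BACKBONE_PROBE_TARGETS = [("10.0.0.1", 53), ("10.0.1.1", 53)]
CHECK_INTERVAL_SECONDS = 5


def backbone_reachable(timeout: float = 2.0) -> bool:
    """Return True if at least one backbone target accepts a TCP connection."""
    for host, port in BACKBONE_PROBE_TARGETS:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            continue
    return False


def announce_routes() -> None:
    """Placeholder: ask the local BGP speaker to advertise this site's DNS prefixes."""
    print("announce: advertising nameserver prefixes")


def withdraw_routes() -> None:
    """Placeholder: ask the local BGP speaker to withdraw this site's DNS prefixes."""
    print("withdraw: pulling nameserver prefixes")


def run_health_loop() -> None:
    """Announce while the backbone is reachable; withdraw the moment it is not."""
    announced = False
    while True:
        healthy = backbone_reachable()
        if healthy and not announced:
            announce_routes()
            announced = True
        elif not healthy and announced:
            # Correct in isolation: better to vanish than to hand out answers
            # this site cannot back up. Catastrophic if every site does it at once.
            withdraw_routes()
            announced = False
        time.sleep(CHECK_INTERVAL_SECONDS)


if __name__ == "__main__":
    run_health_loop()
```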
On October 4, 2021, an engineer on the backbone team ran a command intended to assess the availability of global backbone capacity. The command was part of routine maintenance work. Facebook had an audit tool that was supposed to catch commands with wide blast radius before they ran. A bug in that tool prevented it from stopping this one. The command proceeded, and Facebook's entire backbone disconnected.
What happened next was the DNS policy doing exactly what it was designed to do. Every authoritative DNS server in every edge facility noticed it could not reach the network it served. Each one withdrew its BGP advertisements. The withdrawals were correct in isolation and catastrophic in aggregate. From the rest of the internet's point of view, Facebook's nameservers had simply ceased to exist. Resolvers had no path to find them. Facebook, Instagram, WhatsApp, Messenger, Workplace, and Oculus all became unresolvable at the same instant.
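Seen from a resolver, the withdrawal showed up not as an error from Facebook but as silence: packets toward the nameservers had no route, so queries simply timed out. The sketch below, assuming the dnspython library and using illustrative addresses standing in for facebook.com's delegated nameservers, shows roughly what probing the authoritative servers directly would have looked like during the outage.

```python
import dns.exception
import dns.resolver

# Illustrative addresses standing in for facebook.com's delegated nameservers.
# With the covering BGP routes withdrawn, packets to them had nowhere to go.
AUTHORITATIVE_SERVERS = ["129.134.30.12", "185.89.218.12"]

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = AUTHORITATIVE_SERVERS

try:
    answer = resolver.resolve("facebook.com", "A", lifetime=5.0)
    print([rr.to_text() for rr in answer])
except dns.exception.Timeout:
    # No route to the nameservers: from the outside this is indistinguishable
    # from the servers not existing at all.
    print("query timed out: no path to the authoritative servers")
except dns.resolver.NoNameservers:
    print("all configured nameservers failed to answer")
```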
The blast radius reached the people who would have fixed it. Facebook's internal tools — remote shells, VPNs, the authentication chain, monitoring, coordination — depended on the same backbone and the same DNS chain that had just gone away. Engineers who could normally have logged in remotely to diagnose the network could not, because the network they were used to logging in over was not there. Coordination moved to channels that did not assume any of Facebook's infrastructure was up.
Recovery required someone physically at the data centers. Facebook had built those facilities to be hard to get into, with strong physical and system security — exactly the right design for normal operating conditions, and exactly the wrong design for an incident in which the recovery path runs through the front door. On-site teams were dispatched, secure access procedures were activated, and the backbone was brought back up carefully, in stages, to avoid power and load spikes that would have caused new failures during the recovery itself. As backbone connectivity returned, the DNS servers re-announced their BGP routes, resolvers worldwide could find Facebook's nameservers again, and the services came back.
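The careful ramp is worth dwelling on: turning everything back on at once risks power draw and cold-cache load spikes that can knock a recovering service straight back over. Below is a toy sketch of the staged pattern, with invented site names, ramp steps, and health gates standing in for whatever real power and load signals the recovery actually used.

```python
import time

# Hypothetical sites and gates; the real recovery checked power, load, and
# cache warm-up before letting more traffic in at each step.
SITES = ["edge-pop-1", "edge-pop-2", "dc-region-a", "dc-region-b"]
RAMP_STEPS = [0.05, 0.25, 0.50, 1.00]  # fraction of normal traffic per step


def site_healthy(site: str) -> bool:
    """Placeholder health gate: power draw, link state, and error rates look sane."""
    return True


def set_traffic_fraction(site: str, fraction: float) -> None:
    """Placeholder: adjust load balancing so the site receives this share of traffic."""
    print(f"{site}: serving {fraction:.0%} of normal traffic")


def staged_bring_up() -> None:
    for site in SITES:
        for fraction in RAMP_STEPS:
            set_traffic_fraction(site, fraction)
            time.sleep(1)  # in reality: wait for metrics to settle
            if not site_healthy(site):
                # Back off rather than push a shaky site into a new failure.
                set_traffic_fraction(site, 0.0)
                raise RuntimeError(f"{site} failed health gate at {fraction:.0%}")


if __name__ == "__main__":
    staged_bring_up()
```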
For about six hours, Facebook disappeared because a maintenance command escaped the audit tool that should have refused it. The DNS policy was sound; the backbone collapse was not. The lesson is sharper because the safeguards were not missing. The audit tool existed. The DNS withdrawal policy existed. Physical security existed. Each safeguard either did exactly what it was designed to do, or failed quietly at the moment it was needed, and the dependencies between them were dense enough that there was no independent path left to fall back on.
"A bug in that audit tool prevented it from properly stopping the command." (Meta Engineering, "More details about the October 4 outage")
From first signal to all-clear: about six hours.
A safety tool that did not stop a dangerous command, and a DNS policy that did its job too well.
The triggering event was a command issued during routine maintenance work to assess the availability of global backbone capacity. The command had a much wider blast radius than intended, and when it ran it disconnected Facebook's backbone — the high-capacity network that ties together the company's data centers and edge facilities.
Facebook had an audit tool whose role was to catch commands like this one before they ran. The tool was designed precisely to stop maintenance operations that would take down infrastructure. A bug in the audit tool let the dangerous command through. The first line of defence was not missing; it was broken, and nobody knew until the moment it was needed.
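For a sense of what such a guardrail looks like, here is a minimal sketch of a pre-execution blast-radius check that refuses overly broad commands and fails closed. The command model, thresholds, and names are invented for illustration; Meta has not published how its audit tool actually worked.

```python
from dataclasses import dataclass


# Invented for illustration: how commands and scope are modeled here is an
# assumption, not Meta's design.
@dataclass
class MaintenanceCommand:
    operator: str
    action: str          # e.g. "drain", "disable", "assess"
    targets: list[str]   # backbone links or routers the command touches


TOTAL_BACKBONE_LINKS = 200
MAX_FRACTION_PER_CHANGE = 0.05  # refuse anything touching more than 5% of links


def audit(command: MaintenanceCommand) -> None:
    """Refuse commands whose blast radius exceeds the allowed fraction.

    The critical design choice is to fail closed: if the check cannot decide,
    the command does not run. A quiet fail-open path is exactly the kind of
    bug that lets a backbone-wide command through.
    """
    fraction = len(command.targets) / TOTAL_BACKBONE_LINKS
    if fraction > MAX_FRACTION_PER_CHANGE:
        raise PermissionError(
            f"{command.action} touches {fraction:.0%} of backbone links; "
            f"limit is {MAX_FRACTION_PER_CHANGE:.0%}"
        )


if __name__ == "__main__":
    cmd = MaintenanceCommand(
        operator="backbone-eng",
        action="assess-capacity",
        targets=[f"link-{i}" for i in range(200)],  # touches the whole backbone
    )
    try:
        audit(cmd)
    except PermissionError as refusal:
        print(f"audit refused the command: {refusal}")
```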
The DNS-withdrawal behaviour was correct in isolation and catastrophic in aggregate. Facebook's authoritative DNS servers sat in smaller edge facilities and used backbone connectivity to reach the data centers that held the actual content. They were configured to withdraw their BGP advertisements if they detected that they could not reach the network they served — a sensible policy that kept traffic away from servers that could not answer queries. When the backbone went down, every DNS server reached that condition at once and withdrew its routes. The public internet no longer had a path to Facebook's nameservers, and everything keyed on Facebook DNS — including some internal tools — failed along with the public products.
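To make "correct in isolation, catastrophic in aggregate" concrete, the toy model below applies the same withdraw-when-unreachable rule at every edge site. A localized failure leaves plenty of sites announcing; a backbone-wide failure drives the count to zero and the nameservers vanish. The site names and reachability model are invented for illustration.

```python
# Toy model: every edge site applies the same rule independently.
EDGE_SITES = [f"edge-{i}" for i in range(10)]


def announced_sites(sites_with_backbone: set[str]) -> list[str]:
    """Each site announces its DNS prefixes only if its backbone link is up."""
    return [site for site in EDGE_SITES if site in sites_with_backbone]


# A localized failure: one site withdraws, nine still answer. The rule works.
partial = announced_sites(set(EDGE_SITES) - {"edge-3"})
print(f"partial failure: {len(partial)} sites still announcing")      # 9

# A backbone-wide failure: every site reaches the same conclusion at once,
# zero announcements remain, and the nameservers disappear from the internet.
total = announced_sites(set())
print(f"backbone-wide failure: {len(total)} sites still announcing")  # 0
```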