~/library/FM-004
FM-004 · Facebook · 2021-10-04 · impact ~6h · SEV-1

Facebook withdraws its own DNS from the internet.

How a routine command intended to assess backbone capacity instead disconnected Facebook's entire backbone, why the safety tool meant to stop it did not, and why DNS for Facebook, Instagram, WhatsApp, and Oculus then withdrew itself from BGP and erased the company from the internet for hours.

networking · bgp · dns

Before anyone could reach Facebook, the internet had to find Facebook's nameservers. Those authoritative DNS servers lived at edge facilities connected back to Facebook data centers by the company's private backbone. Their safety rule was sensible in isolation: if a DNS server could not reach the network it served, it should withdraw its BGP routes so traffic did not land on a server that could not answer. As long as the backbone stayed up, the rule protected users from bad answers. If the backbone disappeared everywhere, every DNS server would conclude it was unsafe to announce itself, and the public internet would lose its path to Facebook, Instagram, WhatsApp, Messenger, and Oculus at once.
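
In rough terms, each edge nameserver ran a control loop of the kind sketched below: probe the backbone, and announce or withdraw the anycast prefix accordingly. This is a hypothetical illustration in Python; the probe targets, the prefix, and the BGP helpers are invented placeholders, since Meta has not published the actual implementation.

```python
# A minimal sketch of the per-server safety rule: if this edge DNS server
# cannot reach the backbone side of the network, withdraw its BGP
# announcement; when the backbone is reachable again, re-announce.
# All names here are hypothetical placeholders.

import socket
import time

BACKBONE_PROBES = [("dc1.internal.example", 443), ("dc2.internal.example", 443)]
NAMESERVER_PREFIX = "198.51.100.0/24"   # placeholder for the anycast prefix this server announces


def backbone_reachable(timeout: float = 2.0) -> bool:
    """True if at least one data-center-side target answers over the backbone."""
    for host, port in BACKBONE_PROBES:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            continue
    return False


def bgp_announce(prefix: str) -> None:
    print(f"announce {prefix}")   # stand-in for telling the local BGP daemon to announce


def bgp_withdraw(prefix: str) -> None:
    print(f"withdraw {prefix}")   # stand-in for withdrawing the route


def control_loop(poll_seconds: float = 5.0) -> None:
    announced = True
    while True:
        healthy = backbone_reachable()
        if healthy and not announced:
            bgp_announce(NAMESERVER_PREFIX)   # safe to attract queries again
            announced = True
        elif not healthy and announced:
            bgp_withdraw(NAMESERVER_PREFIX)   # do not attract queries we cannot answer
            announced = False
        time.sleep(poll_seconds)
```

Run on one server, the loop is unremarkable. The rest of this write-up is about what happens when every server in the fleet takes the "withdraw" branch at the same moment.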

On October 4, 2021, an engineer on the backbone team ran a command intended to assess the availability of global backbone capacity. The command was part of routine maintenance work. Facebook had an audit tool that was supposed to catch commands with wide blast radius before they ran. A bug in that tool prevented it from stopping this one. The command proceeded, and Facebook's entire backbone disconnected.
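
Meta's write-up does not describe the audit tool itself. As a rough illustration of the kind of pre-execution gate it names, the sketch below estimates a command's blast radius and refuses anything over a threshold; every name and number in it is invented.

```python
# A rough illustration of a pre-execution audit gate. The command format,
# the blast-radius heuristic, and the threshold are all invented; the source
# only says such a tool existed and that a bug let this command through.

MAX_ROUTERS_PER_CHANGE = 100


def estimate_affected_routers(command: str) -> int:
    """Invented heuristic: a global capacity drain touches every backbone
    router, a regional one touches a handful."""
    return 10_000 if "--scope=global" in command else 25


def audit(command: str) -> bool:
    """Return True only if the command's estimated blast radius is acceptable."""
    affected = estimate_affected_routers(command)
    if affected > MAX_ROUTERS_PER_CHANGE:
        print(f"refused: {command!r} would touch ~{affected} routers")
        return False
    return True


if __name__ == "__main__":
    assert audit("drain --scope=region:eu") is True
    assert audit("drain --scope=global") is False
```

A gate like this is only as strong as its estimate: a bug that under-counts the affected routers for one command shape lets that command through without any error, which is the quiet failure described above.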

What happened next was the DNS policy doing exactly what it was designed to do. Every authoritative DNS server in every edge facility noticed it could not reach the network it served. Each one withdrew its BGP advertisements. The withdrawals were correct in isolation and catastrophic in aggregate. From the rest of the internet's point of view, Facebook's nameservers had simply ceased to exist. Resolvers had no path to find them. Facebook, Instagram, WhatsApp, Messenger, Workplace, and Oculus all became unresolvable at the same instant.
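
What "unresolvable" looks like from the client side can be seen with a short probe. The sketch below uses the third-party dnspython package and is purely illustrative, not a recording of the event; it distinguishes a name that does not exist from nameservers that cannot be reached or cannot answer, the latter being roughly the shape this outage took from a resolver's point of view.

```python
# Illustrative resolver-side probe using dnspython (pip install dnspython).
import dns.exception
import dns.resolver


def resolve_or_explain(name: str) -> str:
    try:
        answer = dns.resolver.resolve(name, "A")
        return f"{name} -> {[rr.address for rr in answer]}"
    except dns.resolver.NXDOMAIN:
        return f"{name}: the name does not exist (an authoritative server said so)"
    except dns.resolver.NoNameservers:
        # Every nameserver that was asked failed to return a usable answer,
        # e.g. all of them answered SERVFAIL.
        return f"{name}: no nameserver could answer"
    except dns.exception.Timeout:
        return f"{name}: resolution timed out"


print(resolve_or_explain("facebook.com"))
```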

The blast radius reached the people who would have fixed it. Facebook's internal tools — remote shells, VPNs, the authentication chain, monitoring, coordination — depended on the same backbone and the same DNS chain that had just gone away. Engineers who would normally log in remotely to diagnose the network could not, because the network they would have logged in over was gone. Coordination moved to channels that did not assume any of Facebook's infrastructure was up.

Recovery required someone physically at the data centers. Facebook had built those facilities to be hard to get into, with strong physical and system security — exactly the right design for normal operating conditions, and exactly the wrong design for an incident in which the recovery path runs through the front door. On-site teams were dispatched, secure access procedures were activated, and the backbone was brought back up carefully, in stages, to avoid power and load spikes that would have caused new failures during the recovery itself. As backbone connectivity returned, the DNS servers re-announced their BGP routes, resolvers worldwide could find Facebook's nameservers again, and the services came back.
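
The staging can be pictured as a loop that re-enables one slice of the backbone at a time and waits on telemetry before moving on. The sketch below is a hypothetical illustration; the segment names and the checks are invented.

```python
# A sketch of staged bring-up: restore small increments, let power draw and
# traffic settle, and stop if limits are exceeded. Names are placeholders.

import time

SEGMENTS = ["dc-1-uplinks", "dc-2-uplinks", "edge-pop-group-a", "edge-pop-group-b"]


def enable(segment: str) -> None:
    print(f"bringing up {segment}")        # stand-in for re-enabling the segment


def power_and_load_within_limits() -> bool:
    return True                            # stand-in for power and traffic telemetry checks


def staged_restore(settle_seconds: float = 300.0) -> None:
    for segment in SEGMENTS:
        enable(segment)
        time.sleep(settle_seconds)         # let load and power draw settle
        if not power_and_load_within_limits():
            raise RuntimeError(f"halting restore after {segment}: limits exceeded")
```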

For about six hours, Facebook disappeared because a maintenance command escaped the audit tool that should have refused it. The DNS policy was sound; the backbone collapse was not. The lesson is sharper because the safeguards were not missing. The audit tool existed. The DNS withdrawal policy existed. Physical security existed. Each safeguard either did exactly what it was designed to do, or failed quietly at the moment it was needed, and the dependencies between them were dense enough that there was no independent path left to fall back on.

"A bug in that audit tool prevented it from properly stopping the command." (Meta Engineering, More details about the October 4 outage)

From the first signal to all-clear in ~6h.

~15:39 UTC
Engineer issues a backbone-capacity assessment command
An engineer on Facebook's backbone team runs a command intended to assess the availability of capacity across the global backbone. The command is part of routine maintenance work.
~15:39 UTC
Audit tool fails to block the command
Facebook has an audit tool whose job is to prevent commands like this one from being executed if they would have wide impact. A bug in the audit tool prevents it from stopping the command. The command proceeds and disconnects Facebook's entire backbone.
~15:40 UTC
DNS servers withdraw their BGP advertisements
Facebook's authoritative DNS servers in smaller edge facilities detect that they have lost backbone connectivity to the rest of Facebook's network. They follow a configured safety policy: if you cannot reach the network you serve, withdraw your BGP routes so traffic does not arrive at a server that cannot answer.
~15:40 UTC
Facebook disappears from the internet
With the DNS prefixes withdrawn, recursive resolvers worldwide can no longer find Facebook's nameservers. Facebook, Instagram, WhatsApp, Messenger, Workplace, and Oculus become unresolvable. Globally, query rates to Facebook's nameservers fall to zero.
~15:45 UTC
Internal tools fail with the same dependencies
Internal systems used by Facebook engineers depend on the same backbone and the same DNS-driven authentication. With the backbone gone and the names unresolvable, engineers lose access to remote tools they would normally use to diagnose and repair the network. Coordination shifts to channels not dependent on Facebook's infrastructure.
~16:00 UTC
On-site teams dispatched to data centers
Engineers are sent to data centers to debug and restart systems directly. Facility access procedures, designed to be hard to bypass, add delay even with authorized staff on site.
~20:00 UTC
Backbone restoration begins
On-site teams begin bringing the backbone back. The recovery is staged carefully to avoid power and load spikes; bringing the network back too quickly risks new failures as services attempt to come up all at once.
~21:00 UTC
DNS reachable, services begin returning
Backbone connectivity is restored to enough of the network that DNS servers re-announce their BGP routes. Resolvers worldwide can find Facebook's nameservers again. The services begin coming back as traffic ramps up and load rebalances across the fleet.
~21:45 UTC
Services largely restored
Facebook, Instagram, WhatsApp, and the rest of the affected products are largely back. Recovery continues for systems that needed more time to catch up. Total customer-visible impact: about six hours.

A safety tool that did not stop a dangerous command, and a DNS policy that did its job too well.

The triggering event was a command issued during routine maintenance work to assess the availability of global backbone capacity. The command had much wider blast radius than intended, and when it ran it disconnected Facebook's backbone — the high-capacity network that ties together the company's data centers and edge facilities.

Facebook had an audit tool whose role was to catch commands like this one before they ran. The tool was designed precisely to stop maintenance operations that would take down infrastructure. A bug in the audit tool let the dangerous command through. The first line of defence was not missing; it was broken, and nobody knew until the moment it was needed.

The DNS-withdrawal behaviour was correct in isolation and catastrophic in aggregate. Facebook's authoritative DNS servers sat in smaller edge facilities and relied on backbone connectivity to reach the data centers that held the actual content. They were configured to withdraw their BGP advertisements if they detected that they could not reach the network they served — a sensible policy that kept traffic away from servers that could not answer queries. When the backbone went down, every DNS server reached that condition at once and withdrew its routes. The public internet no longer had a path to Facebook's nameservers, and everything keyed on Facebook DNS — including some internal tools — failed along with the public products.

What turned a maintenance command into a six-hour disappearance.

01
Audit tool bug let a high-blast-radius command through
The audit tool that was supposed to stop maintenance commands with wide impact had a bug that prevented it from stopping this one. The safeguard existed in the right place at the right time; it just did not work. The incident showed how dangerous it is to rely on a tool whose correct operation is never tested against the cases it exists for.
02
DNS withdrawal magnified backbone loss into total disappearance
A locally correct policy — withdraw routes when you cannot reach the network you serve — became globally catastrophic when every DNS-serving edge facility met the condition at once. The policy had no global view that would say 'all of us are withdrawing; pause and wait'.
03
Internal tools depended on the same network they were used to debug
VPNs, remote management, authentication, and coordination tools all relied on the same backbone and DNS that had just gone down. Engineers who would have used remote tools to investigate and recover instead had to fall back to channels and procedures that did not assume any of that infrastructure was available.
04
Physical access procedures were designed to slow people down
Facebook's data centers were built to be hard to enter, with strong physical and system security. Under normal conditions those defences are exactly right. During an incident where on-site work is the recovery path, the same defences add time to the recovery — and the time is non-negotiable.
05
Single shared backbone for products with different risk profiles
The backbone carried traffic for Facebook, Instagram, WhatsApp, Messenger, Workplace, and Oculus alike. There was no architectural separation that would allow some properties to keep operating when others lost their network. The blast radius of a backbone event was therefore the whole company at once.

What to take from this incident.

01
Test your safety tools against the disasters they exist for.
An audit tool that catches dangerous commands is only worth what its tests prove. Run synthetic versions of the worst commands you can imagine against the tool on a regular cadence, and treat any case where the tool fails to stop the command as a high-severity incident in its own right.
02
Local safety policies need a global view before they can act in lockstep.
A policy that is right for one server can be wrong for the entire fleet if every server applies it at the same time. DNS withdrawal, traffic shedding, and similar self-protective behaviours need a coordination layer that can see the whole picture and apply a brake when too many components are reacting to the same event; a minimal sketch of such a brake follows this list.
03
Build an incident-response path that does not require the affected infrastructure.
If your VPN, your chat, your monitoring, your auth, and your video calls all sit behind the same DNS and the same backbone, an incident on that backbone takes the responders' tooling down with the product. Keep an out-of-band path — a separate provider, a separate auth chain, a phone bridge — for the case when the primary stack is what is broken.
04
Rehearse physical-access recovery the way you rehearse code rollbacks.
When the recovery path is 'send someone to a data center', the speed of that path matters. Drilling badge issuance, escort, console access, and out-of-band login regularly keeps the on-site path from being the slowest part of a recovery the second time it happens.
05
Treat the audit tool's uptime and correctness as production-critical.
If the audit tool is what stands between a maintenance command and a global outage, it deserves the same redundancy, monitoring, and on-call attention as the systems it protects. A broken audit tool is a silent failure that surfaces only when it is already too late.
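
Takeaway 02 asks for a brake with a fleet-wide view. A minimal sketch of what that check could look like, with an invented coordination interface and an invented threshold:

```python
# A sketch of a 'global view' brake: before an edge site withdraws its routes,
# it consults a fleet-wide view of how many peers have already withdrawn, and
# holds its routes if too large a fraction of the fleet would go dark at once.
# The interface and threshold are invented for illustration.

MAX_WITHDRAWN_FRACTION = 0.5   # never let more than half the fleet self-withdraw


def safe_to_withdraw(site: str, withdrawn_sites: set[str], total_sites: int) -> bool:
    """Local decision checked against a view supplied by a coordination service."""
    would_be_withdrawn = len(withdrawn_sites | {site})
    return would_be_withdrawn / total_sites <= MAX_WITHDRAWN_FRACTION


# Example: if 60 of 100 sites have already withdrawn, the 61st holds its routes
# and keeps serving, on the theory that a fleet-wide trigger is more likely a
# shared upstream failure than 61 independent local ones.
print(safe_to_withdraw("edge-61", {f"edge-{i}" for i in range(60)}, total_sites=100))
```

The obvious caveat, and the reason this is a sketch rather than a recipe, is that the fleet-wide view must not itself depend on the backbone it is guarding; otherwise the brake disappears along with everything else.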

Read the original.

More details about the October 4 outage
engineering.fb.com
Understanding How Facebook Disappeared from the Internet
blog.cloudflare.com