FM-008 · Cloudflare · 2023-11-02 · impact ~36h · SEV-1

Cloudflare's control plane loses its primary facility.

How a Portland General Electric event, a non-standard generator practice, and a UPS that drained in four minutes instead of ten took Cloudflare's most heavily used data center off the air — and why the HA cluster that was supposed to survive its loss did not.

datacenter · ha · control-plane

Cloudflare's edge could keep serving traffic without a live dashboard. That distinction mattered. The data plane was the global edge fleet in more than 300 cities, proxying requests from configuration already distributed to each server. The control plane was the dashboard, API, Logs and Analytics ingestion, Workers KV writes, Stream uploads, and the services customers used to configure and observe that edge. The control plane lived in three core facilities, all in the Portland, Oregon area. As long as losing one facility did not take the others with it, customers kept their management surface. When that assumption failed, the edge stayed up and the controls went dark.

The most heavily used of the three core facilities was Flexential's PDX-04 in Hillsboro, Oregon. It hosted Cloudflare's largest analytics cluster and more than a third of the machines in the HA cluster, and it was the default landing spot for services that had not yet been integrated into HA. Cloudflare used about 10% of the facility's capacity.

On the morning of November 2, 2023, Portland General Electric had an unplanned event that took down one of two utility power feeds to PDX-04. Flexential elected to operate the remaining utility feed and the generators in parallel, rather than fail over between them — a non-standard arrangement. Around 11:40 UTC a ground fault on a PGE transformer damaged the second utility feed and triggered a protective shutdown of every generator at the facility. The site fell back to UPS power for its full load. The UPS systems, rated for ten minutes, drained in about four. The overnight skeleton crew lacked the expertise to restart the generators by hand. PDX-04 went dark.

The edge did not notice. Customer traffic kept flowing through the data plane, proxied by the same edge servers in the same cities. The control plane noticed immediately. Services that were supposed to fail over to the two surviving core facilities did not all come back. Several services on the HA cluster had quietly accumulated dependencies on services running exclusively in PDX-04 — Kafka and ClickHouse pieces, Stream's video upload service, and others — and those dependencies broke the moment the facility was lost. Because Cloudflare had never tested taking PDX-04 fully offline, the drift had been invisible until the facility actually went away.

What followed was not a recovery of one failed component but a manual stand-up of a control plane somewhere other than where it was supposed to live. The team decided at 13:40 UTC to stop waiting for PDX-04 and bring dependent services up in disaster recovery sites instead, in dependency order. The dashboard and API were restored for most customers by 17:57 UTC. PGE replaced the damaged circuit breakers and both utility feeds returned to the facility around 22:48 UTC; Cloudflare equipment came back online carefully through November 3 after validation. All affected services were confirmed restored at 04:25 UTC on November 4.
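
Dependency-ordered bring-up is, at bottom, a topological sort of the service graph. The sketch below illustrates only that idea; the service names and edges are hypothetical, and Python's standard-library graphlib stands in for whatever tooling the actual cutover used.

```python
# Hypothetical sketch: compute a bring-up order for dependent services.
# Not Cloudflare's tooling; names and edges are invented for illustration.
from graphlib import TopologicalSorter

# Each service maps to the set of services that must be up before it starts.
DEPENDS_ON = {
    "postgres":    set(),
    "kafka":       set(),
    "identity":    {"postgres"},
    "api":         {"postgres", "identity"},
    "dashboard":   {"api"},
    "logs-ingest": {"kafka"},
    "analytics":   {"kafka", "postgres"},
}

def bringup_order(depends_on):
    """Order services so each one starts only after all of its dependencies."""
    # TopologicalSorter takes node -> predecessors and yields predecessors first.
    return list(TopologicalSorter(depends_on).static_order())

if __name__ == "__main__":
    for step, service in enumerate(bringup_order(DEPENDS_ON), start=1):
        print(f"{step}. start {service}")
```

The useful property is that every service appears only after everything it depends on, which is the ordering the disaster recovery cutover had to reproduce under pressure.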

The mechanism worth carrying away is not that a colocation facility lost power. Power events happen. The mechanism is that an invariant the system was supposed to honor — that the loss of any one core facility was survivable — had drifted out of true while the design still said it was honored. The drift was invisible until a facility was actually lost. Geographic concentration, third-party power assumptions, and the fact that a full facility loss had never been rehearsed each kept the invariant from being real.

We believed that we had high availability systems in place that should have stopped an outage like this, even when one of our core data center providers failed catastrophically.
// Matthew Prince, Post mortem on the Cloudflare control plane and analytics outage

From the first signal to all-clear in ~36h.

Nov 2, 08:50 UTC
PGE unplanned maintenance disables one power feed
Portland General Electric has an unplanned maintenance event that takes down one of two utility power feeds to Flexential's PDX-04 facility in Hillsboro, Oregon — the site that houses Cloudflare's most heavily used core data center.
Nov 2, ~10:00 UTC
Flexential runs utility and generators together
Rather than fail over to generators, Flexential runs the remaining utility feed and generators in parallel. The combined arrangement is non-standard and means a single fault on the utility side can take down both power sources at once.
Nov 2, 11:40 UTC
Ground fault takes out the second feed and every generator
A ground fault on a PGE transformer damages the second utility feed and triggers a protective shutdown of every generator at the facility. The site falls back to UPS power for the entire load.
Nov 2, 11:43 UTC
Routers go offline; control plane impact begins
Routers and other equipment in PDX-04 start losing power as UPS batteries take the full load. Cloudflare's dashboard, API, and analytics services begin returning errors. The data plane that proxies customer traffic continues operating from pre-distributed configuration on edge servers in more than 300 cities.
Nov 2, 11:44–12:01 UTC
UPS batteries drain in about four minutes
The UPS systems, designed for around ten minutes of capacity, begin failing after about four. The overnight skeleton crew on site lacks the expertise to bring the generators back by hand. By 12:01 UTC the facility is fully dark.
Nov 2, 12:28 UTC
Flexential's first message acknowledging the issue
Flexential sends its first message to Cloudflare acknowledging the incident. Cloudflare has been responding to the impact for about forty-five minutes.
Nov 2, 12:48 UTC
Generators restarted; partial power returns
Generators are restarted and power returns to portions of the facility. The recovery does not bring PDX-04 services back automatically; Cloudflare equipment requires manual validation before being returned to service.
Nov 2, 13:40 UTC
Decision to fail over to disaster recovery sites
Cloudflare decides not to wait for PDX-04 to come back and begins standing up the control plane in its disaster recovery sites instead. The cutover is staged carefully so that dependent services come up in the right order.
Nov 2, 17:57 UTC
Disaster recovery services stabilized
Disaster recovery services are stable. The dashboard, API, and most control plane features are restored for the majority of customers. Some services that had unintended dependencies on PDX-04 — including parts of the Kafka and ClickHouse stack and Stream's video upload service — remain degraded.
Nov 2, 22:48 UTC
Circuit breakers replaced; utility feeds restored
Damaged circuit breakers at PDX-04 are replaced and both utility feeds come back. The facility is now powered, but Cloudflare's equipment in it still needs to be brought up carefully.
Nov 3
PDX-04 services brought back by hand
Cloudflare validates servers and storage in PDX-04 before returning them to service, to avoid bringing inconsistent state back into the network.
Nov 4, 04:25 UTC
All services restored
Cloudflare confirms that all affected services have been restored. Total elapsed time from initial impact to full recovery: about 36 hours. Backfills for Logs and Analytics continue past this point.

A facility loss that was supposed to be survivable, and was not.

Cloudflare's control plane was designed to survive the loss of any one of its three core data centers in the Portland, Oregon area. On the morning of November 2, the most heavily used of the three lost utility power, was then run on a non-standard mix of utility and generator power, and finally lost both the second utility feed and every generator at once when a ground fault hit a PGE transformer. UPS batteries that should have given the facility ten minutes drained in about four. The overnight skeleton crew could not bring the generators back by hand. The facility went dark.

The data plane did not go down with it. Cloudflare's edge servers run from configuration already distributed to each site and continue proxying customer traffic without live contact with the control plane. Throughout the incident, customer requests kept being served. What stopped working was the layer customers used to manage their service: the dashboard, the API, Logs and Analytics ingestion, Workers KV writes, Stream video uploads, and a long list of products whose control paths depended on services hosted in the failed facility.

The deeper failure was that the HA invariant did not hold in practice. Services that were supposed to be on the high availability cluster had dependencies on services exclusively running in PDX-04. Cloudflare had never tested taking the entire facility offline, so the dependency drift was invisible until the facility was actually lost. When PDX-04 went away, the remaining two core sites could not pick up everything that was supposed to fail over to them, and recovery turned into a by-hand cutover to disaster recovery sites rather than a clean automatic failover.
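
One way to keep that invariant honest is to audit it continuously: record where every control plane service runs, walk its transitive dependencies, and flag anything that cannot survive the loss of a single facility. The sketch below is a hypothetical version of such an audit: the facility names (other than PDX-04), the services, and their placements are invented, and a real implementation would read from a service catalog rather than hard-coded dicts.

```python
# Hypothetical facility-survivability audit. Only PDX-04 is a real facility
# name; everything else here is invented for illustration.
FACILITIES = {"PDX-01", "PDX-02", "PDX-04"}

# service -> facilities that host a replica of it (invented placement)
RUNS_IN = {
    "api":            {"PDX-01", "PDX-02", "PDX-04"},
    "dashboard":      {"PDX-01", "PDX-02", "PDX-04"},
    "kafka":          {"PDX-02", "PDX-04"},
    "stream-uploads": {"PDX-04"},
    "clickhouse":     {"PDX-04"},
}

# service -> direct dependencies (invented edges)
DEPENDS_ON = {
    "dashboard":      {"api"},
    "api":            {"kafka", "stream-uploads"},
    "kafka":          set(),
    "stream-uploads": set(),
    "clickhouse":     set(),
}

def closure(service):
    """Every service reachable from `service` through DEPENDS_ON."""
    seen, stack = set(), [service]
    while stack:
        for dep in DEPENDS_ON.get(stack.pop(), ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

def unsurvivable_losses(service):
    """Facilities whose loss takes the service down because the service,
    or something in its dependency closure, runs nowhere else."""
    needed = {service} | closure(service)
    return {f for f in FACILITIES
            if any(not (RUNS_IN[s] - {f}) for s in needed)}

for svc in sorted(RUNS_IN):
    pinned = unsurvivable_losses(svc)
    if pinned:
        print(f"{svc}: cannot survive losing {sorted(pinned)}")
```

In this invented catalog the dashboard is placed in all three facilities, yet the audit still flags it, because a transitive dependency runs only in PDX-04. That is the shape of the drift the incident exposed.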

What turned a single-facility power event into a 36-hour control plane outage.

01
Third-party power infrastructure failed deeper than assumed
Cloudflare relied on Flexential to deliver redundant power through utility and generator paths. On the day of the incident, Flexential ran utility and generators in parallel rather than failing over between them. A single ground fault then took both feeds and every generator at once, and UPS batteries drained in less than half their rated time.
02
HA dependencies on PDX-04 had drifted in silently
Several services that were supposed to be on the HA cluster had accumulated dependencies on services running exclusively in PDX-04. Because the design said the system could lose any facility, those dependencies were not exercised — and because they were not exercised, no one knew they were there. The drift only surfaced when the facility actually went down.
03
Full-facility loss had never been tested
Cloudflare ran service-level failover drills but had never powered off, or fully isolated, PDX-04. The HA story relied on a failover that, in practice, had never been triggered by a real loss of an entire core facility. The team discovered the gap during the incident, not in a drill.
04
Cutover to disaster recovery required by-hand sequencing
Once Cloudflare decided to fail over to disaster recovery sites, dependent services had to be brought up in the right order. That required knowing the dependency graph well enough to sequence it under pressure. Recovery scaled with the number of dependent services, not with the number of failed machines.
05
Geographic concentration of the core control plane
All three core data centers were in the Portland, Oregon metropolitan area. The design protected against a single facility failure but did not protect against correlated events affecting the region. The geographic concentration was a known trade-off; the incident showed what it cost when one facility's power infrastructure failed in unusual ways.

What to take from this incident.

01
Test the loss of a whole facility, not just the loss of services in it.
Service-level failover drills do not exercise the hidden dependencies that appear when a whole facility disappears at once. A periodic exercise that powers off, or fully isolates, one core facility surfaces the cross-product assumptions that would otherwise only be tested by a real outage.
02
Treat facility-level redundancy as an invariant to be enforced.
An audit that maps every control plane service to the facilities it can survive losing is a feature, not a one-off review. New services should not be able to launch in a state that breaks the invariant, and existing services should be flagged when their dependencies drift back to one site.
03
Separate the customer message for data plane and control plane.
Customers experience a control plane outage very differently when their traffic is still being served. Saying clearly that the data plane is up while the dashboard, API, and analytics are degraded lets customers calibrate their response, escalate the right things, and avoid emergency work for impact that is not affecting their users.
04
Treat third-party power and cooling as failure domains, not infinite resources.
A colocation provider can lose utility power, generator power, and UPS capacity in the same event, and can also choose operational configurations that make one fault take both feeds at once. Drills and runbooks should account for full facility loss in finite time, including UPS drain shorter than the rated capacity; a rough sketch of that arithmetic follows this list.
05
Geographic diversity matters even when single-facility redundancy is solid.
Three facilities in one metropolitan area cannot protect against correlated regional events. Splitting core facilities across geographies bounds the worst case and reduces the chance that one provider's incident takes more than one of the core sites at the same time.
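
For the power-chain point above (04), the arithmetic is small enough to write out. The sketch below uses only figures already in the timeline (roughly ten minutes of rated UPS runtime, about four observed, and generators restarted about 68 minutes after the 11:40 UTC shutdown); the constant and function names are invented for illustration.

```python
# Illustrative arithmetic only; constants come from this incident's timeline.
RATED_UPS_MIN = 10            # vendor-rated battery runtime
OBSERVED_UPS_MIN = 4          # roughly what the batteries delivered on Nov 2
GENERATOR_RECOVERY_MIN = 68   # 11:40 UTC shutdown -> 12:48 UTC restart

def dark_window(ups_min, recovery_min):
    """Minutes the facility spends fully dark if the UPS bridges ups_min
    and generators take recovery_min to come back."""
    return max(0, recovery_min - ups_min)

for label, ups in (("rated", RATED_UPS_MIN), ("observed", OBSERVED_UPS_MIN)):
    dark = dark_window(ups, GENERATOR_RECOVERY_MIN)
    print(f"{label} UPS runtime: {ups} min -> facility dark for ~{dark} min")
```

Under either assumption the batteries cannot bridge a generator recovery that takes over an hour, which is why runbooks have to plan for full facility loss rather than a brief dip.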

Read the original.

Post mortem on the Cloudflare control plane and analytics outage
blog.cloudflare.com