Cloudflare's Edge Stayed Up. Its Control Plane Went Dark
How a Portland General Electric event, a non-standard generator practice, and a UPS that drained in four minutes instead of ten took Cloudflare's most heavily used data center off the air — and why the HA cluster that was supposed to survive its loss did not.
Cloudflare's edge could keep serving traffic without a live dashboard. That distinction mattered. The data plane was the global edge fleet in more than 300 cities, proxying requests from configuration already distributed to each server. The control plane was the dashboard, API, Logs and Analytics ingestion, Workers KV writes, Stream uploads, and the services customers used to configure and observe that edge. The control plane lived in three core facilities, all in the Portland, Oregon area. As long as losing one facility did not take the others with it, customers kept their management surface. When that assumption failed, the edge stayed up and the controls went dark.
The most heavily used of the three core facilities was Flexential's PDX-04 in Hillsboro, Oregon. It hosted Cloudflare's largest analytics cluster and more than a third of the HA cluster machines, and it was the default location for services that had not yet been integrated into HA. Cloudflare used about 10% of the facility's capacity.
On the morning of November 2, 2023, Portland General Electric had an unplanned event that took down one of two utility power feeds to PDX-04. Flexential elected to operate the remaining utility feed and the generators in parallel, rather than fail over between them — a non-standard arrangement. Around 11:40 UTC a ground fault on a PGE transformer damaged the second utility feed and triggered a protective shutdown of every generator at the facility. The site fell back to UPS power for its full load. The UPS systems, rated for ten minutes, drained in about four. The overnight skeleton crew lacked the expertise to restart the generators by hand. PDX-04 went dark.
The edge did not notice. Customer traffic kept flowing through the data plane, proxied by the same edge servers in the same cities. The control plane noticed immediately. Services that were supposed to fail over to the two surviving core facilities did not all come back. Several services on the HA cluster had quietly accumulated dependencies on services running exclusively in PDX-04 — Kafka and ClickHouse pieces, Stream's video upload service, and others — and those dependencies broke the moment the facility was lost. Because Cloudflare had never tested taking PDX-04 fully offline, the drift had been invisible until the facility actually went away.
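The kind of drift described here can, in principle, be surfaced before an outage by auditing a service inventory. What follows is a minimal sketch, not Cloudflare's tooling: the service names, the facility labels "core-a" and "core-b", and the shape of the PLACEMENT and DEPENDS_ON maps are all hypothetical. It asks which services would stop working if a given facility disappeared, counting transitive dependencies as well as direct ones.

```python
from typing import Dict, List, Set

# Hypothetical inventory: service -> facilities where it runs.
# "core-a" and "core-b" are placeholder names for the other two core sites.
PLACEMENT: Dict[str, Set[str]] = {
    "dashboard": {"core-a", "core-b", "pdx-04"},
    "api": {"core-a", "core-b", "pdx-04"},
    "auth": {"core-a", "core-b", "pdx-04"},
    "kafka-logs": {"pdx-04"},            # single-homed
    "clickhouse-analytics": {"pdx-04"},  # single-homed
}

# Hypothetical dependency map: service -> services it calls directly.
DEPENDS_ON: Dict[str, List[str]] = {
    "dashboard": ["api", "auth"],
    "api": ["auth", "kafka-logs"],  # the quiet dependency that breaks HA
    "auth": [],
    "kafka-logs": [],
    "clickhouse-analytics": ["kafka-logs"],
}


def transitive_deps(service: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every service reachable from `service` through dependencies."""
    seen: Set[str] = set()
    stack = list(depends_on.get(service, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, []))
    return seen


def services_lost_with(
    facility: str,
    placement: Dict[str, Set[str]],
    depends_on: Dict[str, List[str]],
) -> Set[str]:
    """Services that stop working if `facility` goes dark.

    A service is lost when it, or anything it transitively depends on,
    runs nowhere except the failed facility. A service with no recorded
    placement is treated as lost.
    """
    lost: Set[str] = set()
    for service in placement:
        needed = {service} | transitive_deps(service, depends_on)
        if any(placement.get(name, set()) <= {facility} for name in needed):
            lost.add(service)
    return lost


if __name__ == "__main__":
    for site in ("core-a", "core-b", "pdx-04"):
        print(site, sorted(services_lost_with(site, PLACEMENT, DEPENDS_ON)))
```

In this toy inventory, losing either placeholder site costs nothing, while losing pdx-04 takes the dashboard and API with it even though both are deployed in all three facilities. An audit like this only reflects what the inventory records, which is why it complements an actual facility-loss test rather than replacing one.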
What followed was not a recovery of one failed component but a manual stand-up of a control plane somewhere other than where it was supposed to live. The team decided at 13:40 UTC to stop waiting for PDX-04 and bring dependent services up in disaster recovery sites instead, in dependency order. The dashboard and API were restored for most customers by 17:57 UTC. Flexential replaced the failed circuit breakers and both utility feeds returned to the facility around 22:48 UTC; Cloudflare equipment came back online carefully through November 3 after validation. All affected services were confirmed restored at 04:25 UTC on November 4.
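Bringing services up "in dependency order" is a topological sort: nothing starts until everything it depends on is already running. The sketch below illustrates the idea with Python's standard graphlib module. The service names and the dependency map are hypothetical and do not describe Cloudflare's actual runbook.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical map: service -> services that must be running first.
DEPENDS_ON: Dict[str, Set[str]] = {
    "dashboard": {"api", "auth"},
    "api": {"auth", "config-db"},
    "auth": {"config-db"},
    "analytics": {"config-db"},
    "config-db": set(),
}


def recovery_waves(depends_on: Dict[str, Set[str]]) -> List[List[str]]:
    """Group services into waves that can be started in parallel.

    Every service in a wave has all of its dependencies in earlier waves.
    prepare() raises graphlib.CycleError if the map contains a cycle.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the wave as up, unblocking dependents
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(recovery_waves(DEPENDS_ON), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

Grouping into waves rather than producing one flat list makes it visible which services can be restored concurrently by separate teams and which ones gate everything behind them.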
The mechanism worth carrying away is not that a colocation facility lost power. Power events happen. The mechanism is that an invariant the system was supposed to honor — the loss of any one core facility being survivable — had drifted out of true while the design still said it was honored. The drift was invisible until a facility was actually lost. Geographic concentration, third-party power assumptions, and a never-rehearsed facility-loss drill each kept the invariant from being real.
"We believed that we had high availability systems in place that should have stopped an outage like this, even when one of our core data center providers failed catastrophically." (Matthew Prince, "Post mortem on the Cloudflare control plane and analytics outage")
From the first signal to all-clear: about 41 hours (roughly 11:40 UTC on November 2 to 04:25 UTC on November 4).
A facility loss that was supposed to be survivable, and was not.
Cloudflare's control plane was designed to survive the loss of any one of its three core data centers in the Portland, Oregon area. On the morning of November 2, the most heavily used of the three lost utility power, was then run on a non-standard mix of utility and generator power, and finally lost both the second utility feed and every generator at once when a ground fault hit a PGE transformer. UPS batteries that should have given the facility ten minutes drained in about four. The overnight skeleton crew could not bring the generators back by hand. The facility went dark.
The data plane did not go down with it. Cloudflare's edge servers run from configuration already distributed to each site and continue proxying customer traffic without live contact with the control plane. Throughout the incident, customer requests kept being served. What stopped working was the layer customers used to manage their service: the dashboard, the API, Logs and Analytics ingestion, Workers KV writes, Stream video uploads, and a long list of products whose control paths depended on services hosted in the failed facility.
The deeper failure was that the HA invariant did not hold in practice. Services that were supposed to be on the high availability cluster had dependencies on services exclusively running in PDX-04. Cloudflare had never tested taking the entire facility offline, so the dependency drift was invisible until the facility was actually lost. When PDX-04 went away, the remaining two core sites could not pick up everything that was supposed to fail over to them, and recovery turned into a by-hand cutover to disaster recovery sites rather than a clean automatic failover.