FM-001 · Cloudflare · 2019-07-02 · impact 27m · SEV-1

A WAF rule pegs every Cloudflare CPU at once.

How a regex with one unbounded subpattern, deployed globally with no canary stage, burned every CPU core in Cloudflare's edge network — and why a global kill switch built years earlier turned a long outage into a 27-minute one.

networking · deploy · waf

Every Cloudflare edge server had to make the same promise on every request: inspect quickly, then forward. The Web Application Firewall was part of that promise. It matched HTTP requests against managed rules before traffic reached customer origins. As long as each rule consumed bounded CPU, the WAF behaved like a thin filter. If one rule took unbounded time on common request shapes, inspection stopped being a filter and became the thing that consumed the proxy.

On the afternoon of July 2, 2019, an engineer merged a pull request adding new managed rules aimed at a class of JavaScript-based attacks. The pipeline did what it always did. CI built the rule set, ran the test suite, and approved it for deployment. Five minutes after the build, the rule was on every edge server in the network. Unlike Cloudflare's general software releases, which first reached an internal dogfood point of presence before going global, WAF rules were treated as data and pushed everywhere at once.

One of the new rules contained a regex with an unbounded inner subpattern: `.*(?:.*=.*)`. On most strings it matched or rejected quickly. On a class of live HTTP bodies, it entered catastrophic backtracking — the engine tried every way of splitting the body across the pattern's wildcards before concluding it could not match, work that grows with the square of the body length. The WAF's regex engine had no per-evaluation time limit. Every core that processed HTTP traffic ran the rule on the same request shapes and stalled at nearly 100% CPU at the same time. HTTP and HTTPS traffic across the network collapsed within minutes.
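The blowup is easy to reproduce. Here is a minimal sketch in Python, whose standard `re` module backtracks the way the WAF's engine did; the input sizes are arbitrary:

```python
import re
import time

# The problematic fragment from the incident. Matched against a body
# with no '=', a backtracking engine must try every stopping point for
# the outer '.*' and, for each one, every stopping point for the inner
# '.*' before it can prove no match exists: roughly n^2/2 attempts.
FRAGMENT = re.compile(r".*(?:.*=.*)")

for n in (5_000, 10_000, 20_000):
    body = "x" * n                        # ordinary-looking payload, no '='
    start = time.perf_counter()
    FRAGMENT.match(body)                  # returns None, eventually
    print(f"n={n:>6}  {time.perf_counter() - start:.2f}s")
```

Each doubling of the body roughly quadruples the match time. Run that on large request bodies, on every core, on every edge server at once, and the arithmetic is the outage.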

The blast radius reached Cloudflare's own controls. The dashboard, the API, and the status page all served through the same edge proxy that was now too busy to forward anything. Engineers correlating the CPU spike with the WAF deployment could see the cause, but the usual path to disable the rule depended on the system the rule had just taken down.

What ended the incident was a piece of tooling built years earlier for exactly this category of failure: a global terminate mechanism that disabled the WAF on every edge server without going through the regular configuration path. The team executed it at 14:07 UTC. Two minutes later, traffic and CPU were back to normal. The WAF stayed off while engineers isolated the bad rule, then came back online with the rest of the rule set. Twenty-seven minutes of global edge outage, caused by a single unbounded subpattern in a regex — and a deploy procedure that put rule changes on a fast path that code changes were not allowed to use.

"The SOP for a rule change specifically allows it to be pushed globally. This is very different from all the software we release at Cloudflare where the SOP first pushes software to an internal dogfooding network point of presence."
Cloudflare, "Details of the Cloudflare outage on July 2, 2019"

From the first signal to all-clear in 27m.

13:31 UTC
Pull request merged with new WAF rule
An engineer merges a pull request adding new managed WAF rules targeting JavaScript-based attacks. One of the rules contains a regular expression with an unbounded subpattern.
13:37 UTC
CI builds and tests pass
Cloudflare's build system compiles the rule set and runs the test suite against synthetic inputs. The tests pass and the rule is approved for production deployment.
13:42 UTC
Rule pushed globally in one wave
The standard procedure for WAF rule changes is to deploy them to the entire network at once. The rule lands on every edge server within seconds.
13:45 UTC
First PagerDuty alert from synthetic WAF tests
Synthetic checks against the WAF begin failing. CPU usage on cores handling HTTP and HTTPS traffic climbs toward 100% as the regex enters catastrophic backtracking on live request bodies.
13:47 UTC
HTTP traffic collapses worldwide
HTTP and HTTPS error rates spike globally. The Cloudflare dashboard, API, and status page — all served through the same edge network — also become unreachable. Customer reports begin flooding in.
14:00 UTC
WAF identified as the cause
Engineers correlate the CPU saturation with the just-deployed WAF rule set. Standard rollback paths are slow because the dashboard used to manage rules is itself unreachable through the edge.
14:02 UTC
Global kill switch proposed
The team proposes using a global terminate mechanism — a kill switch built years earlier specifically to disable the WAF without depending on the normal deploy pipeline.
14:07 UTC
Global WAF disabled
Engineers execute the global terminate. The WAF stops evaluating rules on every edge server.
14:09 UTC
CPU and traffic recover
CPU usage drops back to baseline almost immediately. HTTP and HTTPS traffic returns to expected levels worldwide. Total customer impact: 27 minutes.
14:52 UTC
WAF re-enabled after verification
After isolating the bad rule and testing the rest, engineers re-enable the WAF globally with the offending rule removed.

A regex with no bound, a deployment with no canary.

A new WAF rule contained a regular expression with the fragment `.*(?:.*=.*)`. On most inputs the regex matched or failed quickly. On a class of real HTTP request bodies, it triggered catastrophic backtracking — before the engine could decide the string did not match, it had to reject every split of the body between the fragment's wildcards, a number of attempts quadratic in the length of the body. Cloudflare's WAF used a backtracking (non-linear) regex engine and imposed no time bound per evaluation, so nothing cut the runaway match short. Every edge server ran the same rule on the same request shapes and saturated its CPU at the same time.
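A toy model of that count, assuming one unit of work per rejected split (a simplification; real engines do more per step):

```python
# Toy model: splits a backtracking engine must reject to prove that
# '.*(?:.*=.*)' cannot match an n-character body containing no '='.
# The outer '.*' can stop at any of n+1 positions; for each, the inner
# '.*' then tries every remaining stop before the literal '=' fails.
def rejected_splits(n: int) -> int:
    return sum(n - k + 1 for k in range(n + 1))   # = (n+1)(n+2)/2

for n in (10, 100, 1_000, 10_000):
    print(f"{n:>6} chars -> {rejected_splits(n):>12,} failed attempts")
```

At ten thousand characters that is roughly fifty million failed attempts for a single evaluation of the fragment.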

The deployment procedure for WAF rules pushed changes to the entire network in one step. Cloudflare's general software pipeline first deployed to an internal "dogfooding" point of presence, then to a small region, then globally. Rule changes were treated as data, not code, and went global at once. A bad rule had nowhere to fail safely before it reached every customer.
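What the missing gate looks like in outline. This is a hypothetical sketch, not Cloudflare's actual pipeline; the stage names and the `push` and `healthy` hooks are invented for illustration:

```python
import time

# Hypothetical staged rollout for rule changes, mirroring the gate that
# general software releases already passed through: each stage absorbs
# real traffic and must stay healthy before the blast radius widens.
STAGES = ["dogfood-pop", "canary-region", "global"]

def deploy(rule_set, push, healthy, soak_seconds=300):
    """push(rule_set, stage) ships the rules to one stage;
    healthy(stage) compares CPU and error rates against baseline."""
    for stage in STAGES:
        push(rule_set, stage)
        time.sleep(soak_seconds)       # let live traffic exercise the rules
        if not healthy(stage):
            raise RuntimeError(f"rollout halted at {stage}")
    return "deployed globally"
```

Under a gate like this, the bad rule pegs CPU on one dogfood point of presence and the rollout stops there.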

The blast radius extended to Cloudflare's own management surfaces. The dashboard, API, and status page all served through the same edge proxy that was burning CPU on the bad regex. Operators trying to roll the rule back through the normal control plane could not reach it. Recovery only became fast when the team reached for a global kill switch that had been built years earlier to disable the WAF without going through the standard pipeline.

What turned a regex bug into a global outage.

01
Rules deployed globally with no staged rollout
The standard operating procedure for WAF rule changes allowed a direct global push. Other Cloudflare software went through a dogfood point of presence and a small canary region first. Rules were treated as data and skipped that gate.
02
Tests checked correctness, not pathological cost
The CI tests covered functional behavior of the new rule against synthetic inputs. They did not test cost — how long the regex took to evaluate on adversarial or large request bodies. A test that timed out a regex on a worst-case input would have flagged the backtracking before merge; a sketch of such a test follows this list.
03
Regex engine had no per-evaluation CPU budget
The WAF used a backtracking regex engine with no hard time or step limit per evaluation. A bounded engine — RE2, or any engine with a step counter — would have aborted the runaway match and treated the rule as a soft failure rather than a CPU sink.
04
Control plane rode the same edge it was meant to fix
The Cloudflare dashboard, API, and status page all served through the affected edge proxy. When the edge burned CPU, the surface engineers needed to roll back the change was unreachable. The normal recovery path required the broken system to work.
05
A pre-existing kill switch made the difference
Cloudflare had built a global WAF kill switch years before this incident, anticipating exactly this class of failure. It bypassed the normal config pipeline and turned the WAF off in seconds. The kill switch did not prevent the incident; it bounded its duration.
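Cause 02 is mechanically testable. Here is a sketch using only the Python standard library: each candidate pattern runs against adversarial inputs in a child process under a wall-clock budget, because a runaway `re` match cannot be interrupted from within its own thread. The corpus and budget are illustrative:

```python
import multiprocessing
import re

# Inputs chosen to provoke backtracking: long bodies where '=' is
# absent or poorly placed for the pattern under test. Illustrative only.
ADVERSARIAL = ["x" * 50_000, "x=" + "x" * 50_000, " " * 50_000]

def _match(pattern: str, text: str) -> None:
    re.compile(pattern).match(text)

def within_budget(pattern: str, budget_s: float = 1.0) -> bool:
    """True if the pattern finishes every input inside the budget.
    The budget is generous so process startup does not eat it."""
    for text in ADVERSARIAL:
        proc = multiprocessing.Process(target=_match, args=(pattern, text))
        proc.start()
        proc.join(budget_s)
        if proc.is_alive():           # blown budget: kill it, fail the build
            proc.terminate()
            proc.join()
            return False
    return True

if __name__ == "__main__":
    assert within_budget(r"=[a-z]{1,16}")        # bounded pattern: passes
    assert not within_budget(r".*(?:.*=.*)")     # incident fragment: fails
```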

What to take from this incident.

01
Treat rule and config changes with the same caution as code changes.
WAF rules, feature flags, traffic policies, and routing rules all execute on production hot paths. A deployment pipeline that requires canary stages for binaries but pushes config globally has a gap that any operator can fall through. Apply the same staged rollout to anything that touches request processing.
02
Test regular expressions for cost, not just correctness.
Regex engines that backtrack can take exponential time on inputs that look ordinary. CI should run new patterns against fuzz inputs with a hard wall-clock budget, or use a static analyzer that flags nested quantifiers. A regex that passes functionally but takes seconds on a known adversarial string is a production CPU bomb waiting for traffic.
03
Bound every regex evaluation in a hot path.
Use a linear-time engine like RE2 where the pattern set tolerates it, or wrap a backtracking engine in a step or wall-clock limit. A regex that exceeds the budget should fail closed, not run forever. The bound is the thing that turns a worst-case input from an outage into a single request error.
04
Build a kill switch for the systems you operate before you need it.
The fastest path to recovery was a global terminate built years earlier for exactly this class of incident. Kill switches for traffic-shaping systems — WAF, rate limiters, bot management, request filters — should exist, be regularly drilled, and operate independently of the normal config pipeline. A minimal sketch follows this list.
05
Keep incident-response surfaces off the system in outage.
When the dashboard, API, and status page all flow through the same proxy that is failing, customers and operators both lose visibility at the same moment. Host status and break-glass tooling on a path that does not share fate with the data plane it reports on.
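The shape lesson 04 describes, as a minimal sketch. The file path and environment variable are hypothetical; the point is that the check sits in the request path while its trigger travels outside the config pipeline it may need to bypass:

```python
import os

KILL_FILE = "/etc/waf/TERMINATE"       # illustrative break-glass marker

def waf_enabled() -> bool:
    # The switch must not depend on the system it disables: a file
    # dropped by out-of-band tooling, or an env var set on the host,
    # is readable even when the config plane is unreachable.
    if os.environ.get("WAF_TERMINATE") == "1":
        return False
    return not os.path.exists(KILL_FILE)

def handle(request, rules, forward):
    # When the switch is thrown, requests skip rule evaluation
    # entirely and are forwarded: CPU is reclaimed in one step.
    if waf_enabled():
        for rule in rules:
            rule.evaluate(request)
    return forward(request)
```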

Read the original.

Details of the Cloudflare outage on July 2, 2019
blog.cloudflare.com