~/library/FM-026
FM-026Cloudflare2025-12-05impact 25mSEV-2

A WAF killswitch exposed a proxy bug.

While responding to a React Server Components vulnerability, Cloudflare increased WAF body buffering through gradual rollout. A separate global configuration change disabled an internal test tool and triggered a nil execute path in the older FL1 proxy.

wafconfigproxyluaedgekillswitch

Cloudflare was responding to a React Server Components vulnerability by changing how its WAF inspected request bodies. The first change increased the proxy buffer from 128KB to 1MB and used the gradual deployment system. That was the expected high-risk path: customer traffic would be parsed differently, so Cloudflare rolled it out slowly.

The failure came from a second change that looked safer. Cloudflare noticed an internal WAF testing tool did not support the larger buffer and decided to turn that tool off. The tool was not needed for customer traffic, and Cloudflare had a standard killswitch procedure for disabling misbehaving rules. But this switch traveled through the global configuration system, which propagated within seconds rather than through gradual rollout.

In the older FL1 proxy, disabling an execute rule exposed a bug in the rules module. The code skipped evaluation of the sub-ruleset, then later assumed the execute object still existed while processing the overall result. Lua returned a nil-value error, and affected requests received HTTP 500.

Only a subset of customers met the full condition: their sites used the older FL1 proxy and had the Cloudflare Managed Ruleset deployed. Cloudflare reverted the change and traffic recovered within 25 minutes. Internal test infrastructure can share runtime structures with the live request path. Disabling it is still a request-path change, and it needed the same gradual rollout discipline as the buffer-size change it was meant to support.

This system does not perform gradual rollouts// Cloudflare postmortem, December 2025

From the first signal to all-clear in 25m.

08:47 UTC
Configuration change propagates
A global configuration change disabled Cloudflare's internal WAF testing tool. The change propagated across the network within seconds.
08:48 UTC
Full impact begins
The change reached the fleet and triggered an FL1 proxy bug. Affected requests returned HTTP 500 errors.
08:50 UTC
Incident declared
Automated alerts fired and Cloudflare declared an incident shortly after the impact started.
09:11 UTC
Change reverted
Cloudflare reverted the configuration change and started propagating the revert through the network.
09:12 UTC
Traffic restored
The revert finished propagating, and all traffic was served correctly again.

A global killswitch removed an execute action the old proxy still expected.

The immediate cause was a global configuration change that disabled an internal WAF rule testing tool. In the older FL1 proxy, that change caused rules module code to reach a path where it attempted to index an `execute` field that was nil, producing Lua exceptions and HTTP 500 responses.

The deeper cause was a split rollout model. The WAF buffer-size increase was using gradual deployment, but the second change used a global configuration system that propagated within seconds and did not perform gradual rollouts. The change looked operationally low-risk because it disabled an internal test tool, but it still modified live request-processing behavior.

What turned emergency mitigation into customer errors.

01
The second change bypassed gradual rollout.
Cloudflare was rolling out the buffer increase gradually, but the testing-tool disablement used a faster global configuration path. The safer rollout discipline did not apply to the change that failed.
02
Legacy proxy behavior differed.
Only customers on the older FL1 proxy with the Managed Ruleset were affected. Mixed runtime versions increase the number of code paths a configuration change must be validated against.
03
A testing feature was still in the request path.
The internal testing tool had no intended customer effect, but its ruleset integration used the same execute mechanism as live rule evaluation. Disabling it still changed live proxy behavior.

What to take from this incident.

01
Apply rollout controls to every request-path config change.A change that disables an internal tool can still affect production if it changes runtime structures. Use canaries and runtime-version coverage for all request-path configuration.
02
Test emergency mitigations against legacy runtimes.Incident response often uses less common switches. Those switches need automated coverage across old and new runtime versions, especially when migrations are in progress.
03
Prefer inert states over missing fields.A killswitch should turn behavior into an explicit no-op, not remove fields that downstream code may assume exist. Preserve structural invariants even when disabling behavior.

Read the original.

Cloudflare outage on December 5, 2025
blog.cloudflare.com
← previous
FM-025 · Workers KV storage dependency outage
next →
FM-027 · The Runner Cache Bug That Queued Ubuntu CI Jobs