An oversized bot feature file stopped core traffic.
A database permission change altered a ClickHouse metadata query used to generate Bot Management features. The resulting file doubled in size, exceeded a proxy limit, and caused widespread HTTP 5xx responses.
Cloudflare's Bot Management system kept part of its detection behavior in a generated feature file. The core proxy read that file so it could classify changing bot traffic without waiting for a full software release. The promise was speed: regenerate the file, distribute it globally, and let the network adapt.
The hidden assumption was that an internally generated file would stay inside the shape the proxy expected. A database access control change altered what a ClickHouse metadata query returned. The feature generator did not filter tightly enough, so it produced duplicate feature rows. The Bot Management feature file doubled in size and crossed a limit in the proxy module.
The permissions change rolled across the ClickHouse cluster gradually, so the feature file alternated between good and bad versions depending on which node generated it. From the inside, the symptoms looked like an attack: error rates rose and fell on a pattern the network had not seen before. Cloudflare first chased the wrong cause. Workers KV, Access, Turnstile, and dashboard login failures all surfaced because they shared the same core proxy path, and the team investigated those symptoms before identifying the Bot Management file as the source of widespread HTTP 5xx responses.
Recovery required stopping new bad files, restoring a known-good file, and restarting downstream systems that had accumulated load while traffic failed. The database permissions change was the trigger, but the failure ran through a fast global configuration path that treated generated internal data as safe input, so the file was first validated only after it had reached the live request path.
"That feature file, in turn, doubled in size."
Cloudflare postmortem, November 2025
A ClickHouse permissions change began rolling through the database cluster used by Bot Management feature-file generation.
11:28 UTC
Customer traffic starts returning errors
The generated feature file reached customer environments and the first HTTP 5xx errors appeared on core traffic.
11:32 UTC
Teams investigate Workers KV symptoms
Initial investigation focused on elevated Workers KV errors and apparent downstream impact. Automated tests had detected the issue at 11:31 and manual investigation began at 11:32.
13:05 UTC
Bypasses reduce some downstream impact
Cloudflare used internal bypasses for Workers KV and Access so they could fall back to an older proxy path, reducing but not eliminating the impact.
14:24 UTC
Bad Bot Management file identified
Engineers identified the Bot Management configuration file as the source of the HTTP 5xx errors and stopped automatic generation and propagation of new feature files.
17:06 UTC
All services restored
After a known-good feature file was propagated and affected services restarted, Cloudflare reported all systems functioning normally.
A metadata query treated hidden rows as product features.
The immediate cause was a ClickHouse query behavior change after a database permissions rollout. A metadata query used by Bot Management feature-file generation did not filter by database name, so it began returning duplicate rows from underlying tables. That produced a feature file with more than 200 features, above the proxy module's runtime limit.
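The effect of the missing filter can be sketched in a few lines of Python. The row shape, the database names `default` and `underlying_store`, the table name `features`, and both helper functions are hypothetical stand-ins for the metadata query's output, not Cloudflare's actual schema or code. The only point is that a read filtered by table alone returns each column once per visible database.

```python
from typing import NamedTuple


class ColumnRow(NamedTuple):
    """One row of table metadata (hypothetical shape, for illustration)."""
    database: str
    table: str
    name: str


def feature_names_unfiltered(rows: list[ColumnRow], table: str) -> list[str]:
    # Mirrors the faulty behavior: filter by table only, not by database.
    return [r.name for r in rows if r.table == table]


def feature_names_filtered(rows: list[ColumnRow], table: str, database: str) -> list[str]:
    # Also filter by database, so newly visible databases cannot add rows.
    return [r.name for r in rows if r.table == table and r.database == database]


# Before the permissions change, only one database is visible to the query.
before = [ColumnRow("default", "features", n) for n in ("f1", "f2", "f3")]
# Afterwards, the same columns are also visible through an underlying database.
after = before + [ColumnRow("underlying_store", "features", n) for n in ("f1", "f2", "f3")]

print(len(feature_names_unfiltered(before, "features")))          # 3
print(len(feature_names_unfiltered(after, "features")))           # 6
print(len(feature_names_filtered(after, "features", "default")))  # 3
```

The application code is identical in the first two calls; only the visible row set changed, which is why the output doubled without any deploy on the generator side.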
The deeper cause was that internally generated configuration was trusted too much. The feature file was regenerated every few minutes and distributed rapidly worldwide, but the ingestion path did not validate its size and semantics as defensively as it would validate user input. Once the oversized file reached the proxy fleet, the Bot Management module hit an unhandled error path.
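A defensive ingestion check of the kind described here might look like the following minimal sketch. The 200-feature limit comes from this write-up; the byte budget, the function name `check_feature_file`, and the specific invariants (non-empty, no duplicate names) are assumptions chosen for illustration, not a description of Cloudflare's validation.

```python
from collections import Counter

FEATURE_LIMIT = 200          # the feature limit named in this write-up
MAX_FILE_BYTES = 512 * 1024  # hypothetical size budget, for illustration only


def check_feature_file(names: list[str], size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file may be loaded."""
    problems: list[str] = []
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file is {size_bytes} bytes, budget is {MAX_FILE_BYTES}")
    if not names:
        problems.append("file contains no features")
    if len(names) > FEATURE_LIMIT:
        problems.append(f"{len(names)} features exceeds limit of {FEATURE_LIMIT}")
    # Semantic invariant: each feature name should appear exactly once.
    duplicates = sorted(n for n, count in Counter(names).items() if count > 1)
    if duplicates:
        problems.append(f"{len(duplicates)} duplicated feature names, e.g. {duplicates[0]}")
    return problems


good = [f"feature_{i}" for i in range(120)]
doubled = good + good  # same names emitted twice, as duplicate metadata rows would

print(check_feature_file(good, size_bytes=40_000))  # []
for problem in check_feature_file(doubled, size_bytes=80_000):
    print(problem)
```

Returning every problem rather than stopping at the first makes the rejection message more useful to whoever is on call: a file that is both over the limit and full of duplicates points at the generator, not at organic feature growth.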
What turned a query change into a core traffic outage.
01
The generator depended on database metadata shape.
The feature pipeline treated ClickHouse system-table output as stable input. A permissions change that exposed additional table metadata changed the row set without changing the application code.
02
Rapid global propagation skipped a safety brake.
The feature file existed to react quickly to bot behavior and was published across the network every few minutes. That speed meant a bad file could reach the global proxy fleet before operators understood the symptom.
03
The proxy limit failed closed with 5xx errors.
The module had a 200-feature limit for memory preallocation, but exceeding it produced an unhandled failure rather than a rejected file or degraded Bot Management behavior. A protective limit became an availability fault.
04
Early symptoms pointed at the wrong dependency.
Workers KV and Access errors were visible because those services relied on the core proxy. The team initially chased those symptoms, delaying focus on the Bot Management configuration file.
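The third factor, a limit that turned into an unhandled failure, can be contrasted with a reload path that rejects the artifact and carries on. This is a sketch under stated assumptions: the class `BotModule`, its `try_reload` method, and the rejection counter are hypothetical, and the real proxy module is not written this way. Only the 200-feature limit is taken from the text above.

```python
import logging

log = logging.getLogger("bot_module")

FEATURE_LIMIT = 200  # preallocation limit described above


class BotModule:
    """Hypothetical consumer of the feature file inside a proxy."""

    def __init__(self, initial_features: list[str]) -> None:
        if len(initial_features) > FEATURE_LIMIT:
            raise ValueError("initial feature set exceeds the limit")
        self.features = list(initial_features)
        self.rejected_reloads = 0  # exported as a metric in a real system

    def try_reload(self, candidate: list[str]) -> bool:
        """Swap in a new feature set, or keep the current one if it is invalid."""
        if len(candidate) > FEATURE_LIMIT:
            # Reject the artifact instead of failing the request path.
            self.rejected_reloads += 1
            log.warning(
                "rejected feature file: %d features, limit %d; keeping previous set",
                len(candidate), FEATURE_LIMIT,
            )
            return False
        self.features = list(candidate)
        return True


module = BotModule([f"feature_{i}" for i in range(120)])
oversized = [f"feature_{i}" for i in range(240)]

print(module.try_reload(oversized))  # False
print(len(module.features))          # 120
print(module.rejected_reloads)       # 1
```

The design choice is that the limit still protects memory, but its violation surfaces as a counter and a log line while requests keep being served with the previous feature set.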
01
Validate generated config like external input.
Configuration produced by internal systems can still be malformed. Enforce size limits, schema checks, semantic invariants, and canary parsing before global propagation.
02
Make protective limits degrade deliberately.
A memory or feature-count limit should reject the bad artifact, keep the previous known-good artifact, or disable a narrow function. It should not crash the request path.
03
Keep a fast global rollback path for config.
High-frequency configuration needs a last-known-good restore path that is rehearsed and observable. Recovery should not depend on manually reconstructing the artifact under incident pressure.
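A last-known-good restore path can be as small as a versioned store that remembers which versions were confirmed healthy. The sketch below is illustrative only: `FeatureFileStore`, its methods, and the idea of marking a version good after some health check are assumptions, not a description of Cloudflare's tooling.

```python
from typing import Optional


class FeatureFileStore:
    """Hypothetical versioned store with a one-call rollback."""

    def __init__(self) -> None:
        self._payloads: dict[int, bytes] = {}
        self._known_good: list[int] = []  # versions confirmed healthy, oldest first
        self._next_version = 1
        self.active: Optional[int] = None

    def publish(self, payload: bytes) -> int:
        """Store a new version and make it active."""
        version = self._next_version
        self._next_version += 1
        self._payloads[version] = payload
        self.active = version
        return version

    def mark_known_good(self, version: int) -> None:
        """Record that a version has been observed healthy in production."""
        if version not in self._payloads:
            raise KeyError(version)
        if version not in self._known_good:
            self._known_good.append(version)

    def rollback(self) -> int:
        """Activate the newest known-good version other than the active one."""
        candidates = [v for v in self._known_good if v != self.active]
        if not candidates:
            raise RuntimeError("no known-good version to restore")
        self.active = candidates[-1]
        return self.active

    def active_payload(self) -> bytes:
        if self.active is None:
            raise RuntimeError("nothing has been published")
        return self._payloads[self.active]


store = FeatureFileStore()
v1 = store.publish(b"good file")
store.mark_known_good(v1)
store.publish(b"oversized file")  # never marked good

print(store.rollback())        # 1
print(store.active_payload())  # b'good file'
```

Because the known-good artifact is kept verbatim, restoring it is a pointer change rather than a regeneration, which is the property the lesson asks for.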