FM-025 · Cloudflare · 2025-06-12 · impact 2h 28m · SEV-1

Workers KV lost its central storage dependency.

Cloudflare built many products on Workers KV, but KV still depended on a central storage source of truth partly backed by a third-party provider. When that storage failed, cold reads and writes failed across dependent products.

storage · kv · third-party · edge · cache · vendor

summary

Workers KV looked like an edge service to many of the products built on it. Reads could come from cache close to users, and Cloudflare used KV as a common building block for configuration, identity, routing, assets, and product state. The condition underneath that promise was narrower: any uncached key or write still needed the origin storage backends.

On June 12, 2025, a third-party vendor failure affected that storage layer. Cloudflare emphasized that it remained responsible for the dependency choice and the architecture around it. Cached KV reads could still succeed, but cold reads and writes returned 500 or 503 responses. Workers KV reported a 90.22% request failure rate during the incident window.

The cascade followed the dependency graph. Access could not fetch identity and policy configuration, so identity-based logins failed closed. Gateway could not retrieve up-to-date identity and device posture information for some rules. WARP registrations failed. Turnstile and Challenges used kill switches to avoid blocking users. Pages builds, Workers AI inference, Stream playlists, Realtime TURN, and other products saw their own failures.

Recovery began as the underlying storage infrastructure recovered, but Cloudflare still had to manage service-specific mitigations and cache refill pressure. The outage exposed a dependency that products using KV for configuration, identity, routing, or assets had assumed away: behind the edge cache, cold reads and writes still required a central origin.

Workers KV saw 90.22% of requests failing // Cloudflare postmortem, June 2025

timeline · UTC

From the first signal to all-clear in 2h 28m.

17:52 UTC

WARP device registrations fail

The WARP team saw new device registrations fail and declared an incident. The underlying issue was Workers KV unavailability.

18:05 UTC

Access error alerts fire

Cloudflare Access received alerts for rapidly increasing error rates, and multiple service SLOs dropped below target.

18:21 UTC

Incident upgraded to P0

The severity became clear as Workers KV failures affected Access, Gateway, WARP, Dashboard login, Turnstile, Workers AI, Pages, Queues, D1, Durable Objects, and other services.

19:09 UTC

Services reduce KV dependence

Zero Trust Gateway began degrading rules that referenced identity or device posture state so it could remove some Workers KV dependency.

20:23 UTC

Storage infrastructure recovers

Services began recovering as the underlying storage infrastructure recovered, though cache repopulation and rate limits kept error rates nonzero.

20:28 UTC

Impact ends

Service-level objectives returned to pre-incident levels and affected services returned to normal function.

root cause

The edge cache missed and the origin store was down.

The immediate cause was failure in the underlying storage infrastructure used by Workers KV. KV cached some reads at the edge, but cold reads and writes required the central storage backend. When that backend failed, requests that could not be served from cache returned 500 or 503 errors.

The deeper cause was dependency concentration. Workers KV was described as coreless because it ran independently in locations worldwide, but it still relied on a central source of truth while Cloudflare was migrating its backend architecture. Many Cloudflare products then used KV for configuration, identity, routing, and asset delivery, turning KV's storage dependency into their dependency.

contributing factors

What turned one storage dependency into many product outages.

KV was a shared platform primitive.

Access, Gateway, WARP, Workers AI, Pages, Queues, Turnstile, Dashboard login, D1, Durable Objects, and other services relied on KV directly or indirectly. The shared platform simplified product development but widened the blast radius.

Cold paths still needed origin storage.

Cached KV reads could continue, but any uncached key or write had to reach the storage backend. That made cache warmth the difference between success and failure during the outage.
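The warm-path versus cold-path split can be illustrated with a minimal read-through cache. This is a sketch under stated assumptions, not Cloudflare's implementation: the `Origin` and `EdgeKV` classes, the key names, and the status-code mapping are all hypothetical.

```python
class OriginUnavailable(Exception):
    """Raised when the central store cannot be reached."""


class Origin:
    """Hypothetical stand-in for the central storage backend."""

    def __init__(self):
        self.data = {}
        self.up = True

    def get(self, key):
        if not self.up:
            raise OriginUnavailable(key)
        return self.data.get(key)

    def put(self, key, value):
        if not self.up:
            raise OriginUnavailable(key)
        self.data[key] = value


class EdgeKV:
    """Hypothetical read-through edge cache in front of the origin."""

    def __init__(self, origin):
        self.origin = origin
        self.cache = {}

    def read(self, key):
        if key in self.cache:
            # Warm path: the origin is never consulted.
            return 200, self.cache[key]
        try:
            # Cold path: an uncached key must reach the origin.
            value = self.origin.get(key)
        except OriginUnavailable:
            return 503, None
        if value is None:
            return 404, None
        self.cache[key] = value
        return 200, value

    def write(self, key, value):
        try:
            # Writes always need the origin, cached or not.
            self.origin.put(key, value)
        except OriginUnavailable:
            return 503
        self.cache[key] = value
        return 200


origin = Origin()
origin.data.update({"policy:team-a": "allow", "policy:team-b": "deny"})
kv = EdgeKV(origin)

kv.read("policy:team-a")      # warm one key before the outage
origin.up = False             # simulate the storage failure

warm = kv.read("policy:team-a")
cold = kv.read("policy:team-b")
write_status = kv.write("policy:team-c", "allow")
```

In this toy model the only difference between the two reads is whether the key happened to be cached before the origin went down, which is the sense in which cache warmth decided success or failure.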

Security products failed closed.

Access and Gateway relied on KV for policy and identity information. When that data could not be fetched, they failed closed to avoid bypassing customer rules, which protected policy semantics but amplified user-visible unavailability.
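The fail-closed trade-off can be sketched as a single decision point. The function and exception names here are hypothetical, and the policy format (a set of allowed users) is an assumption chosen only to keep the example small.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"


class PolicyUnavailable(Exception):
    """Raised when policy data cannot be fetched."""


def evaluate(user, fetch_policy, fail_closed=True):
    """Return an access decision, choosing a default when policy is missing."""
    try:
        allowed_users = fetch_policy()
    except PolicyUnavailable:
        # Fail closed: deny everyone rather than risk bypassing a rule.
        # Fail open: allow everyone rather than block legitimate users.
        return Decision.DENY if fail_closed else Decision.ALLOW
    return Decision.ALLOW if user in allowed_users else Decision.DENY


def healthy_store():
    return {"alice"}


def failed_store():
    raise PolicyUnavailable("policy store unreachable")
```

With `failed_store`, a fail-closed evaluator denies even `alice`, who would normally be allowed. That preserves policy semantics at the cost of availability, which matches the behavior described for Access and Gateway.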

Recovery risk included cache repopulation.

As storage recovered, many dependent services repopulated caches and resumed reads at once. Cloudflare had to consider rate limits and progressive re-enablement rather than simply turning all namespaces back on.

lessons

What to take from this incident.

Map platform primitives as shared failure domains.

A dependency used for config, auth, routing, and assets deserves the same failure-domain treatment as a database or network backbone. Product teams should know exactly what fails when it is unavailable.
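One way to make a shared failure domain visible is to compute the transitive set of dependents for each primitive. The sketch below assumes a hand-maintained dependency map; the service names and edges are invented for illustration and are not Cloudflare's actual dependency graph.

```python
from collections import deque

# Hypothetical edges, read as "service depends on each listed dependency".
DEPENDS_ON = {
    "kv": ["origin-storage"],
    "auth": ["kv"],
    "gateway": ["kv"],
    "dashboard-login": ["auth"],
    "static-assets": ["kv"],
    "billing": ["sql-db"],
}


def blast_radius(failed, depends_on):
    """Return every service that directly or indirectly depends on `failed`."""
    # Invert the map so each dependency points at its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk over the inverted edges.
    impacted = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for service in dependents.get(node, []):
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted


impacted = blast_radius("origin-storage", DEPENDS_ON)
```

Running the same query for every primitive and sorting by the size of the result gives a rough ranking of which dependencies deserve failure-domain review first.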

Provide stale-safe modes for critical config.

Security-sensitive services may need to fail closed, but they can often continue with bounded-staleness policy snapshots, signed local cache, or reduced rule sets. Define those modes before the dependency fails.
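A bounded-staleness snapshot might look roughly like the following. It is a minimal sketch: the `PolicySnapshot` class, the use of `ConnectionError` to signal a store outage, and the 300-second bound are all assumptions, and signing or integrity checks on the snapshot are omitted.

```python
import time


class SnapshotExpired(Exception):
    """Raised when no sufficiently fresh snapshot is available."""


class PolicySnapshot:
    """Serve the last good policy for a bounded time when the store is down."""

    def __init__(self, fetch, max_staleness_s, clock=time.monotonic):
        self.fetch = fetch
        self.max_staleness_s = max_staleness_s
        self.clock = clock
        self.value = None
        self.fetched_at = None

    def get(self):
        try:
            self.value = self.fetch()
            self.fetched_at = self.clock()
            return self.value
        except ConnectionError:
            if self.fetched_at is None:
                raise SnapshotExpired("no snapshot was ever fetched")
            age = self.clock() - self.fetched_at
            if age > self.max_staleness_s:
                # Past the bound: stop serving and let the caller fail closed.
                raise SnapshotExpired(f"snapshot is {age:.0f}s old")
            return self.value


# Usage with a fake clock and a store that can be switched off.
now = [0.0]
store = {"up": True}


def fetch_policy():
    if not store["up"]:
        raise ConnectionError("policy store down")
    return {"rule": "v1"}


snapshot = PolicySnapshot(fetch_policy, max_staleness_s=300, clock=lambda: now[0])
snapshot.get()            # healthy fetch at t=0
store["up"] = False
now[0] = 120.0
stale_ok = snapshot.get() # still inside the 300s bound
```

The design choice is that the staleness bound is decided ahead of time, so the service degrades to its last known policy for a limited window and only then falls back to failing closed.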

Throttle re-enablement after storage recovery.

Cache refill can overload a recovering store. Re-enable namespaces and tenants progressively, with priority for identity, access, and other critical control paths.
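Progressive re-enablement can be sketched as priority-ordered waves with a health gate between them. The namespace names, priority numbers, wave size, and the `enable` and `healthy` callbacks are hypothetical; a real rollout would also need rate limits and a way to resume after a pause.

```python
def plan_waves(namespaces, wave_size):
    """Group (name, priority) pairs into waves; lower priority number goes first."""
    if wave_size < 1:
        raise ValueError("wave_size must be at least 1")
    ordered = sorted(namespaces, key=lambda ns: ns[1])  # stable for ties
    names = [name for name, _ in ordered]
    return [names[i:i + wave_size] for i in range(0, len(names), wave_size)]


def rollout(waves, enable, healthy):
    """Enable one wave at a time, pausing as soon as the store looks unhealthy."""
    enabled = []
    for wave in waves:
        for name in wave:
            enable(name)
            enabled.append(name)
        if not healthy():
            break  # stop here so refill traffic does not pile onto a weak store
    return enabled


namespaces = [
    ("assets", 2),
    ("identity", 0),
    ("analytics", 3),
    ("access-policy", 0),
    ("routing", 1),
]
waves = plan_waves(namespaces, wave_size=2)
```

Identity and access namespaces land in the first wave because they carry the lowest priority number, and lower-priority namespaces wait until the health check passes after each earlier wave.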

sources

Read the original.

Cloudflare service outage June 12, 2025

blog.cloudflare.com
