FM-025 · Cloudflare · 2025-06-12 · impact 2h 28m · SEV-1

One unavailable database broke many Cloudflare products

Workers KV kept frequently requested data near users, but a central storage system still held the authoritative copy. When that system failed, Cloudflare could serve only data that was already cached.

storage kv third-party edge cache vendor

summary

Workers KV is a key-value database. An application stores a value under a key and retrieves it later, much like using a server-side cache. Cloudflare runs Workers KV across its worldwide network so frequently requested values can be served from a location near the user.

The nearby copy is only a cache, however. When a requested key is not already stored in the local cache, Workers KV must fetch it from central backing storage. Cloudflare calls this a cold read; from the application's perspective it is a cache miss. Writes also go through the backing storage, because it holds the authoritative copy: the definitive version of each value, kept in one central place.

On June 12, 2025, an outage at a third-party provider disrupted that backing storage. Already-cached reads could still work, but uncached reads and writes returned HTTP 500 or 503. Workers KV reported that 90.22% of requests failed. No data was lost. The service simply could not reach the authoritative copy while the backend was unavailable.

The failure then spread through products that used KV. Access, Cloudflare's authentication product, could not load identity and policy data, so logins were denied. Gateway could not evaluate some security rules. New WARP device registrations failed. Pages builds and Workers AI requests also failed because they needed configuration or routing data from KV.

Recovery was not instantaneous when storage returned. Services around the world tried to refill their caches at the same time. That surge of requests hit a backend that was still stabilizing, causing more errors.

The incident follows a recognizable distributed-systems pattern. Many features shared one data dependency, their caches covered only some reads, and when the central store went down, most products had no useful fallback.

Workers KV saw 90.22% of requests failing // Cloudflare postmortem, June 2025

timeline · UTC

From the first signal to all-clear in 2h 28m.

17:52 UTC

WARP device registrations fail

The WARP team saw new device registrations fail and declared an incident. The underlying issue was Workers KV unavailability.

18:05 UTC

Access error alerts fire

Cloudflare's authentication product reported rapidly increasing errors. Reliability targets for several other products also dropped below their expected levels.

18:21 UTC

Incident reaches highest severity

Cloudflare declared its highest incident level as the shared database failure spread across authentication, security, developer, and AI products.

19:09 UTC

Services reduce KV dependence

Cloudflare's secure web gateway temporarily simplified rules that needed user identity or device data, allowing some traffic to stop querying Workers KV.

20:23 UTC

Storage infrastructure recovers

The backing storage came back online, but many services tried to refill their caches at once. That surge, combined with storage rate limits, kept some requests failing.

20:28 UTC

Impact ends

Error rates returned to their normal levels and the affected products worked again.

root cause

The cache could hide the failure only when it already had the data.

Think of Workers KV as a globally distributed key-value database with a cache in many Cloudflare locations. A nearby cache could answer a read if it already held the requested value. If it did not, Workers KV had to fetch the value from central backing storage. Every write also had to reach that storage because it held the authoritative copy.

Part of that backing storage depended on a third-party cloud provider. When the provider had an outage, uncached reads and all writes failed with HTTP 500 or 503 errors. Cloudflare was partway through redesigning this backend and said the transition had left a gap in redundancy. Because many Cloudflare products stored important data in KV, the storage failure became an authentication, security, build, and AI outage too.
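The read and write paths described above can be illustrated with a small simulation. This is a minimal sketch, not Cloudflare's implementation: `BackingStore`, `EdgeCache`, and the key names are hypothetical, and the status codes simply mirror the 503 responses mentioned in the write-up.

```python
class BackendUnavailable(Exception):
    """Raised when the central backing store cannot be reached."""


class BackingStore:
    """Hypothetical central store holding the authoritative copy."""

    def __init__(self):
        self.data = {}
        self.available = True

    def get(self, key):
        if not self.available:
            raise BackendUnavailable("backing store unreachable")
        return self.data.get(key)

    def put(self, key, value):
        if not self.available:
            raise BackendUnavailable("backing store unreachable")
        self.data[key] = value


class EdgeCache:
    """Hypothetical edge location: a local cache in front of the store."""

    def __init__(self, store):
        self.store = store
        self.local = {}

    def read(self, key):
        if key in self.local:
            # Warm read: answered locally, the backend is never contacted.
            return 200, self.local[key]
        try:
            # Cold read: the value must come from the backing store.
            value = self.store.get(key)
        except BackendUnavailable:
            return 503, None
        if value is None:
            return 404, None
        self.local[key] = value
        return 200, value

    def write(self, key, value):
        try:
            # Every write must reach the authoritative copy.
            self.store.put(key, value)
        except BackendUnavailable:
            return 503
        self.local[key] = value
        return 200


store = BackingStore()
edge = EdgeCache(store)
edge.write("policy:a", "allow")   # also lands in this edge's local cache
store.data["policy:b"] = "deny"   # stored centrally, never read at this edge

store.available = False           # simulate the provider outage
warm = edge.read("policy:a")      # still served from the local cache
cold = edge.read("policy:b")      # needs the backend, so it fails
write_status = edge.write("policy:c", "allow")
```

In this model only `warm` succeeds during the outage. The cold read and the write both fail, which is the asymmetry the incident exposed.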

contributing factors

Why a database failure spread across the product suite.

Many products shared the same database.

Cloudflare reused Workers KV for login data, security policies, routing configuration, build assets, and AI configuration. That made those products easier to build, but it also gave them a common point of failure.

The cache was not a complete fallback.

A cached read could succeed, but a request for a value not already nearby had to contact the failed backend. Writes always needed that backend. This is why some requests worked while roughly 90% failed.

Security products chose blocking over bypass.

Access and Gateway could not safely allow requests without current identity and policy data. They denied access rather than risk ignoring a customer's security rules. That was the safer choice, but it made the outage more visible to users.
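The choice between blocking and bypass can be sketched as a single branch in an authorization check. This is an illustrative sketch only: `authorize`, `load_policy`, `PolicyUnavailable`, and the `fail_open` flag are hypothetical names, not part of any Cloudflare API.

```python
from dataclasses import dataclass


class PolicyUnavailable(Exception):
    """Raised when identity or policy data cannot be loaded."""


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


def authorize(user, resource, load_policy, fail_open=False):
    """Decide access; on a policy-store failure, deny unless fail_open is set."""
    try:
        permitted = load_policy(user)
    except PolicyUnavailable:
        if fail_open:
            # Bypass: keeps traffic flowing but ignores the customer's rules.
            return Decision(True, "policy unavailable: bypassed")
        # Fail closed: visible to users, but no rule is silently skipped.
        return Decision(False, "policy unavailable: denied")
    if resource in permitted:
        return Decision(True, "allowed by policy")
    return Decision(False, "denied by policy")


def healthy_store(user):
    # Hypothetical policy data for the example.
    return {"wiki", "billing"} if user == "alice" else set()


def failed_store(user):
    raise PolicyUnavailable("policy store unreachable")


normal = authorize("alice", "wiki", healthy_store)
closed = authorize("alice", "wiki", failed_store)
opened = authorize("mallory", "billing", failed_store, fail_open=True)
```

With `fail_open=True`, a user with no permissions at all is admitted during the outage. That is the risk the fail-closed default avoids.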

Recovery created a traffic spike.

When storage returned, many empty or stale caches requested data at the same time. Cloudflare had to restore groups of KV data gradually so the recovering backend was not overwhelmed.

lessons

What to take from this incident.

Treat a shared data service as part of every product that uses it. If one database stores authentication, configuration, routing, and assets, its outage can break all four. Document which user flows fail when the dependency is slow, read-only, or completely unavailable.

Design an explicit fallback for each critical read. Decide whether a feature may use a recent cached value, switch to a reduced feature set, or must reject the request. For security data, set a clear age limit and preserve a locally verified snapshot.
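One way to express such a fallback is a reader that serves a stored snapshot only while it is younger than a fixed age limit. The sketch below is an assumption-laden illustration: `SnapshotReader`, the 300-second limit, and the injected clock are all hypothetical, and snapshot verification (for example a signature check) is left out.

```python
import time


class BackendUnavailable(Exception):
    """Raised when the authoritative store cannot be reached."""


class SnapshotReader:
    """Serve fresh data when possible, a bounded-age snapshot otherwise."""

    def __init__(self, fetch, max_stale_seconds, clock=time.monotonic):
        self.fetch = fetch
        self.max_stale = max_stale_seconds
        self.clock = clock
        self.snapshots = {}  # key -> (value, time the value was fetched)

    def read(self, key):
        try:
            value = self.fetch(key)
        except BackendUnavailable:
            snapshot = self.snapshots.get(key)
            if snapshot is None:
                raise  # nothing local to fall back on: reject
            value, fetched_at = snapshot
            if self.clock() - fetched_at > self.max_stale:
                raise  # snapshot is older than the age limit: reject
            return value, "stale"
        self.snapshots[key] = (value, self.clock())
        return value, "fresh"


# Demonstration with a fake clock and a switchable backend (both hypothetical).
now = [0.0]
backend_up = [True]


def fetch(key):
    if not backend_up[0]:
        raise BackendUnavailable(key)
    return "policy-v1"


reader = SnapshotReader(fetch, max_stale_seconds=300, clock=lambda: now[0])
first = reader.read("policy")     # backend healthy, snapshot stored at t=0

backend_up[0] = False
now[0] = 120.0
second = reader.read("policy")    # snapshot is 120 s old, within the limit

now[0] = 600.0
try:
    reader.read("policy")
    third = "served"
except BackendUnavailable:
    third = "rejected"            # snapshot is 600 s old, past the limit
```

The age limit makes the trade-off explicit: a short outage is absorbed, while a long one still fails closed rather than serving arbitrarily old security data.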

Plan for the cache-refill surge. A backend is not fully recovered just because it accepts requests again. Restore customers and data groups in stages, limit concurrent refills, and prioritize login and other critical paths.
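A bounded, priority-ordered refill can be sketched with a fixed pool of workers draining a queue. This is a minimal sketch under stated assumptions: `refill_in_stages`, `fake_fetch`, the key names, and the limit of two concurrent requests are hypothetical and chosen only for illustration.

```python
import asyncio


async def refill_in_stages(keys_by_priority, fetch, max_concurrent):
    """Refill keys in the given order with at most max_concurrent in flight."""
    queue = asyncio.Queue()
    for key in keys_by_priority:
        queue.put_nowait(key)
    results = {}

    async def worker():
        while True:
            try:
                key = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # nothing left to refill
            results[key] = await fetch(key)

    # The worker count caps how many backend requests run at once.
    await asyncio.gather(*(worker() for _ in range(max_concurrent)))
    return results


# Demonstration: a fake backend that records concurrency and start order.
stats = {"in_flight": 0, "peak": 0}
started = []


async def fake_fetch(key):
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    started.append(key)
    await asyncio.sleep(0.01)  # stand-in for a backend round trip
    stats["in_flight"] -= 1
    return f"value-for-{key}"


# Critical login data is listed first so it is refilled first.
keys = ["login:policy", "login:idp", "gateway:rules", "pages:assets", "ai:routing"]
results = asyncio.run(refill_in_stages(keys, fake_fetch, max_concurrent=2))
```

The backend never sees more than `max_concurrent` requests at once, and keys are requested in the order given, so placing login data first puts it at the front of the recovery.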

sources

Read the sources.

Cloudflare service outage June 12, 2025

blog.cloudflare.com ↗
