The Storage Failure That Broke 90% of Workers KV Requests
How a third-party storage outage cut off cold reads and writes, why cached reads survived while shared product dependencies failed, and how cache repopulation constrained recovery.
Workers KV failed 90.22% of its requests during Cloudflare's June 12, 2025 outage. Every Workers AI inference request failed. But requests served from Workers KV's cache continued returning successful or expected responses.
Cloudflare called Workers KV 'coreless,' meaning it ran independently at each location worldwide, but it still depended on a central data store. That central store was the single place where the real, up-to-date copy of every stored value was kept. Cloudflare's later documentation, dated 2026, says writes go to central stores instead of every location's cache. A cold read happens when the local cache does not hold the requested value; the request then traverses regional and central cache tiers before reaching those stores. That documentation explains the product model, but it does not establish the incident's exact storage topology.
Workers KV carried configuration, authentication, and asset-delivery state for many Cloudflare products. Access stored application policy and user identity information there. Workers AI used it to distribute routing and configuration globally. That level of reuse meant a single Workers KV failure would affect all of those unrelated products at the same time.
A third-party cloud provider outage disrupted part of Workers KV's underlying storage infrastructure. The central-store failure stopped cold reads and writes across KV namespaces used by Cloudflare services. Uncached requests that needed origin storage returned HTTP 500 or 503 errors. Edge caches contained the failure only for keys they already held.
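The containment described above can be illustrated with a minimal Python sketch. It is not Cloudflare's implementation: `CentralStore`, `EdgeLocation`, and the status codes returned here are hypothetical, and the regional and central cache tiers are collapsed into a single local cache with no expiry.

```python
class StoreUnavailable(Exception):
    """Raised when the hypothetical central store cannot be reached."""


class CentralStore:
    """Stand-in for the central data store that holds the source of truth."""

    def __init__(self, data):
        self.data = dict(data)
        self.available = True

    def get(self, key):
        if not self.available:
            raise StoreUnavailable(key)
        return self.data.get(key)

    def put(self, key, value):
        if not self.available:
            raise StoreUnavailable(key)
        self.data[key] = value


class EdgeLocation:
    """One location with a local cache in front of the central store."""

    def __init__(self, store):
        self.store = store
        self.cache = {}

    def read(self, key):
        """Return (status, value). Cached keys never touch the store."""
        if key in self.cache:
            return 200, self.cache[key]
        try:
            value = self.store.get(key)  # cold read
        except StoreUnavailable:
            return 503, None
        if value is None:
            return 404, None
        self.cache[key] = value
        return 200, value

    def write(self, key, value):
        """Writes always need the central store, so they fail with it."""
        try:
            self.store.put(key, value)
        except StoreUnavailable:
            return 503
        self.cache[key] = value
        return 200


if __name__ == "__main__":
    store = CentralStore({"policy": "allow", "route": "us-east"})
    edge = EdgeLocation(store)
    edge.read("policy")            # warms the local cache
    store.available = False        # simulate the central-store outage
    print(edge.read("policy"))     # cached key still succeeds
    print(edge.read("route"))      # cold read fails
    print(edge.write("policy", "deny"))  # write fails
```

In this simplified model the outage is invisible for any key a location already holds, and total for every other key and for every write, which matches the split the incident report describes.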
Workers KV had removed one storage provider while re-architecting its backend, including a migration toward Cloudflare R2. Cloudflare said the transition still had a coverage gap. The provider outage exposed that unfinished resilience work, but it did not create the dependency concentration by itself.
The outage lasted two hours and 28 minutes and affected customers using the impacted services globally. Access failed every identity-based login across self-hosted, SaaS, and infrastructure applications. Pages error rates peaked near 100%. No Pages build completed during the incident window. Cloudflare reported no data loss and said the incident was not a security event.
Detection began at the product edge, with WARP registration failures rather than a published storage-layer signal. Access error-rate alerts and service-level objective (SLO) alerts then triggered across multiple products, showing how broadly the failure had spread. An SLO is a reliability target that a team commits to, such as keeping 99.9% of requests successful. Correlating those symptoms around Workers KV turned separate service incidents into one dependency incident. The published timeline names downstream alerts, but it does not identify a storage-layer alert or an initial responder hypothesis.
Priority rose in two steps: to P1 once responders identified Workers KV as the shared dependency, then to P0 as the full cross-product severity became clear. Because the outage duration was uncertain, Access prepared a release backed by another datastore as a contingency. Continued preparation showed that this release was not an immediate failover path. That constraint made product-level degradation and load shedding the available near-term controls.
Gateway normally failed closed when Workers KV could not provide identity or device-posture state. During the incident, it degraded rules that referenced that state, trading some identity-aware rule behavior for reduced dependence on unavailable storage. Access made a different trade: it rejected identity-related requests to reduce pressure on Workers KV. Cloudflare did not publish how much traffic these actions restored or whether load shedding accelerated recovery.
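The difference between failing closed and degrading identity-dependent rules can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `Rule` shape, the `evaluate` function, the first-match semantics, and the default action are hypothetical and do not describe Gateway's actual rule engine.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


class StateUnavailable(Exception):
    """Raised when identity state cannot be fetched from the backing store."""


@dataclass
class Rule:
    action: str                   # "allow" or "block"
    group: Optional[str] = None   # None means the rule needs no identity state


def evaluate(rules: Sequence[Rule],
             lookup_group: Callable[[], str],
             degraded: bool,
             default: str = "allow") -> str:
    """Return the action of the first rule that applies to the request."""
    for rule in rules:
        if rule.group is None:
            return rule.action
        if degraded:
            # Degraded mode: skip identity rules instead of calling the store.
            continue
        try:
            group = lookup_group()
        except StateUnavailable:
            # Normal mode: identity state is required, so fail closed.
            return "block"
        if group == rule.group:
            return rule.action
    return default


def store_down() -> str:
    raise StateUnavailable()


if __name__ == "__main__":
    rules = [Rule("block", group="contractors"), Rule("allow")]
    print(evaluate(rules, store_down, degraded=False))  # fails closed
    print(evaluate(rules, store_down, degraded=True))   # identity rule skipped
```

The sketch makes the cost of degradation visible: with the identity rule skipped, a request that the first rule would have blocked falls through to the later allow rule.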
Turnstile used a kill switch so users were not blocked while its dependencies were unavailable. The switch also allowed valid tokens to be redeemed more than once, creating potential replay exposure. That choice preserved user passage by temporarily changing security semantics.
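A minimal Python sketch shows why bypassing the single-use check creates replay exposure. The token format, the signing key, and the `Verifier` class are hypothetical and are not Turnstile's design; the sketch assumes only that a token's signature can be checked without shared state while its single-use status cannot.

```python
import hashlib
import hmac

SECRET = b"demo-secret"  # hypothetical signing key, for illustration only


def issue(challenge_id: str) -> str:
    """Create a signed token of the form '<challenge_id>.<signature>'."""
    sig = hmac.new(SECRET, challenge_id.encode(), hashlib.sha256).hexdigest()
    return f"{challenge_id}.{sig}"


def signature_ok(token: str) -> bool:
    """Stateless check: does the signature match the challenge id?"""
    challenge_id, _, sig = token.rpartition(".")
    expected = hmac.new(SECRET, challenge_id.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)


class Verifier:
    def __init__(self):
        self.redeemed = set()        # stands in for shared redemption state
        self.store_available = True
        self.kill_switch = False

    def redeem(self, token: str) -> bool:
        if not signature_ok(token):
            return False
        if self.kill_switch:
            # Skip the single-use check: users pass, but replays pass too.
            return True
        if not self.store_available:
            # Cannot prove the token is unused, so the user is blocked.
            return False
        if token in self.redeemed:
            return False
        self.redeemed.add(token)
        return True


if __name__ == "__main__":
    verifier = Verifier()
    token = issue("challenge-1")
    verifier.store_available = False
    print(verifier.redeem(token))   # blocked while state is unavailable
    verifier.kill_switch = True
    print(verifier.redeem(token))   # accepted
    print(verifier.redeem(token))   # accepted again: replay
```

Under these assumptions, forged tokens are still rejected while the switch is on; what is lost is only the guarantee that each valid token is accepted once.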
Recovery remained gated by the external storage dependency. Restoring the backend did not immediately clear the errors, because services around the world were all trying to refill their caches at once. KV calls and service-level objectives recovered in layers rather than all at once. Cloudflare confirmed that no stored data was lost, making this a service availability failure rather than a data loss event. Cloudflare proposed tooling to bring namespaces back online gradually, to avoid all services overwhelming the backend at the same time.
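Gradual namespace restoration could take many forms; one possibility is enabling namespaces in growing waves and pausing when the backend shows strain. The Python sketch below is an assumption-laden illustration, not the tooling Cloudflare proposed: `staged_reenable`, `probe_error_rate`, the wave sizes, and the error threshold are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def staged_reenable(namespaces: Sequence[str],
                    probe_error_rate: Callable[[List[str]], float],
                    wave_sizes: Sequence[int] = (1, 2, 4),
                    max_error_rate: float = 0.05) -> Tuple[List[str], List[str]]:
    """Enable namespaces in growing waves.

    After each wave, probe_error_rate(enabled) reports the backend error
    rate with that set enabled. If it exceeds max_error_rate, the latest
    wave is rolled back and the rollout pauses.
    Returns (enabled, still_disabled).
    """
    enabled: List[str] = []
    remaining = list(namespaces)
    wave = 0
    while remaining:
        # Reuse the last wave size once the schedule is exhausted.
        size = wave_sizes[min(wave, len(wave_sizes) - 1)]
        batch, remaining = remaining[:size], remaining[size:]
        enabled.extend(batch)
        if probe_error_rate(enabled) > max_error_rate:
            del enabled[-len(batch):]      # roll back the latest wave
            remaining = batch + remaining  # keep it queued for a later retry
            break
        wave += 1
    return enabled, remaining


if __name__ == "__main__":
    names = ["ns-a", "ns-b", "ns-c", "ns-d", "ns-e", "ns-f", "ns-g"]

    def fragile_backend(enabled: List[str]) -> float:
        # Hypothetical backend that struggles past three namespaces.
        return 0.5 if len(enabled) > 3 else 0.0

    print(staged_reenable(names, fragile_backend))
```

The design choice is to bound how much cache-refill traffic reaches the backend at any moment, at the cost of a slower return to full service for namespaces late in the queue.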
Cloudflare committed to removing Workers KV's dependency on any single storage provider. It also committed to reducing how many products fail together when one dependency goes down, and to building tooling that restores KV namespaces gradually. By August 2025, Cloudflare reported completing a hybrid storage-provider rollout with improved redundancy. The published sources do not verify completion of the product-level work or the progressive namespace controls.
The disclosed causal chain ends at an unnamed provider. Its upstream failure mechanism remains outside the published record. Cloudflare also left some unexpected CDN rerouting behavior under investigation. Neither gap changes the proven Workers KV mechanism, but both limit broader claims about the outage.
Even a worldwide service has a single point of failure if uncached reads and writes all depend on one central store. When switching storage providers, confirm that the new setup tolerates the loss of any single provider before you decommission the old one. For each product that reads from the shared store, document its data needs, its behavior during an outage, and its ability to drop traffic. Then plan for the cache-refill surge that follows recovery: when many services try to refill their empty caches at once, the resulting requests can overwhelm a backend that is still stabilizing.
From the first signal to all-clear in 2h 28m.
The WARP team saw new device registrations fail, began investigating, and declared an incident.
Access received a rapid error-rate alert while multiple services breached their SLO targets.
Cloudflare combined the service incidents after identifying Workers KV unavailability and raised the priority to P1.
Responders upgraded the incident from P1 to P0 as the severity became clear.
Gateway began degrading rules that referenced identity or device-posture state to remove Workers KV dependencies.
Access and Device Posture dropped identity and posture requests until the third-party service returned.
Services began recovering, but cache repopulation still produced errors and infrastructure rate limits.
Cloudflare's service-level objectives returned to pre-incident levels and all affected services returned to normal operation.