FM-025 · Cloudflare · 2025-06-12 · impact 2h 28m · SEV-1

The Storage Failure That Broke 90% of Workers KV Requests

How a third-party storage outage cut off cold reads and writes, why cached reads survived while shared product dependencies failed, and how cache repopulation constrained recovery.

storage kv third-party edge cache vendor

Based on 3 sources

case study

Workers KV failed 90.22% of its requests during Cloudflare's June 12, 2025 outage. Every Workers AI inference request failed. But requests served from Workers KV's cache continued returning successful or expected responses.

Cloudflare called Workers KV 'coreless,' meaning it ran independently at each location worldwide, but it still depended on a central data store. That central store was the single place where the real, up-to-date copy of every stored value was kept. Cloudflare's later documentation says writes go to central stores instead of every location's cache. A cold read happens when the local cache does not have the requested value and the request must go to central storage. A cold read traverses regional and central cache tiers before reaching those stores. Those 2026 documents explain the product model, but they do not establish the incident's exact storage topology.

Workers KV carried configuration, authentication, and asset-delivery state for many Cloudflare products. Access stored application policy and user identity information there. Workers AI used it to distribute routing and configuration globally. That level of reuse meant a single Workers KV failure would affect all of those unrelated products at the same time.

A third-party cloud provider outage disrupted part of Workers KV's underlying storage infrastructure. The central-store failure stopped cold reads and writes across KV namespaces used by Cloudflare services. Uncached requests that needed origin storage returned HTTP 500 or 503 errors. Edge caches contained the failure only for keys they already held.
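The split between surviving cached reads and failing cold reads and writes can be illustrated with a minimal sketch. It assumes a single in-memory edge cache in front of one central store; `CentralStore`, `EdgeCache`, and the key names are hypothetical and do not describe Cloudflare's actual implementation or cache tiers.

```python
class CentralStoreUnavailable(Exception):
    """Raised when the central store cannot serve a request."""


class CentralStore:
    """Hypothetical source of truth; `available` simulates the provider outage."""

    def __init__(self, data):
        self.data = dict(data)
        self.available = True

    def get(self, key):
        if not self.available:
            raise CentralStoreUnavailable(key)
        return self.data.get(key)

    def put(self, key, value):
        if not self.available:
            raise CentralStoreUnavailable(key)
        self.data[key] = value


class EdgeCache:
    """Read-through cache: hits never touch the central store."""

    def __init__(self, store):
        self.store = store
        self.cache = {}

    def read(self, key):
        if key in self.cache:               # warm read: served locally
            return 200, self.cache[key]
        try:
            value = self.store.get(key)     # cold read: must reach central storage
        except CentralStoreUnavailable:
            return 503, None
        if value is None:
            return 404, None
        self.cache[key] = value
        return 200, value

    def write(self, key, value):
        try:
            self.store.put(key, value)      # writes always go to the central store
        except CentralStoreUnavailable:
            return 503
        self.cache[key] = value
        return 200


store = CentralStore({"policy": "allow", "route": "us-east"})
edge = EdgeCache(store)
edge.read("policy")           # warms the cache before the outage
store.available = False       # simulate the provider outage

warm = edge.read("policy")    # key already held at the edge
cold = edge.read("route")     # key never cached
write_status = edge.write("policy", "deny")
```

In this sketch the warm read still succeeds while the cold read and the write both return 503, which mirrors the containment boundary described above: the cache protects only the keys it already holds.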

Workers KV had removed one storage provider while re-architecting its backend, including a migration toward Cloudflare R2. Cloudflare said the transition still had a coverage gap. The provider outage exposed that unfinished resilience work, but it did not create the dependency concentration by itself.

The outage lasted two hours and 28 minutes and affected customers using the impacted services globally. Access failed every identity-based login across self-hosted, SaaS, and infrastructure applications. Pages error rates peaked near 100%. No Pages build completed during the incident window. Cloudflare reported no data loss and said the incident was not a security event.

Detection began at the product edge, with WARP registration failures rather than a published storage-layer signal. Access error-rate alerts and service-level objective (SLO) alerts then triggered across multiple products, showing how broadly the failure had spread. An SLO is a reliability target that a team commits to, such as keeping 99.9% of requests successful. Correlating those symptoms around Workers KV turned separate service incidents into one dependency incident. The published timeline names downstream alerts, but it does not identify a storage-layer alert or an initial responder hypothesis.

Priority rose in two steps: to P1 once responders traced the separate incidents to Workers KV, then to P0 as the full cross-product severity became clear. Because the outage duration was uncertain, Access prepared a release backed by another datastore as a contingency. That preparation showed the alternate-backed release was not an immediate failover path, which left product-level degradation and load shedding as the available near-term controls.

Gateway normally failed closed when Workers KV could not provide identity or device-posture state. During the incident it degraded rules that referenced that state, trading some identity-aware rule behavior for reduced dependence on unavailable state. Access made a different trade: it rejected identity-related work to reduce pressure on Workers KV. Cloudflare did not publish how much traffic these actions restored or whether load shedding accelerated recovery.

Turnstile used a kill switch so users were not blocked while its dependencies were unavailable. The switch also allowed valid tokens to be redeemed more than once, creating potential replay exposure. That choice preserved user passage by temporarily changing security semantics.
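A minimal sketch of how such a kill switch changes token semantics follows. It assumes token validity can be checked without shared state while the single-use record lives in the unavailable store; `RedemptionStore`, `verify`, and the token values are hypothetical and are not Turnstile's API.

```python
class StateUnavailable(Exception):
    """Raised when the shared redemption state cannot be reached."""


class RedemptionStore:
    """Hypothetical shared state recording which tokens were already redeemed."""

    def __init__(self):
        self.redeemed = set()
        self.available = True

    def redeem_once(self, token):
        if not self.available:
            raise StateUnavailable()
        if token in self.redeemed:
            return False
        self.redeemed.add(token)
        return True


def verify(token, valid_tokens, store, kill_switch=False):
    """Return True if the request may pass."""
    if token not in valid_tokens:       # validity check (assumed not to need shared state)
        return False
    if kill_switch:
        return True                     # single-use check skipped: replay becomes possible
    try:
        return store.redeem_once(token)
    except StateUnavailable:
        return False                    # fail closed: legitimate users are blocked


valid = {"tok-1"}
store = RedemptionStore()
store.available = False                 # dependency outage

blocked = verify("tok-1", valid, store)                    # fail-closed behavior
first = verify("tok-1", valid, store, kill_switch=True)    # user passes
replay = verify("tok-1", valid, store, kill_switch=True)   # same token passes again
forged = verify("tok-x", valid, store, kill_switch=True)   # invalid token still rejected
```

The sketch makes the trade explicit: with the switch on, valid tokens pass and can be replayed, while invalid tokens are still rejected.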

Recovery remained gated by the external storage dependency. Restoring the backend did not immediately clear the errors, because services around the world were all trying to refill their caches at once. KV calls and service-level objectives recovered in layers rather than all at once. Cloudflare confirmed that no stored data was lost, making this a service availability failure rather than a data loss event. Cloudflare proposed tooling to bring namespaces back online gradually, to avoid all services overwhelming the backend at the same time.

Cloudflare committed to removing Workers KV's dependency on any single storage provider. It also began reducing how many products fail together when one dependency goes down, and began building tooling to restore KV namespaces gradually. By August 2025, Cloudflare reported completing a hybrid storage-provider rollout with improved redundancy. The dossier sources do not verify completion of the product-level work or progressive namespace controls.

The disclosed causal chain ends at an unnamed provider. Its upstream failure mechanism remains outside the published record. Cloudflare also left some unexpected CDN rerouting behavior under investigation. Neither gap changes the proven Workers KV mechanism, but both limit broader claims about the outage.

Even a worldwide service has a single point of failure if uncached reads and writes all depend on one central store. When switching storage providers, confirm the new setup can survive losing the old provider completely before you decommission the old backup. For each product that reads from the shared store, document its data needs, its behavior during an outage, and its ability to drop traffic. Then plan for the cache-refill surge that follows when a recovered backend comes back online. When many services all try to refill their empty caches at once, that surge of requests can overwhelm a backend that is still stabilizing.

timeline · UTC

From the first signal at 17:52 UTC to all-clear at 20:28 UTC, with 2h 28m of reported impact.

17:52 UTC

WARP registrations fail

The WARP team saw new device registrations fail, began investigating, and declared an incident.

18:05 UTC

Product and SLO alerts converge

Access received a rapid error-rate alert while multiple services breached their SLO targets.

18:06 UTC

Responders identify Workers KV as the shared cause

Cloudflare combined the service incidents after identifying Workers KV unavailability and raised the priority to P1.

18:21 UTC

Severity reaches P0

Responders upgraded the incident from P1 to P0 as the severity became clear.

19:09 UTC

Gateway degrades KV-backed rules

Gateway began degrading rules that referenced identity or device-posture state to remove Workers KV dependencies.

19:32 UTC

Access sheds KV load

Access and Device Posture dropped identity and posture requests until the third-party service returned.

20:23 UTC

Storage returns under cache pressure

Services began recovering, but cache repopulation still produced errors and infrastructure rate limits.

20:28 UTC

Service levels return to baseline

Cloudflare's service-level objectives returned to pre-incident levels and all affected services returned to normal operation.

lessons

What to take away.

Treat removal of a storage provider as a reliability migration: prove the replacement path under provider loss before retiring existing redundancy.

This applies when changing storage providers, consistency models, or data-residency architecture. The telltale condition is an old provider being removed while replacement resilience is still in flight. Verification should exercise cold reads, writes, failover, and product dependencies under full provider loss. Keeping overlap costs money and can complicate consistency, so the migration needs explicit exit criteria rather than indefinite duplication.

vendor_blast_radius_audit
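One way to turn the exit criterion in this lesson into a repeatable check is sketched below. It assumes a simple store that writes to every reachable provider and reads from the first reachable one; `Provider`, `ReplicatedStore`, and `survives_loss_of` are hypothetical names, and a real system would also need reconciliation for providers that miss writes while they are down.

```python
class ProviderDown(Exception):
    """Raised when a provider (or every provider) is unreachable."""


class Provider:
    """Hypothetical storage backend with an outage toggle."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True

    def put(self, key, value):
        if not self.up:
            raise ProviderDown(self.name)
        self.data[key] = value

    def get(self, key):
        if not self.up:
            raise ProviderDown(self.name)
        return self.data[key]


class ReplicatedStore:
    """Writes to every reachable provider; reads from the first one holding the key."""

    def __init__(self, providers):
        self.providers = providers

    def put(self, key, value):
        ok = 0
        for p in self.providers:
            try:
                p.put(key, value)
                ok += 1
            except ProviderDown:
                continue
        if ok == 0:
            raise ProviderDown("all providers")

    def get(self, key):
        for p in self.providers:
            try:
                return p.get(key)
            except (ProviderDown, KeyError):
                continue
        raise ProviderDown("all providers")


def survives_loss_of(store, lost):
    """Exit criterion: cold reads and writes still succeed with `lost` fully down."""
    store.put("pre", "v1")              # written while everything is healthy
    lost.up = False
    try:
        store.put("during", "v2")
        return store.get("pre") == "v1" and store.get("during") == "v2"
    except ProviderDown:
        return False
    finally:
        lost.up = True                  # restore the provider after the drill


old, new = Provider("old"), Provider("new")
dual_ok = survives_loss_of(ReplicatedStore([old, new]), old)
single_ok = survives_loss_of(ReplicatedStore([old]), old)
```

The dual-provider configuration passes the drill and the single-provider one fails it, which is the kind of evidence the lesson asks for before existing redundancy is retired.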

Maintain and exercise a dependency matrix that maps each product's configuration, identity, routing, and asset path to shared state services, including expected behavior when each dependency is unavailable.

This applies when a common platform primitive is reused across many products. The telltale sign is one state service carrying unrelated control data and turning a backend fault into identity, routing, authentication, asset-delivery, and inference failures. Matrix exercises should validate fallback behavior per product rather than assuming backend redundancy alone bounds impact. The cost is keeping ownership and dependency data current as products evolve, so critical synchronous paths should be prioritized.

configuration_matrix_testing
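A dependency matrix of the kind this lesson describes could be kept as plain data and queried during exercises. The sketch below is illustrative only: the rows, the `on_outage` labels, and the `example-product` and `example-sql-db` entries are assumptions, not Cloudflare's documented dependency data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    product: str
    path: str          # configuration, identity, routing, or asset
    service: str       # shared state service the path relies on
    on_outage: str     # expected behavior when the service is unavailable


# Illustrative rows; the on_outage values are assumptions for the sketch.
MATRIX = [
    Dependency("Access", "identity", "workers-kv", "fail_closed"),
    Dependency("Gateway", "identity", "workers-kv", "degrade_rules"),
    Dependency("Workers AI", "routing", "workers-kv", "unknown"),
    Dependency("Pages", "asset", "workers-kv", "unknown"),
    Dependency("example-product", "configuration", "example-sql-db", "serve_stale"),
]


def blast_radius(matrix, service):
    """Products affected if `service` becomes unavailable."""
    return sorted({d.product for d in matrix if d.service == service})


def undefined_behavior(matrix, service):
    """Rows that still lack a defined, testable outage behavior."""
    return [d for d in matrix if d.service == service and d.on_outage == "unknown"]


affected = blast_radius(MATRIX, "workers-kv")
gaps = [d.product for d in undefined_behavior(MATRIX, "workers-kv")]
```

Keeping the matrix as data lets an exercise start from the `gaps` list, so the products without a defined outage behavior are the first ones tested.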

Restore cache-backed namespaces in controlled cohorts with rate limits and priority ordering instead of releasing every cold cache against a recovered origin at once.

This applies when many services repopulate from one recovering source of truth. The telltale condition is backend health returning while error rates and infrastructure limits persist under cache refill demand. Progressive enablement caps concurrency and prioritizes critical namespaces, reducing restart fan-in. The tradeoff is slower restoration for lower-priority traffic and the operational complexity of maintaining ordering, pause, and rollback controls.

staged_emergency_rollout
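The cohort-based restoration in this lesson can be sketched as a small planner plus a pausable rollout loop. Everything here is an assumption for illustration: the namespace names, the refill estimates, the origin budget, and the `origin_healthy` callback are hypothetical, not Cloudflare's tooling.

```python
from dataclasses import dataclass


@dataclass
class Namespace:
    name: str
    priority: int       # lower number = restored earlier
    refill_rps: int     # estimated cold-read demand while the cache refills


def plan_cohorts(namespaces, origin_budget_rps):
    """Group namespaces into cohorts whose combined refill demand fits the budget."""
    cohorts, current, load = [], [], 0
    for ns in sorted(namespaces, key=lambda n: (n.priority, n.name)):
        # A namespace larger than the budget is assumed to be rate-limited to it.
        demand = min(ns.refill_rps, origin_budget_rps)
        if current and load + demand > origin_budget_rps:
            cohorts.append(current)
            current, load = [], 0
        current.append(ns.name)
        load += demand
    if current:
        cohorts.append(current)
    return cohorts


def restore(cohorts, origin_healthy):
    """Enable cohorts in order; pause as soon as the origin health check fails."""
    enabled = []
    for cohort in cohorts:
        if not origin_healthy():
            break                       # pause: remaining cohorts stay disabled
        enabled.extend(cohort)
    return enabled


namespaces = [
    Namespace("assets", priority=2, refill_rps=500),
    Namespace("auth-policy", priority=0, refill_rps=400),
    Namespace("ai-routing", priority=1, refill_rps=300),
    Namespace("analytics", priority=3, refill_rps=2000),
]
cohorts = plan_cohorts(namespaces, origin_budget_rps=1000)

checks = iter([True, True, False])      # origin degrades before the last cohort
enabled = restore(cohorts, lambda: next(checks))
```

In this run the two highest-priority namespaces share the first cohort, and the rollout pauses before the lowest-priority cohort when the health check fails. That is the slower-but-bounded behavior the tradeoff above describes.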

Predefine product-level load shedding for shared-dependency outages, including activation thresholds, ownership, customer-visible loss, and rollback conditions.

This applies when shared failover may not be ready quickly enough to protect every dependent product. The telltale condition is product teams dropping requests or degrading dependency-backed features during the incident. Predefined controls make the sacrificed behavior explicit and give responders a bounded containment option. The tradeoff is deliberate loss of functionality or policy context, so each control needs a named owner, measurable trigger, customer-impact description, and tested restoration path.

break_glass_controls
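A predefined control of the kind this lesson calls for could be recorded as data with an explicit trigger and rollback condition. The sketch below uses hypothetical thresholds, owner, and wording; the separate activation and restoration rates are an assumed design choice (hysteresis) so the control does not flap while the dependency recovers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShedControl:
    name: str
    owner: str
    dependency: str
    activate_error_rate: float    # activate at or above this dependency error rate
    restore_error_rate: float     # roll back at or below this rate
    customer_impact: str


def next_state(control, active, error_rate):
    """Return whether the control should be active given the observed error rate."""
    if not active and error_rate >= control.activate_error_rate:
        return True
    if active and error_rate <= control.restore_error_rate:
        return False
    return active


# All values below are hypothetical.
drop_identity = ShedControl(
    name="drop-identity-requests",
    owner="access-oncall",
    dependency="workers-kv",
    activate_error_rate=0.50,
    restore_error_rate=0.05,
    customer_impact="Identity-based logins are rejected until the dependency recovers.",
)

states = []
active = False
for rate in [0.10, 0.90, 0.30, 0.04]:
    active = next_state(drop_identity, active, rate)
    states.append(active)
```

The control stays active at a 30% error rate and releases only once the rate falls below the restoration threshold, so the rollback condition is as explicit as the trigger.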

Define degraded modes as explicit security behaviors and validate them end to end: which checks fail closed, which state is bypassed, and which replay or authorization risk is temporarily accepted.

This applies when identity, policy, or abuse-prevention systems depend on shared state. A telltale sign is that dependency loss forces a choice between blocking legitimate users and relaxing a validation step. Behavior-level drills should verify the actual request outcome, not only that a kill switch toggles. Degradation can preserve availability, but it must be scoped and reversible because it may weaken enforcement or token semantics.

behavior_level_validation

Pair downstream SLO alerts with dependency-level cold-read and write probes plus correlation metadata so responders can identify a shared state-service failure before merging product incidents.

This applies when many products synchronously depend on one state service. A telltale condition is simultaneous product alerts that reveal impact but not the shared failing layer. Dependency-tagged alerts and synthetic probes can shorten convergence while product SLOs preserve user-impact visibility. They add probe traffic and alert cardinality, and the public record does not establish that Cloudflare lacked internal storage telemetry, so this is a bounded recommendation rather than a finding of missing monitoring.

semantic_correctness_monitoring
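The correlation step in this lesson can be sketched as grouping dependency-tagged alerts inside a sliding time window. The alert tuples, the five-minute window, and the source names below are illustrative assumptions and are not taken from the published timeline.

```python
from collections import defaultdict


def correlate(alerts, window_s=300, min_sources=2):
    """Find dependencies with alerts from several sources inside one time window.

    Each alert is (timestamp_s, source, dependency_tag), where a source is a
    product or a synthetic probe. Returns the largest qualifying source set
    seen for each dependency.
    """
    by_dep = defaultdict(list)
    for ts, source, dep in alerts:
        by_dep[dep].append((ts, source))

    suspects = {}
    for dep, events in by_dep.items():
        events.sort()
        start = 0
        for end in range(len(events)):
            # Shrink the window until it spans at most window_s seconds.
            while events[end][0] - events[start][0] > window_s:
                start += 1
            sources = {s for _, s in events[start:end + 1]}
            if len(sources) >= min_sources and len(sources) > len(suspects.get(dep, [])):
                suspects[dep] = sorted(sources)
    return suspects


# Illustrative alerts with made-up offsets in seconds.
alerts = [
    (0, "WARP", "workers-kv"),
    (120, "Access", "workers-kv"),
    (150, "synthetic-cold-read-probe", "workers-kv"),
    (200, "Pages", "build-queue"),
    (900, "Access", "workers-kv"),
]
suspects = correlate(alerts)
```

Here the probe and two product alerts converge on the same dependency tag within one window, while the unrelated single-source alert is ignored. The product alerts still carry the user-impact signal; the tag only shortens the path to the shared layer.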

sources

Read the sources.

Cloudflare service outage June 12, 2025 (Cloudflare)

How KV works (Cloudflare)

Workers KV completes hybrid storage provider rollout for improved performance, fault-tolerance (Cloudflare)
