FM-023Google Cloud2025-06-12impact 3hSEV-1

A quota policy update crashed Google Service Control globally.

A new quota-check path had shipped region by region, but it was only exercised after policy metadata changed. Blank fields in a globally replicated policy hit a null pointer, crashing Service Control tasks across regions.

configquotaspannernull-pointerpolicyglobal-outagegoogle-cloud

summary

Every Google API request passed through management and control planes that checked authorization, policy, and quota before the request reached the product endpoint. The core binary in that policy-check path was Service Control. As long as Service Control could read policy metadata and make a decision, Google Cloud and Workspace APIs could keep serving external requests.

The code that failed had already rolled out region by region. That sounds safe, but the dangerous path needed a later policy change before it ran. On June 12, 2025, Google inserted a quota policy update into regional Spanner tables. The metadata replicated globally within seconds, and some fields were blank in a way the new Service Control path did not handle.

When regional Service Control deployments evaluated the policy, the blank fields triggered a null pointer and the binaries entered crash loops. Google identified the root cause within about 10 minutes and prepared a red button to disable the serving path. Smaller regions recovered first after the bypass rolled out.

Recovery in larger regions showed a second failure mode. As Service Control tasks restarted in us-central1, they put herd pressure on the underlying Spanner table. Google throttled task creation and routed traffic to multi-regional databases to reduce load. A binary can pass regional rollout without incident, then fail globally when replicated metadata finally exercises the dormant path — because the path's first real execution waits on the data, not on the deploy.

This datastore metadata gets replicated almost instantly globally to manage quota policies for Google Cloud and our customers.// Google Cloud incident report, June 2025

timeline · UTC

From the first signal to all-clear in 3h.

10:45 PDT

Policy update inserted

A policy change was inserted into regional Spanner tables used by Service Control. The metadata replicated globally within seconds.

10:49 PDT

Incident begins

Service Control began crashing as the new quota path processed policy data with unintended blank fields. Google reported increased 503 errors across affected products.

~10:59 PDT

Root cause identified

Within about 10 minutes, engineers identified the failing quota-check path and began preparing a red-button mitigation.

~11:29 PDT

Red-button mitigation rolls out

The serving path was disabled across regions. Smaller regions began recovering first, but large regions still had dependency pressure.

13:16 PDT

Most infrastructure recovered

Google reported recovery in all regions except us-central1, where the quota policy database had been overloaded by the restart herd.

13:49 PDT

Incident ends

Service Control and API serving recovered across all regions. Some product-specific residual backlogs continued afterward.

root cause

A dormant quota path met globally replicated bad data.

The immediate cause was invalid quota policy metadata containing unintended blank fields. When Service Control evaluated the policy, a quota-check path added in May dereferenced a null value and crashed. Because quota policy metadata replicated globally, the same failure was triggered across regional Service Control deployments.

The deeper cause was a mismatch between rollout safety and data activation. The binary had rolled out region by region, but the failing code path was not exercised until policy data changed. The feature was not protected by a feature flag, error handling was insufficient, and large-region recovery lacked enough randomized backoff to prevent dependency overload.

contributing factors

What turned invalid metadata into a global API outage.

The release did not exercise the failing path.

The code moved through regional rollout without error because it needed a later policy change to trigger the new quota logic. Deploy validation proved the binary could run, not that the feature could process production metadata.

Policy metadata replicated globally within seconds.

Quota management was intentionally global. That design made policy updates fast and consistent, but it also gave invalid metadata a near-immediate worldwide blast radius.

The emergency switch existed but still needed rollout.

A red button was available to disable the serving path, but it was ready roughly 25 minutes after the incident started and completed around 40 minutes from start. During that window, API requests continued failing.

Large-region recovery created a herd effect.

As Service Control tasks restarted in us-central1, they overloaded their underlying Spanner dependency. Missing randomized exponential backoff made restart traffic part of the recovery problem.

lessons

What to take from this incident.

Canary the data path, not only the binary.A feature is not live-tested until production-shaped data exercises it. Roll out policy, schema, and metadata changes through the same staged gates as code.

Feature-flag new policy checks before global metadata uses them.A red button is useful after failure, but a feature flag lets teams expose the path gradually by project, region, and internal tenant before broad activation.

Build restart backoff into every regional control plane.Recovery should assume many tasks restart at once. Use randomized exponential backoff, admission control, and dependency-aware throttling so a recovering service does not overload its own datastore.

sources

Read the original.

Google Cloud Service Health incident ow5i3PPK96RduMcb1SsW

status.cloud.google.com ↗

← previous

FM-022 · The Bot File That Crashed Cloudflare's Proxy

FM-024 · Azure Front Door config crashes edge sites