FM-014 · Google Cloud · 2019-06-02 · impact 4h 25m · SEV-1

An automation bug deschedules Google's network control plane.

How a maintenance event in one datacenter, combined with a bug that let automation deschedule multiple independent network control plane clusters across physical locations, took down Google Cloud, G Suite, and consumer products together, and why the recovery had to be debugged and rolled out through the same congested network it was trying to fix.

networking · automation · cloud

Google's global backbone served both cloud customers and consumer products. Google Cloud Platform, G Suite, YouTube, Gmail, and Drive shared the same network fabric. The control plane managed routes, congestion, and traffic prioritisation; the data plane forwarded packets according to the last configuration it had received. The design could tolerate short control-plane absences by failing static — continuing to forward based on the last good state. If the control plane stayed absent too long or disappeared in too many places, BGP routes would withdraw, capacity to affected regions would shrink, and traffic prioritisation would degrade when customers needed it most.
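As a rough illustration of the fail-static idea, and not a claim about Google's implementation, the sketch below (Go, with hypothetical types and names) keeps forwarding on the last configuration the control plane pushed and treats prolonged control-plane absence as a budget that can run out.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// RoutingConfig stands in for the state the control plane pushes:
// routes, congestion parameters, traffic classes. Hypothetical type.
type RoutingConfig struct {
	Version int
}

// FailStaticForwarder keeps forwarding on the last good config while
// the control plane is absent, but only within a fixed time budget.
type FailStaticForwarder struct {
	mu       sync.Mutex
	lastGood RoutingConfig
	lastSeen time.Time
	budget   time.Duration // how long fail-static is assumed to be safe
}

// Push is called by the control plane with fresh configuration.
func (f *FailStaticForwarder) Push(cfg RoutingConfig) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.lastGood = cfg
	f.lastSeen = time.Now()
}

// Lookup returns the config used to forward a packet right now.
// Within budget, fail static: serve the last good config. Past budget,
// the design assumption is violated; here we merely flag it.
func (f *FailStaticForwarder) Lookup() (RoutingConfig, bool) {
	f.mu.Lock()
	defer f.mu.Unlock()
	withinBudget := time.Since(f.lastSeen) <= f.budget
	return f.lastGood, withinBudget
}

func main() {
	fwd := &FailStaticForwarder{budget: 2 * time.Second}
	fwd.Push(RoutingConfig{Version: 42})

	// Control plane disappears; forwarding continues on version 42.
	time.Sleep(1 * time.Second)
	cfg, ok := fwd.Lookup()
	fmt.Printf("config v%d, within fail-static budget: %v\n", cfg.Version, ok)

	// The absence outlasts the budget. In the real system, this is
	// roughly where BGP peers would begin withdrawing routes.
	time.Sleep(2 * time.Second)
	cfg, ok = fwd.Lookup()
	fmt.Printf("config v%d, within fail-static budget: %v\n", cfg.Version, ok)
}
```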

On June 2, 2019, Google's datacenter maintenance automation began a planned event. The automation was supposed to deschedule the network control plane jobs only in the location under maintenance, so that other locations could carry the load. It had a bug. The bug let it deschedule multiple independent software clusters at once, even when those clusters were in different physical locations. At 18:45 UTC, the automation triggered the bug and descheduled network control plane instances in multiple locations simultaneously. End-user impact began two to four minutes later, as BGP routes managed by the now-absent control plane began to withdraw and capacity to the affected regions shrank.
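The public report describes the behaviour of the bug, not its code. As a hypothetical sketch with invented names, the shape of such a bug can be as small as one missing predicate: the selector is meant to be bounded to the maintenance location, but nothing checks location.

```go
package main

import "fmt"

// Cluster is a hypothetical stand-in for an independent control plane cluster.
type Cluster struct {
	Name     string
	Location string
	Role     string
}

// buggySelect is meant to pick control plane clusters in the maintenance
// location only. The bug: it filters on role but never on location, so it
// matches independent clusters in other physical locations too.
func buggySelect(all []Cluster, maintenanceLocation string) []Cluster {
	var out []Cluster
	for _, c := range all {
		if c.Role == "network-control-plane" { // missing: && c.Location == maintenanceLocation
			out = append(out, c)
		}
	}
	return out
}

func main() {
	clusters := []Cluster{
		{"cp-a", "us-east-dc1", "network-control-plane"},
		{"cp-b", "us-west-dc2", "network-control-plane"},
		{"cp-c", "eu-dc3", "network-control-plane"},
	}
	// Maintenance is declared for one location, but all three clusters match.
	for _, c := range buggySelect(clusters, "us-east-dc1") {
		fmt.Printf("descheduling %s in %s\n", c.Name, c.Location)
	}
}
```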

The result was simultaneous degradation across products that share the network. Compute Engine, App Engine, Cloud Storage's regional buckets, BigQuery, Cloud Spanner, Cloud Pub/Sub, Cloud Endpoints, Cloud Interconnect, Cloud VPN, Cloud Console, and Stackdriver all showed errors or elevated latency. G Suite services on the same network showed impact. Consumer products did too. Third-party services that depend on Google APIs — and there are many — felt it as their own outage. Inside Google, engineering tools that needed network capacity to diagnose the problem competed for the same congested capacity that customers were on. The network was supposed to prioritise higher-class traffic over best-effort, and without the control plane present it could not enforce that prioritisation as effectively as usual.

Identifying the trigger and halting the automation took about an hour and fifteen minutes, with the bug confirmed at 20:01 UTC. Stopping the automation did not bring the descheduled jobs back. The configuration data they had been managing had to be rebuilt and redistributed in the affected locations. That rollout began at 21:03 UTC. Network capacity to the affected regions started returning at 22:19 UTC. Full restoration came at 23:10 UTC, four hours and twenty-five minutes after end-user impact began.
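Google has not published the rollout mechanics, so the following is a minimal sketch of one way to stage such a recovery, with invented helpers pushConfig and healthy: configuration goes out one location at a time, gated on a health check, so the recovery itself does not add load spikes to an already congested network.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// pushConfig and healthy are hypothetical stand-ins for redistributing
// rebuilt control plane configuration and verifying a location recovered.
func pushConfig(location string) error {
	fmt.Println("pushing rebuilt config to", location)
	return nil
}

func healthy(location string) bool { return true } // placeholder check

// stagedRollout redistributes configuration one location at a time,
// pausing between steps so the recovery traffic does not itself spike
// a network that is already congested.
func stagedRollout(locations []string, pause time.Duration) error {
	for _, loc := range locations {
		if err := pushConfig(loc); err != nil {
			return err
		}
		time.Sleep(pause) // let routes re-converge before the next step
		if !healthy(loc) {
			return errors.New("halting rollout: " + loc + " did not recover")
		}
	}
	return nil
}

func main() {
	locations := []string{"us-east-dc1", "us-west-dc2", "eu-dc3"}
	if err := stagedRollout(locations, time.Second); err != nil {
		fmt.Println(err)
	}
}
```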

The automation bug mattered because it exceeded the blast radius the network design assumed. The maintenance plan said one location at a time. The code allowed multiple locations at once. The fail-static behavior was sized for the plan, not for the code path that actually ran. Once the automation acted on its own definition of scope, the network entered a regime nobody had designed for, and recovery had to happen through the same congested network that was failing.

The software initiating maintenance events had a specific bug, allowing it to deschedule multiple independent software clusters at once, crucially even if those clusters were in different physical locations.
Google, Google Cloud Networking Incident #19009

From the first signal to all-clear in 4h 25m.

Jun 2, 18:45 UTC
Automation descheduled network control plane jobs
Google's datacenter maintenance automation begins a planned event. A bug in the automation lets it deschedule network control plane jobs in multiple physical locations at once instead of in only the location under maintenance. The network control plane in the affected locations stops running.
Jun 2, 18:47 UTC
End-user impact begins
Within a couple of minutes of the descheduling, BGP routes managed by the now-absent control plane are withdrawn. Traffic capacity to the affected regions shrinks sharply. Packet loss climbs as the surviving links congest: the fail-static data plane keeps forwarding, but only over the capacity that has not been withdrawn.
Jun 2, ~18:50 UTC
Google Cloud and consumer services degrade together
Compute Engine, App Engine, Cloud Storage regional buckets, BigQuery, Cloud Spanner, Cloud Pub/Sub, Cloud Endpoints, Cloud Interconnect, Cloud VPN, Cloud Console, and Stackdriver all show errors or elevated latency. G Suite services share the same network and are affected. YouTube, Gmail, and other consumer products also show impact. Third-party services depending on Google APIs report failures.
Jun 2, 19:30 UTC
Network debugging is itself slowed by congestion
Engineering tools used to investigate the network compete for the same congested capacity that customers do. The fail-static design keeps packets moving but cannot prioritise debugging traffic over user traffic without the control plane present.
Jun 2, 20:01 UTC
Root cause identified; automation halted
Engineers halt the maintenance automation that was descheduling the network control plane. The trigger stops, but the descheduled jobs still need to come back and the configuration data they managed has to be rebuilt.
Jun 2, 21:03 UTC
Configuration rollout begins
Engineers begin a configuration rollout to rebuild and redistribute the network configuration data that was lost when the control plane jobs were descheduled in multiple locations. The rollout has to be staged carefully to avoid creating additional traffic spikes during recovery.
Jun 2, 22:19 UTC
Network capacity recovery starts
Network capacity begins returning to the affected regions as the rebuilt configuration takes effect. Customer-visible error rates and latency improve.
Jun 2, 23:10 UTC
Full service restored
Google Cloud Platform, G Suite, and consumer products return to normal operation. Total duration from end-user impact to full recovery: 4 hours 25 minutes. Google publishes a detailed incident report.

An automation that did not know which clusters it was allowed to touch.

Google's network control plane jobs and their supporting infrastructure in any given physical location were configured to be stopped during a datacenter maintenance event in that location. The intent was that other locations would carry the load while one site was being worked on. Google's maintenance automation had a bug: it could deschedule multiple independent software clusters at once, even if those clusters were in different physical locations. On June 2, the automation triggered that bug and descheduled network control plane instances in multiple locations simultaneously.

The network was designed to fail static — to keep forwarding packets based on the last good configuration while the control plane was temporarily absent. Fail-static is enough for a few minutes during a routine descheduling. It is not enough when the control plane is absent in multiple locations at once and BGP routes that depended on it are withdrawn. Capacity to the affected regions shrank sharply within minutes. Congestion management without the control plane present could not prioritise customer traffic over best-effort traffic the way it normally would, and high-priority traffic degraded alongside the rest.

The recovery had to run through the same congested network it was trying to fix. Engineers identified the bug, halted the automation, and then had to rebuild and redistribute the configuration data that the descheduled jobs had managed. The rebuild itself was slow because the network was loaded and the engineering tooling for diagnosis was competing for the same capacity as customer traffic. Customer-visible impact ran 4 hours and 25 minutes from onset to full recovery.
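Lesson 03 below argues for out-of-band paths. As a taste of what that can mean in practice, here is a hypothetical sketch (the addresses are invented) of a control-session dialer that binds to a dedicated management address first and only falls back to the congested production path.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// dialControl tries the out-of-band management network first, then falls
// back to the production data plane. Addresses are hypothetical.
func dialControl(target string) (net.Conn, error) {
	// Binding the source to the management interface's address makes the
	// OS route the session over the out-of-band network.
	mgmtSrc := &net.TCPAddr{IP: net.ParseIP("192.0.2.10")}
	oob := net.Dialer{LocalAddr: mgmtSrc, Timeout: 2 * time.Second}
	if conn, err := oob.Dial("tcp", target); err == nil {
		return conn, nil
	}
	// Fall back to the (possibly congested) production path.
	inband := net.Dialer{Timeout: 10 * time.Second}
	return inband.Dial("tcp", target)
}

func main() {
	conn, err := dialControl("198.51.100.20:22")
	if err != nil {
		fmt.Println("no path to device:", err)
		return
	}
	defer conn.Close()
	fmt.Println("control session via", conn.LocalAddr())
}
```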

What turned a maintenance bug into a multi-product outage.

01
Automation could deschedule across physical locations
The maintenance automation's blast radius was not bounded to the location under maintenance. The bug let it deschedule independent software clusters in different physical locations at once. The design said maintenance was local; the code did not enforce that.
02
Fail-static is finite by design
The network was supposed to keep forwarding packets without a control plane for short periods. The design assumed the control plane would come back quickly. When the control plane was absent in multiple locations at once, the fail-static window was exceeded, BGP withdrew, and capacity dropped.
03
Consumer and cloud traffic share the same network fabric
Google's global backbone carries traffic for Cloud, G Suite, and consumer products together. A network event that hits the backbone has no product boundary. YouTube, Gmail, Drive, and GCP all see impact simultaneously, and third-party services that depend on Google APIs experience it as well.
04
Congestion management could not prioritise without the control plane
Under normal congestion, Google's network prioritises higher-class traffic over best-effort. Without the control plane, the network could not fully enforce that prioritisation, so high-priority customer traffic degraded alongside the rest. The protective mechanism customers paid for was control-plane-dependent. A sketch of priority that survives control-plane absence follows this list.
05
Debugging tools competed with customer traffic for the same congested capacity
The same network that was failing was the network engineers had to use to diagnose the failure. Debug telemetry, control sessions, and recovery rollouts all competed for the constrained capacity. The incident took longer to resolve than it would have if observability and recovery paths had been protected from the customer-traffic congestion engineers were trying to relieve.
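Factor 04 above and lesson 04 below point at the same remedy: priority decisions driven by configuration that is already on the forwarding device. The sketch below uses invented class names and Go maps where real hardware would use DSCP markings and queues, but the point survives the simplification: nothing in the hot path calls the control plane.

```go
package main

import "fmt"

// Packet carries a traffic class assigned at the edge.
type Packet struct {
	Class string
	Data  string
}

// classRank is pre-distributed configuration: it lives on the forwarding
// device, not in the control plane, so strict priority keeps working
// while the control plane is absent. Hypothetical class names.
var classRank = map[string]int{
	"network-control": 0, // highest: BGP sessions, recovery rollouts
	"premium":         1,
	"best-effort":     2,
}

// dequeue picks the next packet by strict priority using only the
// locally cached classRank; no control plane call is involved.
func dequeue(queues map[string][]Packet) (Packet, bool) {
	best, found := Packet{}, false
	bestRank := int(^uint(0) >> 1) // max int
	for class, q := range queues {
		if len(q) == 0 {
			continue
		}
		if r, ok := classRank[class]; ok && r < bestRank {
			best, bestRank, found = q[0], r, true
		}
	}
	if found {
		queues[best.Class] = queues[best.Class][1:]
	}
	return best, found
}

func main() {
	queues := map[string][]Packet{
		"best-effort":     {{"best-effort", "video chunk"}},
		"premium":         {{"premium", "customer rpc"}},
		"network-control": {{"network-control", "bgp keepalive"}},
	}
	for {
		p, ok := dequeue(queues)
		if !ok {
			break
		}
		fmt.Println("forwarding", p.Class, "-", p.Data)
	}
}
```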

What to take from this incident.

01
Bound the blast radius of maintenance automation to its intended scope
Automation that performs maintenance in one location should not be able to act in other locations. Make the location an explicit, validated parameter, and enforce it at the boundary the automation cannot cross. The check has to live in the executor, not in the maintenance engineer's intention. A sketch of such a check follows this list.
02
Treat 'fail static' as a budget you can exhaust, not a property you have
A network that fail-statics for short outages can still fail when the control plane is absent for long enough or in enough places. Model the time budget the fail-static design actually provides, and design recovery to fit within it, including the case where the control plane needs to be rebuilt across multiple locations at once.
03
Keep the recovery path off the network that is failing
If engineers must debug a network problem over the same congested capacity that carries the customer traffic it is hurting, recovery slows down. Out-of-band paths for control sessions, debug telemetry, and recovery rollouts, even at limited bandwidth, keep diagnosis moving when the data plane cannot. The dialer sketch above shows one shape this can take.
04
Preserve traffic prioritisation when the control plane is gone
If the high-priority traffic class only works when the control plane is present, the customers who paid for that priority lose it exactly when they need it most. Make prioritisation hold in the data plane based on pre-distributed configuration so it survives a temporary control plane outage.
05
Separate the consumer and cloud paths where the cost is acceptable
A shared backbone carries the operational benefit of one network and the blast-radius cost of one network. Where the cost matters (for example, where a Cloud customer SLA implies independence from consumer product incidents), architectural separation at the network layer reduces cross-product impact during events like this one.
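Lesson 01 says the check has to live in the executor. Here is a minimal sketch of that boundary, with invented request and cluster types: the component that actually stops jobs refuses any request whose targets stray outside the declared maintenance location, so a planner bug cannot widen the blast radius.

```go
package main

import "fmt"

// DescheduleRequest is hypothetical: the maintenance automation declares
// the location under maintenance and the clusters it wants to stop.
type DescheduleRequest struct {
	MaintenanceLocation string
	Targets             []Cluster
}

type Cluster struct {
	Name     string
	Location string
}

// validateScope runs in the executor, the component that actually stops
// jobs. It rejects the whole request if any target is outside the declared
// location, so a bug in the planner cannot widen the blast radius.
func validateScope(req DescheduleRequest) error {
	for _, c := range req.Targets {
		if c.Location != req.MaintenanceLocation {
			return fmt.Errorf("refusing to deschedule %s: %s is outside maintenance location %s",
				c.Name, c.Location, req.MaintenanceLocation)
		}
	}
	return nil
}

func main() {
	req := DescheduleRequest{
		MaintenanceLocation: "us-east-dc1",
		Targets: []Cluster{
			{"cp-a", "us-east-dc1"},
			{"cp-b", "us-west-dc2"}, // the planner bug's extra target
		},
	}
	if err := validateScope(req); err != nil {
		fmt.Println(err) // the executor refuses; the bug is contained
	}
}
```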

Read the original.

Google Cloud Networking Incident #19009
status.cloud.google.com