An automation bug deschedules Google's network control plane.
How a maintenance event in one datacenter, combined with a bug that let automation deschedule multiple independent network control plane clusters across physical locations, took down Google Cloud, G Suite, and consumer products together — and why the rescue had to be debugged through the same congested network it was trying to fix.
Google's global backbone served both cloud customers and consumer products. Google Cloud Platform, G Suite, YouTube, Gmail, and Drive shared the same network fabric. The control plane managed routes, congestion, and traffic prioritisation; the data plane forwarded packets according to the last configuration it had received. The design could tolerate short control-plane absences by failing static — continuing to forward based on the last good state. If the control plane stayed absent too long or disappeared in too many places, BGP routes would withdraw, capacity to affected regions would shrink, and traffic prioritisation would degrade when customers needed it most.
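A minimal sketch can make the fail-static idea concrete: the data plane keeps forwarding on the last configuration the control plane pushed, and only degrades once the control plane has been absent past some window. The class names, the ten-second window, and the return values below are illustrative assumptions, not Google's actual mechanisms or timers.

```python
import time

# Toy model of fail-static forwarding. The data plane reuses the last good
# configuration while the control plane is absent; past a hold window, the
# routes it advertised withdraw and service degrades. All values illustrative.

FAIL_STATIC_WINDOW_S = 10.0   # assumed hold window, not a real figure

class DataPlane:
    def __init__(self):
        self.last_config = None   # last routing/prioritisation state received
        self.last_update = None   # when the control plane last pushed config

    def push_config(self, config, now=None):
        """Called by the control plane while it is scheduled and healthy."""
        self.last_config = config
        self.last_update = now if now is not None else time.monotonic()

    def forward(self, packet, now=None):
        now = now if now is not None else time.monotonic()
        if self.last_config is None:
            return "drop: never configured"
        if now - self.last_update <= FAIL_STATIC_WINDOW_S:
            return "forward on last good config (fail static)"
        # Control plane absent too long: routes withdraw, capacity shrinks,
        # prioritisation can no longer be enforced properly.
        return "degrade: withdraw routes"

dp = DataPlane()
dp.push_config({"routes": ["10.0.0.0/8 via site-a"]}, now=0.0)
print(dp.forward("pkt", now=5.0))    # brief absence: still forwarding
print(dp.forward("pkt", now=600.0))  # prolonged absence: degradation
```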
On June 2, 2019, Google's datacenter maintenance automation began a planned event. The automation was supposed to deschedule the network control plane jobs only in the location under maintenance, so that other locations could carry the load. It had a bug. The bug let it deschedule multiple independent software clusters at once, even when those clusters were in different physical locations. At 18:45 UTC, the automation took advantage of the bug and descheduled network control plane instances in multiple locations simultaneously. End-user impact began two to four minutes later, as BGP routes managed by the now-absent control plane began to withdraw and capacity to the affected regions shrank.
The result was simultaneous degradation across products that share the network. Compute Engine, App Engine, Cloud Storage's regional buckets, BigQuery, Cloud Spanner, Cloud Pub/Sub, Cloud Endpoints, Cloud Interconnect, Cloud VPN, Cloud Console, and Stackdriver all showed errors or elevated latency. G Suite services on the same network showed impact. Consumer products did too. Third-party services that depend on Google APIs — and there are many — felt it as their own outage. Inside Google, engineering tools that needed network capacity to diagnose the problem competed for the same congested capacity as customer traffic. The network was supposed to prioritise higher-class traffic over best-effort, but without the control plane present it could not fully enforce that prioritisation.
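That last point can be pictured with a toy queue: the traffic classes and their priorities normally come from the control plane, and when that mapping is missing everything effectively collapses into one class under congestion. The class names and priority values below are invented for the illustration, not Google's traffic classes.

```python
# Toy priority queue: with a class-to-priority mapping (normally supplied by
# the control plane), higher classes drain first; without it, everything is
# served in arrival order. Class names and values are made up for this sketch.

CONTROL_PLANE_PRIORITIES = {"network-control": 0, "latency-sensitive": 1, "best-effort": 2}

def drain(backlog, priorities):
    """Serve packets lowest priority number first; unknown classes fall back to best-effort."""
    default = max(priorities.values()) if priorities else 0
    order = sorted(range(len(backlog)), key=lambda i: (priorities.get(backlog[i], default), i))
    return [backlog[i] for i in order]

backlog = ["best-effort", "latency-sensitive", "best-effort", "network-control"]
print(drain(backlog, CONTROL_PLANE_PRIORITIES))  # control plane present: strict ordering
print(drain(backlog, {}))                        # absent: one undifferentiated class
```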
Identifying the trigger and halting the automation took about an hour and fifteen minutes, with the bug confirmed at 20:01 UTC. Stopping the automation did not bring the descheduled jobs back. The configuration data they had been managing had to be rebuilt and redistributed in the affected locations. That rollout began at 21:03 UTC. Network capacity to the affected regions started returning at 22:19 UTC. Full restoration came at 23:10 UTC, four hours and twenty-five minutes after end-user impact began.
The automation bug mattered because it exceeded the blast radius the network design assumed. The maintenance plan said one location at a time. The code allowed multiple locations at once. The fail-static behavior was sized for the plan, not for the code path that actually ran. Once the automation acted on its own definition of scope, the network entered a regime nobody had designed for, and recovery had to happen through the same congested network that was failing.
The software initiating maintenance events had a specific bug, allowing it to deschedule multiple independent software clusters at once, crucially even if those clusters were in different physical locations.
// Google, Google Cloud Networking Incident #19009
From the first signal to all-clear in 4h 25m.
An automation that did not know which clusters it was allowed to touch.
Google's network control plane jobs and their supporting infrastructure in any given physical location were configured to be stopped during a datacenter maintenance event in that location. The intent was that other locations would carry the load while one site was being worked on. Google's maintenance automation had a bug: it could deschedule multiple independent software clusters at once, even if those clusters were in different physical locations. On June 2, the automation took advantage of that bug and descheduled network control plane instances in multiple locations simultaneously.
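One way to picture the missing constraint: the automation needed something equivalent to a guard that refuses any descheduling set spanning more than the location under maintenance. The sketch below is an assumption about the shape of such a check, not Google's internal tooling; every name in it is invented.

```python
from dataclasses import dataclass

# Hypothetical scope guard for a maintenance event. The intended behaviour was
# "deschedule control plane jobs only in the location under maintenance"; the
# bug, in effect, was the absence of a check like this one.

@dataclass(frozen=True)
class ControlPlaneCluster:
    name: str
    location: str   # the physical site the cluster runs in

def clusters_to_deschedule(candidates, maintenance_location):
    """Restrict the blast radius of a maintenance event to one physical location."""
    out_of_scope = [c for c in candidates if c.location != maintenance_location]
    if out_of_scope:
        raise ValueError(
            f"refusing to deschedule clusters outside {maintenance_location}: "
            f"{[c.name for c in out_of_scope]}"
        )
    return [c for c in candidates if c.location == maintenance_location]

# A candidate list that spans physical locations should be rejected outright.
candidates = [
    ControlPlaneCluster("net-cp-1", "site-under-maintenance"),
    ControlPlaneCluster("net-cp-2", "unrelated-site"),
]
# clusters_to_deschedule(candidates, "site-under-maintenance")  # raises ValueError
```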
The network was designed to fail static — to keep forwarding packets based on the last good configuration while the control plane was temporarily absent. Fail-static is enough for a few minutes during a routine descheduling. It is not enough when the control plane is absent in multiple locations at once and BGP routes that depended on it are withdrawn. Capacity to the affected regions shrank sharply within minutes. Without the control plane present, congestion management could not prioritise customer traffic over best-effort the way it normally would, and high-priority traffic degraded alongside the rest.
The recovery had to be debugged through the same congested network it was trying to fix. Engineers identified the bug, halted the automation, and then had to rebuild and redistribute the configuration data that the descheduled jobs had managed. The rebuild itself was slow because the network was loaded and the engineering tooling for diagnosis was competing for the same capacity as customer traffic. Total customer-visible impact ran four hours and twenty-five minutes from end-user impact to full recovery.