Leap year date bug disrupts Azure compute globally.
On February 29, 2012, new Azure VMs tried to create one-year certificates ending on February 29, 2013 — a date that does not exist. The bug broke deployments and automated repair worldwide; later, a fast recovery package disconnected VMs in seven already-damaged clusters.
Before Azure could manage a new VM, the VM had to call home. A small management agent inside the VM connected back to the host so Azure could pass in secrets, check health, and finish bringing the workload under control-plane management. That first handshake depended on a short-lived transfer certificate. If the VM could create the certificate, it joined Azure's management loop. If it could not, Azure waited, retried, and eventually assumed the host machine was bad.
The certificate code did one ordinary-looking thing: take today's date and add one to the year. On most days, that made a valid one-year certificate. On February 29, 2012, it made February 29, 2013. There is no February 29 in 2013. The certificate failed before the VM could complete its first handshake with Azure.
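The exact guest-agent code is not public, but the arithmetic mistake is easy to reproduce. The Python sketch below shows how a naive "add one to the year" computation fails on a leap day, and one way (clamping to February 28) the result could be kept valid; the function names and the clamping policy are assumptions for illustration, not the fix Azure actually shipped.

```python
from datetime import date

def naive_valid_to(issued: date) -> date:
    # The leap-day mistake in miniature: keep month and day, bump the year.
    # On 2012-02-29 this asks for 2013-02-29, which does not exist, so
    # date() raises ValueError and certificate creation fails.
    return date(issued.year + 1, issued.month, issued.day)

def clamped_valid_to(issued: date) -> date:
    # One possible safe policy (an assumption, not necessarily Azure's fix):
    # the only start date that can fail is February 29, so clamp to February 28.
    try:
        return date(issued.year + 1, issued.month, issued.day)
    except ValueError:
        return date(issued.year + 1, 2, 28)

if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    print(clamped_valid_to(leap_day))          # 2013-02-28
    try:
        naive_valid_to(leap_day)
    except ValueError as exc:
        print(f"naive arithmetic fails on leap day: {exc}")
```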
The distinction between starting and running VMs matters. Already-running VMs did not all crash at midnight; the broken path was startup: new deployments, scale-out, operating-system updates, and automated repair. When Azure thought a server was unhealthy, it normally recreated that server's VMs somewhere else. On leap day, every recreated VM walked straight back into the same impossible date. In clusters already receiving platform updates, enough servers looked suspect that Azure hit its safety threshold and stopped automatic repair before it could dig the hole any deeper.
Microsoft's first broad response was to disable service management worldwide. Customers could not safely create, update, or scale services while VM startup was broken; those actions would either fail or push more VMs into the same path. Turning off management protected many already-running applications, especially multi-instance services with enough healthy capacity left, but it also froze customers out of normal operations while engineers wrote and tested the fix.
Then recovery created a second incident. Seven clusters were already stranded partway through a platform rollout, with a mix of old and new management components. The normal update process would have moved slowly through update domains to preserve availability, but customers were already down and the clusters needed a special repair. Microsoft chose speed: roll back the host-side component, pair it with the fixed VM-side component, test one server, then push the package broadly across those seven clusters.
The package included the wrong networking plugin. It expected the newer host-side component, not the older one used in the repair package. The single-server test proved that a VM could boot, but it did not prove that the VM could still talk to the network. Azure's first failure came from date arithmetic that assumed every year had the same calendar shape. The second came from a recovery test that checked startup but missed the capability customers actually needed.
"The leap day bug is that the GA calculated the valid-to date by simply taking the current date and adding one to its year."
Microsoft Azure incident summary, February 2012
When new VMs start, Azure's in-VM management agent creates a transfer certificate. The code sets the valid-to date by adding one to the year, producing February 29, 2013. That date does not exist, so certificate creation fails and the agent exits.
01:15 UTC
Azure mistakes the bug for bad servers
Host agents wait 25 minutes for the VM agent to connect, then retry. After three failed attempts, they mark the server for human investigation. In clusters already receiving platform updates, enough servers hit that safety threshold that Azure disables automatic repair.
01:45 UTC
Operations detects compute impact
Azure operations sees compute failures across multiple regions and alerts the Fabric Controller developers. New deployments, scale-out, updates, and recovery operations are failing; Storage and SQL Azure are not affected by the leap-day bug.
02:38 UTC
Developers identify the leap-day defect
The team traces the failures to transfer certificate creation inside the VM agent. The invalid February 29, 2013 valid-to date explains why the problem began exactly at midnight UTC on leap day.
02:55 UTC
Service management disabled worldwide
Microsoft disables service management in all clusters to stop customers from triggering more failed deployments, updates, and scale-outs. Running applications can continue where enough instances are healthy, but customers cannot manage their services normally.
07:20 UTC
Fixed VM agent code ready
Engineers finish the VM agent fix and begin testing it. The fix must reach clusters without creating more instability in systems already degraded or stuck partway through an update.
10:47 UTC
Recovery update disconnects seven clusters
For seven partially updated clusters, Microsoft pushes an older host agent with the fixed VM agent. The package includes a networking plugin built for the newer host agent. The incompatible combination disconnects VMs from the network, including deployments for ACS and Service Bus.
13:23 UTC
Most clusters restored
Microsoft announces service management has been restored to the majority of clusters. The seven clusters affected by the incompatible recovery package still require repair.
10:15 UTC Mar 1
All Azure services healthy
After manually repairing and validating servers left in inconsistent states, Microsoft posts that all Windows Azure services are healthy. The incident spans from midnight UTC on February 29 to the final healthy update at 10:15 UTC on March 1.
An impossible certificate date, followed by a risky repair.
Every new Azure VM needed a small in-VM management agent to connect back to the host before Azure could finish managing it. That agent created a transfer certificate and set its end date by adding one to the current year. On February 29, 2012, the code tried to create a certificate ending on February 29, 2013. The date did not exist, certificate creation failed, the agent exited, and the host eventually treated the server as faulty.
The bug sat behind VM startup, not ordinary traffic to already-running VMs. New deployments, scale-out, operating-system updates, and automated repair all created new VMs or restarted the in-VM agent, so those operations began failing worldwide. In clusters already receiving platform updates, the repeated failures tripped Azure's safety threshold for "too many suspect servers," which stopped automatic repair before it could keep repeating the same bug.
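The misdiagnosis step described above (silent agent failure, repeated timeouts, a hardware verdict, then a cluster-wide stop once too many servers look suspect) can be sketched in a few lines. The Python below is a simplified illustration of that decision flow; the 25-minute timeout and three attempts come from the timeline, while the threshold value and every name here are stand-ins, not Azure's actual host-agent implementation.

```python
# Simplified sketch of the host-side decision flow described above.
CONNECT_TIMEOUT_MINUTES = 25
MAX_ATTEMPTS = 3
SUSPECT_FRACTION_LIMIT = 0.10  # assumed value; the real threshold is not public

def wait_for_guest_agent(server_id: str) -> bool:
    """Stand-in for the handshake: True if the in-VM agent connects within
    the timeout. On leap day the agent always failed first, so the host saw
    the same silence a dead machine would produce."""
    return False

def assess_server(server_id: str) -> str:
    for _ in range(MAX_ATTEMPTS):
        if wait_for_guest_agent(server_id):
            return "healthy"
    # Three silent timeouts are indistinguishable from bad hardware here,
    # which is why a software bug was first treated as a server fault.
    return "mark for human investigation"

def auto_repair_allowed(suspect_servers: int, total_servers: int) -> bool:
    """Cluster-level safety valve: stop automatic healing once too many
    servers look bad at the same time."""
    return suspect_servers / total_servers < SUSPECT_FRACTION_LIMIT
```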
Recovery introduced a second failure. Seven clusters were partway through a platform rollout, with a mix of old and new host and VM management agents. Microsoft chose a fast repair using an older host agent and the fixed VM agent, but the package included a networking plugin built for the newer host agent. The single-server test proved a VM could start, but not that it could communicate over the network. The incompatibility reached all seven clusters and disconnected VMs that had previously been healthy.
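The gap in the recovery test also lends itself to a sketch. A boot-only check would have passed the incompatible package; a gate that also probes network connectivity from the repaired VM would have failed it on the first server. The Python below is a hypothetical illustration: the endpoint names and the shape of the check are assumptions, not the validation Azure actually runs.

```python
import socket

def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Minimal connectivity probe: can the repaired VM open a TCP connection
    to a known endpoint? A boot-only test skips this entirely."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def recovery_health_gate(vm_booted: bool,
                         probe_targets: list[tuple[str, int]]) -> bool:
    """Pass only if the VM started *and* it can still talk to the network.
    A gate like this would have stopped the leap-day repair package at the
    single-server test instead of letting it reach all seven clusters."""
    if not vm_booted:
        return False
    return all(can_reach(host, port) for host, port in probe_targets)

if __name__ == "__main__":
    # Hypothetical endpoints a repaired VM should be able to reach.
    targets = [("storage.internal.example", 443), ("dns.internal.example", 53)]
    print(recovery_health_gate(vm_booted=True, probe_targets=targets))
```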
What turned a niche date bug into a global cloud outage.
01
No test coverage for February 29
The date calculation code had no test case for leap day inputs. A parametrised unit test covering February 28, February 29, and March 1 in leap and non-leap years would have caught the invalid valid-to date before the guest agent shipped; a sketch of such a test follows this list. Calendar edge cases are a known failure category, not an exotic one.
02
Azure mistook a software bug for broken machines
When the VM agent failed to create a certificate, it exited before it could report a useful error. After three 25-minute timeouts, the host agent inferred a hardware problem and marked the server for human investigation. The safety threshold protected clusters, but the first diagnosis path still pointed at servers instead of the agent software.
03
Automated repair kept replaying the trigger
Azure normally recovered from a bad server by recreating its VMs on other servers. On leap day, each recreated VM started the same certificate path and hit the same impossible date. A recovery mechanism became another trigger for the failure it was trying to repair.
04
The recovery test proved boot, not networking
The seven partially updated clusters needed a special repair path because they had a mix of component versions. The team tested whether VMs started with the older host agent and fixed VM agent, but did not test whether those VMs had working network connectivity. The missing check let an incompatible networking plugin reach every VM in those clusters.
05
Incident pressure pushed the team toward a blast update
Normal host-agent and VM-agent updates took hours because Azure honored update domains to preserve customer availability. During the incident, Microsoft chose to update all servers in the seven damaged clusters at once. That reduced waiting time, but it also removed the staged rollout that would have exposed the networking incompatibility on a small subset first.
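Point 01 above asks for a parametrised test over the calendar boundaries. A minimal pytest version is sketched below; the `valid_to` function and its clamp-to-February-28 policy are assumptions standing in for the real certificate code.

```python
from datetime import date
import pytest

def valid_to(issued: date) -> date:
    """Function under test: one year later, clamping February 29 to
    February 28 when the target year is not a leap year (an assumed
    policy, not necessarily the one Azure adopted)."""
    try:
        return date(issued.year + 1, issued.month, issued.day)
    except ValueError:
        return date(issued.year + 1, 2, 28)

@pytest.mark.parametrize(
    "issued, expected",
    [
        (date(2012, 2, 28), date(2013, 2, 28)),  # leap year, day before leap day
        (date(2012, 2, 29), date(2013, 2, 28)),  # leap day itself: the missed case
        (date(2012, 3, 1), date(2013, 3, 1)),    # leap year, day after leap day
        (date(2013, 2, 28), date(2014, 2, 28)),  # non-leap year, end of February
        (date(2013, 3, 1), date(2014, 3, 1)),    # non-leap year, start of March
    ],
)
def test_valid_to_handles_calendar_edges(issued, expected):
    assert valid_to(issued) == expected
```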
01
Test date arithmetic against every calendar edge case, every time.
Leap day, year rollover, DST transitions, and epoch boundaries are known failure points. Every function that constructs or shifts dates should have parametrised tests for those boundaries, running as part of the standard CI gate. The cost of the tests is small; the cost of missing them can be global.
02
Make startup failures specific before automated repair acts on them.
A VM-agent startup failure, a hardware fault, and a network fault require different recovery paths. If the system cannot tell them apart, it may move workloads, restart VMs, or quarantine servers in ways that amplify a software bug. Fail fast with a precise error before automated healing begins.
03
Treat automated recovery paths as production features.
Service healing, reincarnation, failover, and restart loops execute exactly when the system is least stable. They need the same edge-case testing as the primary path, including rare dates and version-mismatch states. A recovery path that repeats the trigger will spread the incident.
04
Have fine-grained controls for protecting customers during an incident.
Microsoft disabled service management worldwide to stop customers from triggering failed deployments, updates, and scale-outs while VM startup was broken. That protected already-running applications, but it also blocked customers from normal management operations. Incident response controls should let operators disable the dangerous action without turning off more of the control plane than necessary.
05
Blast updates need a deliberate compensating check.
Sometimes the fastest safe option during an incident is broader than the normal rollout policy. When that happens, add a check that replaces the safety you removed: one cluster, one update domain, one complete health test, then the wider push. Speed without that gate turns the recovery action into a second incident.
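As a concrete, hypothetical version of that compensating check, the sketch below updates a small canary set first, runs the complete health test, and only then widens the push. The function names and parameters are illustrative, not an Azure deployment API.

```python
from typing import Callable

def staged_push(servers: list[str],
                apply_update: Callable[[str], None],
                full_health_check: Callable[[str], bool],
                canary_count: int = 1) -> bool:
    """Compensating gate for an emergency-speed rollout: update a small
    canary set, run the complete health test (boot and connectivity),
    and only then push the update to everything else."""
    canaries, rest = servers[:canary_count], servers[canary_count:]
    for server in canaries:
        apply_update(server)
        if not full_health_check(server):
            # Stop here: the canary caught what a boot-only test would miss.
            return False
    for server in rest:
        apply_update(server)
    return all(full_health_check(server) for server in rest)
```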