FM-015Microsoft Azure2012-02-29impact 34h 15mSEV-1

The Leap-Day Arithmetic Error That Froze Every New Azure VM for 27 Hours

How a February 29 date calculation made transfer certificate creation fail for every new and reincarnated VM at midnight, why the Host Agent's hardware-fault heuristic turned a software bug into a service-healing cascade that propagated itself across healthy servers, and how a recovery package without connectivity tests took seven clusters from degraded to fully offline.

leap-year vm-startup certificate deploy cloud

citation

case study

At midnight on February 29, 2012, every new and reincarnated Azure VM hit a date arithmetic error in the Guest Agent. Adding one year to February 29 produced the invalid date February 29, 2013, and transfer certificate creation failed immediately. A transfer certificate is the credential a VM generates during initialization so it can register with the platform controller. Transfer certificate creation was the first step in Guest Agent initialization — required before any VM could register with the Fabric Controller. The failure hit new deployments, scale-outs, OS updates, and service-healing reincarnations simultaneously.

Windows Azure organized physical servers into clusters of roughly 1,000, each managed by a redundant Fabric Controller. The Fabric Controller was responsible for deployments, scaling, health monitoring, and automatic service healing. When the Guest Agent terminated silently before connecting, the Host Agent waited through three consecutive 25-minute timeouts. After those timeouts, it classified the server as having a probable hardware fault and moved it to Human Investigate state.

Before enough servers reached the Human Investigate threshold, service healing moved VMs to other servers. On those new servers, the same Guest Agent bug reproduced, cascading more servers into Human Investigate state. Critical alarms fired at 01:15 UTC, stopped the rollout, and paged Fabric Controller developers. Developers identified the leap-day bug at 02:38 UTC. At 02:55 UTC, Microsoft disabled service management in all clusters worldwide. This blocked customer deployments, scale-outs, and OS updates, preventing additional VMs from hitting the failing certificate path.

The Guest Agent had no retry or graceful-degradation path on certificate failure. When the date calculation produced an invalid date, the agent terminated immediately. There was no retry, no partial initialization mode, and no typed error surfaced to the Host Agent. The 75-minute timeout sequence created a window in which certificate failures accumulated across rolling clusters. Each affected VM took three 25-minute cycles to move its server to HI state, while clusters kept placing new VMs on additional servers, each beginning its own countdown.

Microsoft validated the Guest Agent fix in a test cluster at 06:50 UTC and confirmed it in one production cluster. At 10:11 UTC, Microsoft pushed the fix to all clusters. Seven clusters had just started their platform rollout when the bug struck. Most of their servers ran the old Host Agent and Guest Agent combination; some had already moved to the new version. For these clusters, Microsoft chose a blast update. The blast pushed the older Host Agent to all servers simultaneously, bypassing the standard multi-hour staged-domain rollout. The blast package included a networking plugin written for the newer Host Agent, incompatible with the older one it was paired with. The single-server pre-push test confirmed VMs started but did not verify network connectivity. At 10:47 UTC, the package disconnected every VM in all seven clusters from the network, including previously healthy ones. Access Control Service and Service Bus were deployed inside those seven clusters. When every VM disconnected, any application depending on those services lost them regardless of its own Compute footprint.

Disabling service management worldwide at 02:55 UTC was a containment decision with a clear rationale. Customer scale-out, deployment, and OS update operations all created new VMs, and every new VM would attempt the same failing certificate path. The tradeoff was visible — customers lost the ability to manage their applications — but the freeze preserved already-running multi-instance services that still had healthy capacity. The seven clusters were largely operational by 16:00 UTC, but the final healthy status did not arrive until 10:15 UTC on March 1. Servers left in corrupted intermediate states by repeated transitions required per-server manual restoration that automated tooling could not complete.

Microsoft's postmortem framed remediation across multiple layers. Time-boundary tests and typed error classification would catch date bugs and surface Guest Agent failures without the 75-minute timeout. Finer-grained service management controls would let individual aspects be disabled without a global freeze. Microsoft committed to investment in tooling to speed recovery from intermediate states.

timeline · UTC

From the first signal to all-clear in 34h 15m.

00:00 UTC

Leap-day certificate generation fails

Guest Agents in new VMs hit the leap-day bug when generating transfer certificates, producing the invalid date February 29, 2013.

01:15 UTC

Human Investigate threshold reached

In updating clusters, the Human Investigate threshold was reached after three 25-minute Host Agent timeouts. Critical alarms fired, stopped the rollout, and paged Fabric Controller developers.

02:38 UTC

Developers identify the bug

Fabric Controller developers identified the leap-day bug at 02:38 UTC, two and a half hours after it triggered. Detection took that long because the initial signal was server-health alarms, not certificate-generation errors.

02:55 UTC

Service management disabled worldwide

Microsoft disabled service management in all clusters worldwide to prevent customer actions from creating additional impact. Customers lost the ability to deploy, scale, or update applications, though already-running multi-instance services retained capacity.

06:50 UTC

Guest Agent fix validated in test cluster

Microsoft finished testing the updated Guest Agent in a test cluster.

10:11 UTC

Guest Agent fix pushed to all clusters

After one production cluster completed successfully, Microsoft pushed the Guest Agent fix to all clusters.

10:47 UTC

Recovery package disconnects seven clusters

Seven partially updated clusters received a blast of the older Host Agent. The package included a networking plugin written for the newer Host Agent, which was incompatible with the older one. The pre-push test had not caught the incompatibility because it only confirmed VMs started, not that they had network connectivity. At 10:47 UTC the package disconnected every VM in all seven clusters from the network, including VMs that had previously been healthy.

11:40 UTC

Corrected package tested with connectivity checks

Microsoft tested a corrected Host Agent package, this time verifying VM connectivity and other VM health checks.

13:23 UTC

Service management restored to majority of clusters

Microsoft announced that service management had been restored to the majority of clusters.

13:40 UTC

Corrected fix blasted to seven clusters

Microsoft began blasting the corrected Host Agent fix to the seven affected clusters.

16:00 UTC

Seven clusters largely operational

The seven clusters were largely operational again, but a number of servers remained in corrupted states requiring manual restoration and validation.

Mar 1, 10:15 UTC

All services reported healthy

Microsoft posted the final dashboard update that all Windows Azure services were healthy — more than 34 hours after the leap-day bug triggered.

lessons

What to take away.

Treat calendar boundaries as production failure modes for control-plane code, especially when date math creates credentials, leases, expirations, or scheduler decisions.The outage began when a simple one-year date transformation produced an invalid certificate date on February 29. The lesson is not just to add leap-day tests, but to inventory all time-derived control-plane artifacts whose failure blocks initialization or recovery. The cost is maintaining explicit boundary-case suites for paths that often look routine.

boundary_case_testing

Fail control-plane agents with typed errors before recovery systems reinterpret software initialization failures as hardware failures.Azure's Host Agent treated repeated Guest Agent connection failures as a server fault, which fed the service-healing loop. Typed error paths and fail-fast reporting reduce the chance that automated recovery makes a software bug look like infrastructure loss. The tradeoff is more explicit agent-state modeling and alert taxonomy.

typed_failure_signaling

Design break-glass controls that can disable the dangerous part of a control plane without freezing every customer management operation.Microsoft's worldwide service-management freeze stopped customers from triggering more failing VM creation paths, but also blocked benign management actions. Finer-grained controls preserve containment while reducing collateral impact. The boundary must be designed before the incident because responders cannot safely invent it mid-outage.

break_glass_controls

Recovery packages need behavior-level validation on the properties customers depend on, not only startup success on a representative node.The secondary outage happened because the test confirmed VMs started but did not verify VM network connectivity. Recovery validation should include the externally visible properties the fix is meant to restore, especially when responders bypass normal rollout safeguards. This adds time to emergency fixes, but it protects against turning repair into a new outage.

behavior_level_validation

Emergency recovery plans should preserve at least one staged validation point before a fix reaches every instance of a partially broken fleet.Microsoft chose blast updates because normal Update Domain deployment would take many hours, a rational tradeoff under active customer impact. The same choice made the incompatible package affect every VM in seven clusters at once. The portable lesson is to predefine faster emergency rollout ladders that are still staged enough to catch package incompatibility.

staged_emergency_rollout

sources

Read the sources.

Summary of Windows Azure Service Disruption on Feb 29th, 2012

Microsoft Azure Blog ↗

← previous

FM-006 · The `rm -rf` That Erased GitLab's Production Database

FM-018 · The Overheated AWS Zone