Azure Front Door propagated metadata its edge could not process
A valid sequence of customer configuration changes across two control-plane builds produced incompatible metadata. The metadata passed staged health checks, updated last-known-good state, then crashed the data plane asynchronously.
Azure Front Door sat in front of customer applications and Microsoft services as a global edge routing layer. Its control plane generated customer configuration metadata; its data plane consumed that metadata at edge sites to route traffic, apply purge operations, and enforce WAF behavior. The system needed configuration changes to propagate quickly, but only after edge servers had proven they could process them.
On October 29, 2025, a valid sequence of customer configuration changes crossed two different control-plane build versions. That sequence generated incompatible metadata. The metadata reached pre-production and later stages while health signals still looked positive because the crash happened during asynchronous data-plane processing, several minutes after load.
By the time edge sites started crashing globally, Azure Front Door's configuration protection system had already allowed the metadata through most of the fleet. Worse, the last-known-good snapshot now contained the same incompatible configuration. A direct rollback would have reintroduced the problem.
Microsoft blocked new configuration propagation, manually edited the latest last-known-good (LKG) snapshot to remove the problematic customer configurations, deployed the repaired snapshot globally, and gradually rebalanced traffic as edge sites recovered. The timing gap, in which health checks kept advancing while asynchronous workers crashed, let the bad configuration be written into the LKG snapshot before anyone could see it was bad. Once that happened, rollback meant editing the snapshot, not reverting to it.
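To make the repair step concrete, here is a minimal sketch of removing specific customer configurations from a snapshot before redeploying it. It is an illustration only, not Microsoft's tooling: the `Snapshot` type, the `repair_snapshot` function, and the tenant names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical last-known-good snapshot: a version plus per-customer metadata."""
    version: int
    configs: dict  # customer id -> configuration metadata


def repair_snapshot(snapshot: Snapshot, bad_customer_ids: set) -> Snapshot:
    """Return a new snapshot with the listed customers' configurations removed.

    The original snapshot is left untouched so it can still be inspected.
    """
    kept = {
        customer_id: config
        for customer_id, config in snapshot.configs.items()
        if customer_id not in bad_customer_ids
    }
    return Snapshot(version=snapshot.version + 1, configs=kept)


# Hypothetical data: tenant-b carries the incompatible metadata.
lkg = Snapshot(
    version=7,
    configs={
        "tenant-a": {"route": "/a"},
        "tenant-b": {"route": "/b"},
        "tenant-c": {"route": "/c"},
    },
)
repaired = repair_snapshot(lkg, {"tenant-b"})
```

The sketch builds a new snapshot rather than mutating the old one, which keeps the contaminated LKG available for diagnosis while the repaired copy is deployed.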
"This incompatibility triggered a crash during asynchronous processing within the data plane service."
Source: Microsoft Azure PIR, October 2025
First signal to all-clear: 8h 24m.
A delayed data-plane crash outran staged validation.
The immediate cause was incompatible customer configuration metadata generated by a valid sequence of configuration changes across two different control-plane build versions. When edge servers processed the metadata asynchronously, a latent data-plane bug was exposed and edge sites crashed.
The deeper cause was that validation stages observed positive health signals before the asynchronous crash appeared. The configuration propagated through most of the fleet and was written into the last-known-good snapshot. That made simple rollback unsafe and forced operators to manually edit the LKG before redeploying it worldwide.
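One way to picture the gap is a promotion gate that refuses to advance a stage, or update the LKG, until the stage has stayed crash-free for longer than the delay before asynchronous processing fails. The sketch below assumes such a gate; it is not a description of Azure Front Door's protection system. `StageObservation`, `may_promote`, and the 300-second delay are hypothetical, chosen only to match the "several minutes after load" timing described above.

```python
from dataclasses import dataclass


@dataclass
class StageObservation:
    """Hypothetical health summary for one rollout stage."""
    stage: str
    seconds_since_load: float  # time since the stage loaded the new metadata
    crashes_seen: int          # data-plane crashes observed since load


def may_promote(obs: StageObservation, soak_seconds: float) -> bool:
    """Allow promotion only after a full crash-free soak window."""
    if obs.crashes_seen > 0:
        return False
    return obs.seconds_since_load >= soak_seconds


# Assumed value: asynchronous processing fails roughly five minutes after load.
ASYNC_DELAY_SECONDS = 300
# Soak for twice the assumed delay so a delayed crash has time to surface.
SOAK_SECONDS = 2 * ASYNC_DELAY_SECONDS

# Healthy so far, but observed too early to trust.
early = StageObservation("pre-production", seconds_since_load=60, crashes_seen=0)
# Observed after the soak window, by which time the crash has appeared.
late = StageObservation("pre-production", seconds_since_load=600, crashes_seen=1)

early_ok = may_promote(early, SOAK_SECONDS)
late_ok = may_promote(late, SOAK_SECONDS)
```

Under these assumptions both observations are held back: the first because the soak window has not elapsed, the second because the delayed crash has been seen. A gate keyed only on the health signal at 60 seconds would have promoted the first one.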