~/library/FM-024
FM-024Microsoft Azure2025-10-29impact 8h 24mSEV-1

Azure Front Door propagated metadata its edge could not process.

A valid sequence of customer configuration changes across two control-plane builds produced incompatible metadata. The metadata passed staged health checks, updated last-known-good state, then crashed the data plane asynchronously.

configedgecdnrollbacklast-known-goodasynchronous

Azure Front Door sat in front of customer applications and Microsoft services as a global edge routing layer. Its control plane generated customer configuration metadata; its data plane consumed that metadata at edge sites to route traffic, apply purge operations, and enforce WAF behavior. The system needed configuration changes to propagate quickly, but only after edge servers had proven they could process them.

On October 29, 2025, a valid sequence of customer configuration changes crossed two different control-plane build versions. That sequence generated incompatible metadata. The metadata reached pre-production and later stages while health signals still looked positive because the crash happened during asynchronous data-plane processing, several minutes after load.

By the time edge sites started crashing globally, Azure Front Door's configuration protection system had already allowed the metadata through most of the fleet. Worse, the last-known-good snapshot now contained the same incompatible configuration. A direct rollback would have reintroduced the problem.

Microsoft blocked new configuration propagation, manually edited the latest LKG to remove the problematic customer configurations, deployed the repaired snapshot globally, and gradually rebalanced traffic as edge sites recovered. The timing gap — health checks advancing while asynchronous workers crashed — let bad config get written into the last-known-good snapshot before anyone could see it was bad. Once that happened, rollback meant editing the snapshot, not reverting to it.

This incompatibility triggered a crash during asynchronous processing within the data plane service.// Microsoft Azure PIR, October 2025

From the first signal to all-clear in 8h 24m.

15:35 UTC
Corrupt metadata introduced
A specific sequence of customer configuration changes across two control-plane build versions generated incompatible metadata.
15:41 UTC
Customer impact begins
Edge data-plane servers began crashing as asynchronous processing encountered the incompatible metadata.
15:48 UTC
Monitoring triggers investigation
Azure monitoring detected widespread issues and engineers began investigating. The configuration protection system had already blocked new and in-flight customer configuration propagation.
17:40 UTC
Edited last-known-good deploy starts
Because the latest last-known-good snapshot contained the bad metadata, Microsoft edited the snapshot to remove problematic configurations and began global deployment.
18:30 UTC
AFD DNS servers recover
Azure Front Door DNS servers recovered, allowing traffic to be manually rebalanced to a small number of healthy edge sites.
00:05 UTC Oct 30
Customer impact mitigated
Availability and latency returned to pre-incident levels, and Microsoft confirmed the AFD impact was mitigated for customers.

A delayed data-plane crash outran staged validation.

The immediate cause was incompatible customer configuration metadata generated by a valid sequence of configuration changes across two different control-plane build versions. When edge servers processed the metadata asynchronously, a latent data-plane bug was exposed and edge sites crashed.

The deeper cause was that validation stages observed positive health signals before the asynchronous crash appeared. The configuration propagated through most of the fleet and was written into the last-known-good snapshot. That made simple rollback unsafe and forced operators to manually edit the LKG before redeploying it worldwide.

What turned a valid customer change into global edge failure.

01
Cross-version validation had a gap.
The incompatible metadata only appeared across two control-plane build versions. Pre-production validation did not cover all feature interactions across those versions.
02
Health checks were faster than the failure mode.
The protection system advanced after receiving healthy data-plane signals, but the crash surfaced after asynchronous processing. A one-minute stage delay was not long enough to catch this class of defect.
03
Rollback state was contaminated.
The last-known-good snapshot was updated after the bad configuration appeared healthy. Recovery therefore required editing the latest LKG rather than reverting directly to it.
04
First-party dependencies shared the same ingress.
Azure Portal and other Microsoft services used AFD. Some could fail away, but services without established fallback paths continued to fail even after the portal shell recovered.

What to take from this incident.

01
Bake configuration longer than its slowest failure mode.If parsing or asynchronous processing can fail minutes after load, staged rollout must wait long enough to observe that work. Health signals should include delayed worker failures.
02
Protect last-known-good snapshots from early promotion.Do not mark a configuration as rollback-safe until it has survived data-plane processing, traffic, and delayed jobs. Keep an older clean snapshot available when global propagation is still baking.
03
Separate config processing from traffic serving.Customer metadata should be validated and processed in isolated workers before it reaches live edge processes. A bad configuration should fail the pipeline, not crash request handlers.

Read the original.

Post Incident Review - Azure Front Door - Connectivity issues across multiple regions
azure.status.microsoft
← previous
FM-023 · Service Control quota update rejects APIs
next →
FM-025 · Workers KV storage dependency outage