FM-005 · Fastly · 2021-06-08 · impact ~1h · SEV-1

A latent CDN bug, woken by a valid config change.

How a bug in a software update Fastly deployed on May 12, 2021 sat dormant in production for nearly a month, why a fully supported customer configuration push tripped it across 85% of the global network in under a minute, and how Fastly detected the failure within 60 seconds without that being enough.

cdn · deploy · config

Fastly's CDN made customer configuration global by design. A customer could update service behavior, and the edge would carry that configuration quickly to the points of presence that needed it. That speed was the product promise. It also meant every POP on the same software build shared the same exposure: if a valid configuration exercised a latent defect, the trigger could propagate at CDN speed.

On May 12, 2021, Fastly shipped a software update to its edge platform. The release passed pre-deployment testing and entered service across the global fleet without incident. It contained a bug, but the bug lived on a code path that no customer's live configuration was exercising. For twenty-seven days the defect sat in production with no traffic touching it, no errors firing, and no signal in any dashboard that anything was wrong.

On the morning of June 8, a Fastly customer updated their service configuration. The change was valid. The product documented it. Nothing about the change was unusual, and nothing in the platform suggested it would be different from any other configuration push. As Fastly's POPs pulled and applied the new configuration, each one hit the dormant May 12 defect. Within seconds, 85% of the network was returning errors. Within a minute, Fastly's own monitoring confirmed the scope.

For the next forty minutes, engineers worked back from the global failure to the customer configuration push, then to the dormant code path the push had activated, then to the underlying defect from May 12. Disabling the offending configuration let the edge return to a known-good state, and by 49 minutes after impact, 95% of the network was serving normal traffic again. A permanent software fix that removed the defective code path from the platform began rolling out later that day.

The customer change did not cause the outage by itself, and the May 12 release did not cause the outage by itself. The outage required both pieces to meet: a latent defect that survived in production long enough to look harmless, and a uniform global platform that propagated configuration changes with no staging boundary between one customer's valid input and a large share of the edge. The useful lesson is not that a bug shipped. Bugs ship. The useful lesson is that the platform had no place to absorb the bug between the moment it shipped and the moment normal customer behavior found it.

We detected the disruption within one minute, then identified and isolated the cause, and disabled the configuration. Within 49 minutes, 95% of our network was operating as normal.
// Fastly, Summary of June 8 outage

From the first signal to all-clear in ~1h.

May 12, 2021
Software release ships with a latent defect
Fastly deploys a software update to its edge platform. The release contains a bug in a code path that handles a specific configuration combination. No customer's live configuration exercises the path, so the bug ships unnoticed and sits in production.
Jun 8, 09:47 UTC
Customer pushes a valid configuration; outage begins
A Fastly customer makes a configuration change that is fully supported and within the bounds of normal operation. The new configuration is the first to exercise the dormant code path from May 12. As Fastly's points of presence apply the new configuration, 85% of the network starts returning errors.
Jun 8, 09:48 UTC
Automated monitoring picks up the disruption
Fastly's monitoring detects the global error spike within one minute of impact. The on-call team is paged immediately.
Jun 8, 09:58 UTC
Public status post published
Fastly publishes the first status update acknowledging a global disruption while the team continues to trace the trigger.
Jun 8, 10:27 UTC
Engineering identifies the offending configuration
Engineers correlate the global failure with a recent customer configuration push and isolate the specific input combination that activates the dormant May 12 defect.
Jun 8, 10:36 UTC
Configuration disabled, services begin recovering
Fastly disables the problematic configuration. Edge POPs stop hitting the defective code path. Within 49 minutes of initial impact, 95% of the network is operating normally.
Jun 8, 12:35 UTC
Incident mitigated across the network
The remaining affected POPs return to normal operation. Customer impact ends, though the underlying defect in the May 12 software is still present.
Jun 8, 17:25 UTC
Permanent software fix begins rolling out
Fastly begins deploying a patch that removes the defective code path from the platform, so the same input combination cannot trigger the bug again.

A bug that needed one customer's input to wake up.

On May 12, 2021, Fastly shipped a software release to its edge platform. The release contained a bug on a code path that handled a specific configuration combination. No customer's live configuration exercised that combination, so the defect produced no errors, no alerts, and no traffic against it. For twenty-seven days the bug sat in production with nothing touching it.

On the morning of June 8, a customer pushed a configuration change that was valid, documented, and supported. It was also the first configuration in production to exercise the dormant code path. As Fastly's points of presence pulled and applied the new configuration, each one hit the same bug at almost the same moment. Within seconds, 85% of the global network was returning errors. Fastly's monitoring detected the disruption within one minute, but by then the customer-visible impact was already global.

The shape of the failure was the shape of the platform. Fastly ran the same software version on every POP and applied customer configuration changes uniformly. There was no canary stage for either side of the equation: not for the software release a month earlier, and not for the customer configuration push that triggered it. Once the bug was reached at one POP, every other POP running the same build was equally vulnerable, and the propagation of the configuration was as fast as the platform was designed to make it.

What turned a dormant defect into a near-total outage.

01
Test inputs did not cover the latent code path
The May 12 release passed Fastly's pre-deployment testing because no test case exercised the configuration combination that triggered the bug. Synthetic tests cannot enumerate the full space of valid customer inputs at CDN scale, so a defect on a niche code path can ship and remain invisible until a customer's real input finds it.
02
Customer configuration changes propagate globally at once
Fastly's edge applies new customer configuration to every POP that needs it, quickly and uniformly. That uniformity is part of the product. It also means a single push has no built-in staging if it interacts badly with a software version already running everywhere.
03
No automatic halt when edge serving processes began failing
When edge processes started returning errors at one POP, there was no platform-level signal that paused propagation of the configuration to others or rolled it back. The cascade moved at the speed of the configuration push, not at the speed of incident response.
04
Long dormancy hid the defect from operational signals
Twenty-seven days of normal traffic gave no indication that the May 12 release contained a latent issue. There was no proactive audit that re-exercised recently released code paths against the current pool of customer configurations. The defect was invisible until a customer accidentally found it.
05
Uniform software version across the entire fleet
Every edge POP was running the same software build, so every POP shared the same vulnerability. Without a small slice of the fleet held back on a previous-generation build, there was no untouched capacity to keep serving while the new bug was diagnosed.
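
One way to make that holdback enforceable, sketched in Go with assumed names and an assumed 10% figure (this is not Fastly's tooling): a rollout gate that refuses to converge the fleet onto a single build while too few POPs remain on the previous one.

```go
// diversity.go: an illustrative fleet-diversity invariant.
package fleet

import "fmt"

// Fraction of POPs that must stay pinned to the previous (N-1) build;
// the 10% figure is an assumption for illustration.
const holdback = 0.10

// CheckDiversity gates a rollout. buildOf maps POP name to the build it
// runs; the rollout may not converge the fleet onto current while fewer
// than holdback of POPs still run previous.
func CheckDiversity(buildOf map[string]string, current, previous string) error {
	onPrev := 0
	for _, b := range buildOf {
		if b == previous {
			onPrev++
		}
	}
	if float64(onPrev) < holdback*float64(len(buildOf)) {
		return fmt.Errorf("rollout of %s blocked: only %d/%d POPs still on %s",
			current, onPrev, len(buildOf), previous)
	}
	return nil
}
```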

What to take from this incident.

01
Treat customer configuration as untrusted input on every code path it touches
Code that compiles, parses, or applies customer configuration sits on the boundary between the platform and the outside world. Fuzzing, property-based tests, and replay of real recent customer configurations against new builds belong in the standard pre-deployment gate. Hand-written unit tests cannot cover the input space.
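
A minimal sketch of that gate, using Go's built-in fuzzing (Go 1.18+). ServiceConfig, ParseServiceConfig, and the seed values are hypothetical stand-ins for whatever compiles customer configuration; the leverage is seeding the corpus with recently pushed production configs so the fuzzer mutates from inputs the platform actually sees.

```go
// config_fuzz_test.go: a minimal sketch with a hypothetical parser.
// Run with `go test -fuzz=FuzzParseServiceConfig`.
package config

import (
	"encoding/json"
	"testing"
)

// ServiceConfig and ParseServiceConfig stand in for whatever compiles a
// customer's configuration; they are illustrative, not Fastly's code.
type ServiceConfig struct {
	Backend string `json:"backend"`
	Shield  string `json:"shield"`
}

func ParseServiceConfig(raw []byte) (*ServiceConfig, error) {
	var c ServiceConfig
	if err := json.Unmarshal(raw, &c); err != nil {
		return nil, err
	}
	return &c, nil
}

func FuzzParseServiceConfig(f *testing.F) {
	// Seed with real, recently pushed customer configurations so mutation
	// starts from the inputs the platform actually sees.
	f.Add([]byte(`{"backend":"origin-1","shield":"iad-va-us"}`))
	f.Add([]byte(`{"backend":"origin-2"}`))
	f.Fuzz(func(t *testing.T, raw []byte) {
		// Any panic inside the parser fails the fuzz run automatically;
		// an explicit error return for malformed input is acceptable.
		_, _ = ParseServiceConfig(raw)
	})
}
```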
02
Canary the platform, not just the code
Software rollouts often have canary stages, but configuration changes that flow through the platform usually do not. A POP-weighted canary for customer configuration would have caught this bug on a small slice before it reached the whole fleet. The unit of canarying should match the unit that can take the platform down.
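
What POP-weighted staging could look like, as a sketch. The ring weights, pushTo, and errorRate are illustrative assumptions rather than Fastly's rollout mechanism; the shape that matters is that a configuration version must soak cleanly on a small slice of POPs before each wider ring.

```go
// canary.go: an illustrative sketch of POP-weighted config propagation.
package rollout

import (
	"fmt"
	"time"
)

type POP struct{ Name string }

// Each ring covers a larger share of the fleet; the push only advances
// after the previous ring has soaked cleanly on real traffic.
var ringWeights = []float64{0.01, 0.05, 0.25, 1.00}

func PropagateConfig(version string, pops []POP) error {
	for _, w := range ringWeights {
		n := int(float64(len(pops)) * w)
		if n < 1 {
			n = 1
		}
		ring := pops[:n]
		for _, p := range ring {
			if err := pushTo(p, version); err != nil {
				return fmt.Errorf("push %s to %s: %w", version, p.Name, err)
			}
		}
		time.Sleep(2 * time.Minute) // soak on real traffic
		if rate := errorRate(ring); rate > 0.01 {
			return fmt.Errorf("halted at %.0f%% of fleet: ring error rate %.2f%%",
				w*100, rate*100)
		}
	}
	return nil
}

// Hypothetical hooks into the distribution service and POP metrics.
func pushTo(p POP, version string) error { return nil }
func errorRate(ring []POP) float64       { return 0 }
```

Under rings like these, either the May 12 build or the June 8 configuration push would have had to prove itself on roughly 1% of POPs before touching the rest of the fleet.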
03
Build automatic halt conditions into propagation, not just deployment
When edge processes fail or error rates climb sharply, the system propagating a change should stop on its own. A configuration rollout that has no way to stop is one that can only be stopped by humans, and humans cannot react faster than a global push.
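
One way to express that halt, sketched with assumed hooks rather than a real control-plane API: a watchdog cancels the propagation context the moment the fleet-wide error rate crosses a threshold, so the push stops itself instead of waiting for a pager.

```go
// halt.go: an illustrative automatic halt wired into propagation.
package rollout

import (
	"context"
	"time"
)

// WatchAndHalt cancels the propagation context as soon as the fleet-wide
// error rate crosses threshold, then rolls back. The push loop must check
// ctx.Done() between POPs so cancellation actually stops it mid-flight.
func WatchAndHalt(ctx context.Context, cancel context.CancelFunc, threshold float64) {
	tick := time.NewTicker(5 * time.Second)
	defer tick.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-tick.C:
			if fleetErrorRate() > threshold {
				cancel()   // stop the push without waiting for a human
				rollback() // converge on the last known-good configuration
				return
			}
		}
	}
}

// Hypothetical hooks into fleet metrics and the configuration store.
func fleetErrorRate() float64 { return 0 }
func rollback()               {}
```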
04
Audit recently shipped code paths against fresh production inputs
Dormant defects can survive in production for weeks because no current customer configuration exercises them. Regularly replaying real recent customer configurations against the latest build, separate from the canary, finds latent bugs before a customer accidentally finds them.
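
A sketch of that replay audit under assumed interfaces: pull recent production configurations and apply each one to the candidate build in a sandbox, so a defect like the May 12 one surfaces in a harness instead of on the live edge.

```go
// replay.go: an illustrative replay audit. Config, Sandbox, and the
// logging are assumptions about interfaces, not Fastly's tooling.
package audit

import (
	"fmt"
	"log"
)

// Config is one customer configuration as pushed to production.
type Config struct {
	Customer string
	Raw      []byte
}

// Sandbox applies a configuration to an isolated instance of the build
// under test; an error or crash here is a latent defect caught early.
type Sandbox interface {
	Apply(raw []byte) error
}

// ReplayRecent applies every recent production configuration to the
// candidate build and fails if any of them trips it.
func ReplayRecent(sb Sandbox, configs []Config) error {
	failures := 0
	for _, c := range configs {
		if err := sb.Apply(c.Raw); err != nil {
			failures++
			log.Printf("latent defect candidate: customer=%s err=%v", c.Customer, err)
		}
	}
	if failures > 0 {
		return fmt.Errorf("%d of %d replayed configs failed on this build", failures, len(configs))
	}
	return nil
}
```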
05
Detection in one minute is not the same as recovery in one minute
Fastly's automated monitoring picked up the disruption within sixty seconds. The remaining forty-eight minutes to 95% recovery were spent isolating the trigger and rolling it back. Plan recovery time on the assumption that detection is the start, not the end, of an incident, and on the assumption that some incidents resist diagnosis even when the signal is loud.

Read the original.

Summary of June 8 outage
fastly.com