FM-002 · GitHub · 2018-10-21 · impact 24h 11m · SEV-1

A 43-second partition splits GitHub's database for a day.

How a routine 100G optical-equipment swap caused a 43-second link cut between GitHub's primary East Coast data center and the rest of its network, why Orchestrator promoted West Coast replicas during that window, and why reconciling the writes taken on both coasts consumed the next 24 hours.

database · replication · failover

GitHub's database failover depended on one condition: after a replica became primary, the old primary had to stop accepting writes. The production MySQL topology spanned three places. Primary clusters lived in the US East Coast data center, close to most of the application. Replicas lived in a US West Coast site and in public cloud regions, ready to take over if the East Coast was lost for more than a brief partition. Orchestrator managed that topology with Raft consensus. It could promote the West Coast. It could not fence the East Coast.
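Fencing can be expressed as a hard precondition of promotion rather than a follow-up step. The sketch below is not GitHub's or Orchestrator's mechanism; it assumes a MySQL primary reachable over a management path, the mysql-connector-python client, and placeholder hostnames and credentials. The shape of the check is what matters: if the old primary cannot be proven read-only, the promotion does not happen.

```python
# Hypothetical sketch: fence the old primary before promoting a replica.
# Hostnames, credentials, and the timeout are placeholders, not details
# from GitHub's setup.
import mysql.connector


def fence_old_primary(host: str, user: str, password: str, timeout: int = 5) -> bool:
    """Flip the old primary to super_read_only and verify it took effect.

    Returns True only when the old primary provably rejects new writes.
    An unreachable primary is not a fenced primary.
    """
    try:
        conn = mysql.connector.connect(
            host=host, user=user, password=password, connection_timeout=timeout
        )
    except mysql.connector.Error:
        return False  # cannot reach it, so cannot prove it stopped writing

    try:
        cur = conn.cursor()
        # super_read_only blocks writes even from accounts with SUPER.
        cur.execute("SET GLOBAL super_read_only = ON")
        cur.execute("SELECT @@GLOBAL.super_read_only")
        (value,) = cur.fetchone()
        return value == 1
    finally:
        conn.close()


def promote(replica_host: str, old_primary_host: str, user: str, password: str) -> None:
    # The fence is a precondition of the promotion, not a follow-up.
    if not fence_old_primary(old_primary_host, user, password):
        raise RuntimeError("old primary not provably fenced; refusing to promote")
    # ... proceed with promoting replica_host ...
```

Under this rule a primary that cannot be reached cannot be proven fenced, so the promotion is refused; production designs pair the check with an out-of-band fence such as STONITH or a network ACL so a genuine outage can still fail over.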

On the evening of October 21, 2018, a routine 100G optical-equipment swap on GitHub's US East Coast network cut the link between the East Coast hub and the primary data center. The cut lasted 43 seconds. That was long enough. The East Coast Orchestrator lost quorum, the West Coast and public-cloud Orchestrator nodes formed a majority, and West Coast replicas were promoted. The application tier began following the new topology. During those 43 seconds, the East Coast primaries — still healthy, still connected to the application servers that lived in the same data center — kept committing writes. When the link returned, both coasts had taken writes that the other had not seen.

The recovery was slower than the failure by orders of magnitude. The team paused webhooks and GitHub Pages builds at 23:19 UTC, then decided early on October 22 to restore East Coast primaries from backup and replay diverged writes instead of attempting an automatic merge. That choice avoided guessing which side was authoritative, but it exposed another constraint: an East Coast application writing to a West Coast primary across the country could not absorb the added round trip without degrading the user experience. Restoring multi-terabyte clusters took hours of decompression, checksumming, and preparation. Replicas caught up on a power-decay curve, not a straight line, because production traffic kept generating new writes the whole time.
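The latency constraint is simple arithmetic. The round-trip times and query counts below are illustrative assumptions, not measurements from the report; they only show why a request that issues several writes in sequence cannot hide a cross-country round trip on each one.

```python
# Back-of-envelope: added network latency when an East Coast application
# writes to a West Coast primary. All figures are illustrative assumptions.
INTRA_DC_RTT_MS = 0.5        # app server to a primary in the same data center
CROSS_COUNTRY_RTT_MS = 60.0  # a typical US East <-> US West round trip


def added_write_latency_ms(sequential_writes: int, rtt_ms: float) -> float:
    """Network time for a request that issues its writes one after another."""
    return sequential_writes * rtt_ms


for writes in (1, 5, 20):
    local = added_write_latency_ms(writes, INTRA_DC_RTT_MS)
    remote = added_write_latency_ms(writes, CROSS_COUNTRY_RTT_MS)
    print(f"{writes:>2} sequential writes: {local:6.1f} ms local vs {remote:7.1f} ms cross-country")
```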

The failover back to the original topology completed at 16:24 UTC on October 22. By then GitHub had accumulated more than 5 million queued webhook events and 80,000 GitHub Pages builds. Draining the queue took until 23:03 UTC, twenty-four hours and eleven minutes after the original cut. About 200,000 webhook payloads had aged past their internal TTL during the wait and were dropped.

The original mistake was not the equipment swap or the partition. Both are normal facts of running a network. The mistake was a failover protocol that promoted a new primary on the West Coast without making the old one on the East Coast unable to accept writes. Forty-three seconds of network blip on one side of the country, plus an unfenced primary on the other, plus a backup-and-replay recovery path that scaled with the size of the data — that is what twenty-four hours of GitHub.com degradation looked like.

The database clusters in both data centers now contained writes that were not present in the other data center.
GitHub, October 21 post-incident analysis

From the first signal to all-clear in 24h 11m.

Oct 21, 22:52 UTC
Optical-equipment swap cuts the East–West link
A routine 100G optical equipment replacement disconnects the link between GitHub's US East Coast network hub and its primary US East Coast data center. The cut lasts 43 seconds.
22:52 UTC
Orchestrator promotes West Coast primaries
GitHub uses Orchestrator with Raft consensus to manage MySQL topology. The East Coast Orchestrator node loses quorum; the West Coast and public-cloud nodes establish a majority and promote West Coast replicas to primary.
22:54 UTC
Partition heals; both coasts have written
Connectivity returns. The East Coast primaries had committed writes during the 43-second window before they were demoted. The West Coast primaries are now committing writes too. Both data centers hold writes the other does not.
22:54 UTC
Monitoring triggers
Replication and topology alerts fire. The application tier follows Orchestrator and starts directing writes to West Coast primaries, with the round-trip latency of a cross-country write path.
23:09 UTC
Site status moves to yellow
GitHub's external status changes to yellow as latency and error rates climb. The full scope is not yet understood.
23:11 UTC
Incident escalated to red
An incident coordinator joins, the topology divergence is confirmed, and the site moves to red. The team pauses webhooks and GitHub Pages builds at 23:19 UTC to stop new derived work from being generated against an inconsistent database state.
Oct 22, 00:05 UTC
Recovery plan: restore from backups, then failback
The team decides not to attempt an automated merge of the two write sets. The plan is to bring East Coast primaries back from the most recent backups, replay diverged writes carefully, and then fail back.
06:51 UTC
First clusters restored; cross-country latency slows the app
Some clusters come back, but applications running in the East Coast data center that must write to West Coast primaries cannot absorb the additional cross-country round trip, and the user experience degrades.
11:12 UTC
All primaries back in the East; replicas catch up slowly
Primaries are re-established in the East Coast data center. Dozens of read replicas remain hours behind. Catch-up replication follows a power-decay curve, not a straight line, because daily traffic keeps generating new writes.
16:24 UTC
Topology restored to the original layout
Failover back to the original East Coast primary topology completes. GitHub begins draining the work that piled up during the incident.
16:45 UTC
Backlog processing begins
The team starts processing more than 5 million queued webhook events and around 80,000 Pages builds. About 200,000 webhook payloads have already exceeded their internal TTL and are dropped.
Oct 22, 23:03 UTC
Backlogs drained, status returns to green
All queued webhook events and Pages builds are processed. GitHub's external status returns to green. Total incident duration from the link cut to status green: 24 hours and 11 minutes.

A failover designed for outages, triggered by a blip.

GitHub kept its primary MySQL clusters in its US East Coast data center, with replicas in a US West Coast site and in public cloud regions. Orchestrator managed topology across all three using Raft consensus. The design tolerated the loss of any one location: the remaining sites would keep a quorum and could elect a new primary if needed. A 43-second cut of the East Coast network hub was enough for the East Coast Orchestrator to lose quorum, while the West Coast and public-cloud nodes formed a majority and promoted West Coast replicas to primary.
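The quorum arithmetic behind that promotion is small enough to show directly. The three voting locations come from the write-up; the names and helper below are illustrative, not Orchestrator's implementation.

```python
# Minimal sketch of Raft-style majority during the partition. The three
# locations are from the write-up; everything else is illustrative.
VOTERS = {"us-east", "us-west", "public-cloud"}


def majority(total_voters: int) -> int:
    return total_voters // 2 + 1


def has_quorum(reachable: set) -> bool:
    """True if the nodes that can still see each other form a majority."""
    return len(reachable & VOTERS) >= majority(len(VOTERS))


# During the 43-second cut, the East Coast node sees only itself, while the
# West Coast and public-cloud nodes still see each other.
print(has_quorum({"us-east"}))                  # False: East cannot act
print(has_quorum({"us-west", "public-cloud"}))  # True: West + cloud can promote
```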

The failover did not include a fence on the original primaries. During the 43 seconds the cut lasted, the East Coast primaries kept accepting writes from East Coast application servers, which never noticed the cut. When the link returned, both coasts had taken writes, and the West Coast clusters had begun ingesting application traffic against the new primaries. Each side now held writes the other did not. There was no safe automatic merge.
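The write-up does not say how the divergence was confirmed. One common way to check, assuming GTID-based replication and the mysql-connector-python client (both assumptions, not details from the report), is to subtract each primary's executed GTID set from the other's; if both differences are non-empty, each side committed transactions the other never received, and neither can simply be re-pointed at the other.

```python
# Illustrative split-brain check for two MySQL primaries, assuming
# GTID-based replication. Hostnames and credentials are placeholders.
import mysql.connector


def executed_gtids(host: str, user: str, password: str) -> str:
    conn = mysql.connector.connect(host=host, user=user, password=password)
    try:
        cur = conn.cursor()
        cur.execute("SELECT @@GLOBAL.gtid_executed")
        (gtid_set,) = cur.fetchone()
        return gtid_set
    finally:
        conn.close()


def divergence(host_a: str, host_b: str, user: str, password: str):
    """Return (transactions only on A, transactions only on B)."""
    a = executed_gtids(host_a, user, password)
    b = executed_gtids(host_b, user, password)
    conn = mysql.connector.connect(host=host_a, user=user, password=password)
    try:
        cur = conn.cursor()
        # GTID_SUBTRACT(x, y) is the part of set x that set y does not contain.
        cur.execute(
            "SELECT GTID_SUBTRACT(%s, %s), GTID_SUBTRACT(%s, %s)", (a, b, b, a)
        )
        only_on_a, only_on_b = cur.fetchone()
        return only_on_a, only_on_b
    finally:
        conn.close()
```

If both returned sets are non-empty, the topology is split-brained and the choice is the one GitHub faced: pick a baseline, restore, and replay.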

Recovery was bounded by the shape of the data. GitHub restored East Coast primaries from backup, then replayed the diverged writes in order before failing back. Backup restoration of multi-terabyte clusters took hours of decompression, checksumming, and preparation, and read replicas caught up on a power-decay curve while the rest of the world kept generating writes. By the time the original topology was restored, more than 5 million webhook events and 80,000 Pages builds had queued up, and about 200,000 webhook payloads had already aged past their internal TTL.
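A back-of-envelope calculation shows why the restore path set the clock. The cluster sizes and sustained throughputs below are assumptions chosen for illustration; the report says only "multi-terabyte" and "hours."

```python
# Rough floor on restore time for one cluster: stream, decompress, and
# verify the backup at some sustained rate. All figures are assumptions.
def restore_floor_hours(cluster_tb: float, throughput_mb_s: float) -> float:
    cluster_mb = cluster_tb * 1024 * 1024
    return cluster_mb / throughput_mb_s / 3600


for size_tb in (2, 5, 10):
    for rate in (200, 500):
        hours = restore_floor_hours(size_tb, rate)
        print(f"{size_tb:>2} TB at {rate} MB/s -> {hours:4.1f} h before replay even starts")
```

Replaying the diverged writes and letting dozens of replicas catch up comes on top of that floor.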

What stretched a 43-second partition into a 24-hour recovery.

01
Failover threshold tuned for outages, not blips
Orchestrator and the surrounding policy treated a brief loss of quorum the same as a longer outage and promoted on the West Coast quickly. A short delay or a confirmation step before promotion would have allowed a 43-second cut to heal without triggering failover; the sketch after this list shows one way to gate promotion on sustained unreachability.
02
No fence on the original primary
When West Coast replicas were promoted, nothing stopped the East Coast primaries from continuing to accept writes during the partition. Without STONITH, a quorum-gated write path, or another fencing mechanism, the old primaries kept committing transactions that were guaranteed to diverge from the new ones.
03
Application tier could not absorb cross-country writes
Applications running in the East Coast data center were not built to write to a West Coast primary at cross-country latency. When Orchestrator pointed them there, the additional round trip turned ordinary requests into slow ones, which constrained how long the team could leave the topology in the failed-over state.
04
Backup restore was slow at multi-terabyte scale
Recovery depended on restoring multi-terabyte clusters from backup, then replaying writes. Decompression, checksumming, and preparation took hours. The bandwidth of the restore path, not the bandwidth of the network, set the floor on recovery time.
05
Webhook and Pages backlogs grew faster than they could drain
Pausing derived work limited new divergence, but the work that had already queued kept piling up during the recovery window. Once the database came back, draining 5M webhooks and 80K Pages builds took hours, by which time about 200K webhook payloads had already aged past their TTL.
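As factor 01 notes, one way to keep a 43-second blip from triggering failover is to gate promotion on sustained unreachability rather than a single missed check. In the sketch below, only the 43-second figure comes from the incident; the probe interval, the margin, and the policy itself are assumptions, not GitHub's or Orchestrator's configuration.

```python
# Illustrative promotion gate: fail over only after the primary has been
# continuously unreachable for longer than any routine maintenance blip.
import time

LONGEST_ROUTINE_BLIP_S = 43                         # from the incident
PROMOTION_THRESHOLD_S = LONGEST_ROUTINE_BLIP_S * 3  # assumed margin
PROBE_INTERVAL_S = 5                                # assumed probe cadence


def should_promote(probe, threshold_s: float = PROMOTION_THRESHOLD_S,
                   interval_s: float = PROBE_INTERVAL_S) -> bool:
    """Decide failover after a failure is first suspected.

    Returns False if the primary comes back before the threshold (the cut was
    a blip), True once it has been continuously unreachable past the threshold.
    `probe` is any callable returning True while the primary is reachable; in
    practice it would aggregate checks from several vantage points.
    """
    started = time.monotonic()
    while time.monotonic() - started < threshold_s:
        if probe():
            return False  # a 43-second-style blip heals in place, no promotion
        time.sleep(interval_s)
    return True           # the outage has outlived any routine blip
```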

What to take from this incident.

01
Fence the old primary before you promote the new one.
An automatic failover that promotes a replica without making the original primary unable to accept writes is one network blip away from split-brain. Use STONITH, quorum-gated commits, or a network ACL that the orchestrator can flip — and make the fence a precondition of the promotion, not a follow-up.
02
Match failover timeouts to the partition durations you actually see.
If your link experiences 30–60-second blips during routine maintenance, an aggressive failover threshold will turn those blips into incidents. Look at the real distribution of partitions on your network and tune so brief cuts heal naturally while long ones still trigger failover.
03
Practice reconciliation on production-scale data before you need it.
Reconciling two diverged MySQL clusters at multi-terabyte scale under pressure is a slow, error-prone exercise if the tooling has not been built and rehearsed. Treat split-brain recovery as a first-class disaster scenario: build the tools, run them against realistic data, and document expected wall-clock time.
04
Make derived work durable enough to survive recovery time.
Webhooks and similar fan-out work need queues that hold long enough to outlast a worst-case database recovery. If your TTL is shorter than your DR target, an incident will silently throw work away even after the database is back; a minimal check is sketched after this list.
05
Build the application tier so failover is survivable, not just possible.
An application that can technically write to a remote primary but cannot do so without unacceptable latency is not actually failover-ready. Engineering an app for survivable failover means accepting the cross-region round-trip as a real operating mode, with timeouts, batching, and request shapes that work under it.
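As noted in lesson 04, the TTL-versus-recovery-budget comparison can be written down as a trivial check. The queue names and TTLs below are hypothetical; only the 24-hour-11-minute figure comes from this incident.

```python
# Illustrative config check: every queue's retention should outlive the
# worst-case recovery it must survive. Queue names and TTLs are made up.
from datetime import timedelta

WORST_CASE_RECOVERY = timedelta(hours=24, minutes=11)  # what this incident took

QUEUE_TTLS = {
    "webhook-deliveries": timedelta(hours=6),   # hypothetical
    "pages-builds": timedelta(hours=48),        # hypothetical
}


def queues_that_drop_work(ttls: dict, recovery_budget: timedelta) -> list:
    """Queues whose retention is shorter than the recovery they must outlast."""
    return [name for name, ttl in ttls.items() if ttl < recovery_budget]


for name in queues_that_drop_work(QUEUE_TTLS, WORST_CASE_RECOVERY):
    print(f"{name}: TTL shorter than worst-case recovery; events will age out")
```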

Read the original.

October 21 post-incident analysis
github.blog