FM-027 · GitHub · 2025-05-28 · impact 5h · SEV-2

A runner cache bug delayed Ubuntu Actions jobs

After a backend failover, GitHub Actions runner pools for public Ubuntu-24 jobs began assigning duplicate work. Effective capacity dropped, queues grew, and nearly one fifth of affected hosted-runner jobs were delayed.

ci scheduler failover assignment github-actions cache

summary

GitHub Actions depends on a scheduler making a quiet promise: each queued job should be assigned to runner capacity once. The user-visible service can still accept workflows while that promise is broken, but queue time starts to drift because some runner capacity is wasted on duplicated work instead of distinct jobs.

On May 28, 2025, that is what happened to public repositories using Ubuntu-24 standard hosted runners. After a backend failover, a cache misconfiguration changed assignment behavior in the affected hosted-runner pools. The service did not fail as a clean outage. Jobs waited because duplicate job assignments reduced useful capacity.

The impact boundary mattered. GitHub said other hosted runners, self-hosted runners, and private repository workflows were unaffected. That made this a pool-specific capacity incident rather than a platform-wide Actions outage, even though affected public Ubuntu-24 workflows saw delayed starts for five hours.

GitHub fixed the backend cache configuration by 12:45 UTC and scaled the impacted pools to work through the backlog. Queueing impact ended at 14:45 UTC. A failover that leaves cache state inconsistent can pass availability checks while wasting capacity on duplicated work. That kind of failure shows up as queue drift rather than as error rates, and catching it requires proving assignment uniqueness, not just API uptime.

Approximately 19.7% of Ubuntu-24 hosted runner jobs on public repos were delayed.

Source: GitHub Availability Report, May 2025

timeline · UTC

From the first signal to all-clear in 5h.

09:45 UTC

Job starts begin delaying

GitHub Actions workflows in public repositories using Ubuntu-24 standard hosted runners began seeing delayed job starts.

~10:30 UTC

Queueing impact becomes visible

Runner queues grew as duplicate job assignments reduced effective capacity in the impacted hosted-runner pools.

~12:00 UTC

Backlog persists

The misconfiguration continued to suppress capacity, leaving public Ubuntu-24 jobs queued while other runner types remained unaffected.

12:45 UTC

Backend cache fixed

GitHub fixed the configuration issue through backend cache updates and began scaling the affected pools to drain queued jobs.

14:45 UTC

Queueing impact mitigated

The backlog cleared and delayed job-start impact was fully mitigated.

root cause

Failover left runner assignment cache behavior inconsistent.

The immediate cause was a backend cache misconfiguration after a failover. The cache behavior led to duplicate job assignments in the public Ubuntu-24 standard hosted-runner pools, which reduced effective available capacity and delayed job starts.

The deeper cause was that failover validation did not check assignment uniqueness or the resulting capacity effects in the runner scheduler. The system could keep accepting work, but the pool's useful throughput fell because some runner capacity went to duplicate assignments of the same job instead of distinct queued jobs.
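To make the capacity effect concrete, here is a minimal sketch of how duplicate assignments translate into lost throughput. It assumes a hypothetical log of `(job_id, runner_id)` assignment events; the event shape, the function name, and the sample data are illustrative and do not come from GitHub's report.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def duplicate_assignment_stats(
    assignments: Iterable[Tuple[str, str]],
) -> Dict[str, float]:
    """Summarize how much assignment work went to already-assigned jobs.

    `assignments` is a hypothetical stream of (job_id, runner_id) events,
    one per assignment the scheduler handed out.
    """
    per_job = Counter(job_id for job_id, _runner_id in assignments)
    total = sum(per_job.values())
    distinct = len(per_job)
    return {
        "total_assignments": total,
        "distinct_jobs": distinct,
        # Every assignment beyond the first for a job occupies a runner
        # without advancing the queue.
        "wasted_assignments": total - distinct,
        "useful_fraction": distinct / total if total else 1.0,
    }


# Illustrative data: five runners were handed work, but only three
# distinct jobs made progress.
events = [("j1", "r1"), ("j1", "r2"), ("j2", "r3"), ("j3", "r4"), ("j3", "r5")]
stats = duplicate_assignment_stats(events)
```

In this made-up sample, `useful_fraction` is 0.6: the pool looks fully busy while delivering 60% of its nominal throughput, which is the shape of failure described above.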

contributing factors

What turned failover cleanup into CI queueing.

The failure reduced capacity rather than stopping service.

Actions was not completely unavailable. Jobs queued and eventually ran, which can make detection and prioritization harder than a clean error-rate spike.

Impact was concentrated in one runner pool.

The incident affected public repositories using Ubuntu-24 standard hosted runners. Other hosted runners, self-hosted runners, and private repository workflows were unaffected, so aggregate platform signals could understate the pool-specific backlog.
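A small sketch of why an aggregate signal can hide a single-pool backlog. The pool names and queue times below are invented for illustration; only the idea of segmenting by pool comes from the incident.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, Tuple


def median_queue_seconds_by_pool(
    samples: Iterable[Tuple[str, float]],
) -> Dict[str, float]:
    """Group (pool, queue_seconds) samples and take the median per pool."""
    by_pool = defaultdict(list)
    for pool, seconds in samples:
        by_pool[pool].append(seconds)
    return {pool: median(values) for pool, values in by_pool.items()}


# Hypothetical samples: one pool is badly backed up, two are healthy.
samples = (
    [("ubuntu-24-public", 900)] * 2
    + [("ubuntu-22-public", 20)] * 4
    + [("windows-public", 30)] * 4
)

overall = median(seconds for _pool, seconds in samples)  # 30.0
per_pool = median_queue_seconds_by_pool(samples)
# per_pool["ubuntu-24-public"] is 900.0 while the overall median stays at 30.0
```

The platform-wide median looks healthy because the affected pool contributes a minority of samples; the per-pool view is where the backlog becomes visible.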

Backlog needed extra capacity after the fix.

Correcting cache behavior stopped new duplicate assignments, but queued work still had to drain. GitHub scaled up the impacted pools to restore normal queue times.
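The need for extra capacity follows from simple queue arithmetic: a backlog only shrinks when the pool can start jobs faster than new ones arrive. The sketch below uses a deliberately simplified constant-rate model, and all numbers are invented rather than taken from GitHub's report.

```python
def drain_hours(
    backlog_jobs: float,
    arrival_per_hour: float,
    capacity_per_hour: float,
) -> float:
    """Estimate hours needed to clear a backlog under constant rates.

    Returns infinity when capacity does not exceed the arrival rate,
    because the queue then never shrinks.
    """
    surplus = capacity_per_hour - arrival_per_hour
    if surplus <= 0:
        return float("inf")
    return backlog_jobs / surplus


# Hypothetical pool: 3000 jobs queued, 1000 new jobs arriving per hour.
at_normal_capacity = drain_hours(3000, 1000, 1000)  # inf: fix alone is not enough
after_scaling_up = drain_hours(3000, 1000, 2500)    # 2.0 hours
```

Under this model, a pool sized exactly for steady-state demand never recovers on its own after the fix, which is why temporary scale-up belongs in the mitigation plan.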

lessons

What to take from this incident.

Test scheduler invariants after failover.

Failover validation should include uniqueness, lease ownership, duplicate assignment rate, and useful throughput. A scheduler can be up while wasting capacity.
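One way such a post-failover check could look is sketched below. It assumes a hypothetical snapshot of lease records; the `Lease` type, its fields, and the two invariants chosen (one runner per job, one job per runner) are illustrative assumptions, not a description of GitHub's scheduler.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Lease:
    """Hypothetical lease record: a runner's claim on a job."""
    job_id: str
    runner_id: str
    active: bool


def check_assignment_invariants(leases: Iterable[Lease]) -> List[str]:
    """Return human-readable violations; an empty list means the check passed."""
    runners_by_job = defaultdict(set)
    jobs_by_runner = defaultdict(set)
    for lease in leases:
        if not lease.active:
            continue  # Expired or released leases do not count.
        runners_by_job[lease.job_id].add(lease.runner_id)
        jobs_by_runner[lease.runner_id].add(lease.job_id)

    violations = []
    for job_id, runners in sorted(runners_by_job.items()):
        if len(runners) > 1:
            violations.append(f"job {job_id} leased to {len(runners)} runners")
    for runner_id, jobs in sorted(jobs_by_runner.items()):
        if len(jobs) > 1:
            violations.append(f"runner {runner_id} holds {len(jobs)} jobs")
    return violations
```

Running a check of this kind against the post-failover state, and failing the validation when the list is non-empty, would turn "the scheduler is up" into "the scheduler is assigning each job once".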

Monitor queue health per runner pool.

Hosted CI capacity is segmented by image, visibility, size, and region. Alert on queue time, assignment duplication, and idle capacity at the same granularity users experience.
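As a rough sketch, per-pool alerting could key metrics by the same segments and evaluate each pool independently. The type names, metric fields, and thresholds here are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class PoolKey:
    """Hypothetical pool identity, matching the segmentation users see."""
    image: str
    visibility: str
    size: str
    region: str


@dataclass
class PoolMetrics:
    p95_queue_seconds: float
    duplicate_assignment_rate: float  # duplicates / total assignments
    idle_runner_fraction: float       # idle runners / total runners


def pools_to_alert(
    metrics: Dict[PoolKey, PoolMetrics],
    max_queue_seconds: float = 300.0,   # illustrative threshold
    max_duplicate_rate: float = 0.01,   # illustrative threshold
    max_idle_fraction: float = 0.2,     # illustrative threshold
) -> List[Tuple[PoolKey, List[str]]]:
    """Return (pool, reasons) for every pool breaching a threshold."""
    alerts = []
    for key, m in metrics.items():
        reasons = []
        queue_breached = m.p95_queue_seconds > max_queue_seconds
        if queue_breached:
            reasons.append("queue time")
        if m.duplicate_assignment_rate > max_duplicate_rate:
            reasons.append("duplicate assignments")
        # Jobs waiting while runners sit idle suggests capacity is not
        # reaching the queue, rather than a plain capacity shortage.
        if queue_breached and m.idle_runner_fraction > max_idle_fraction:
            reasons.append("idle capacity while queued")
        if reasons:
            alerts.append((key, reasons))
    return alerts
```

Evaluating each `PoolKey` separately means a backlog confined to one image and visibility combination raises an alert even when every other pool is healthy.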

Plan for backlog drain as part of mitigation.

Fixing the root cause does not restore user experience until the queue drains. Keep runbooks for temporary pool scaling and admission control during recovery.

sources

Read the sources.

GitHub Availability Report: May 2025

github.blog ↗
