After a backend failover, GitHub Actions runner pools for public Ubuntu-24 jobs began assigning duplicate work. Effective capacity dropped, queues grew, and nearly one fifth of affected hosted-runner jobs were delayed.
GitHub Actions depends on a scheduler making a quiet promise: each queued job should be assigned to runner capacity once. The user-visible service can still accept workflows while that promise is broken, but queue time starts to drift because some runner capacity is wasted on duplicated work instead of distinct jobs.
On May 28, 2025, that is what happened to public repositories using Ubuntu-24 standard hosted runners. After a backend failover, a cache misconfiguration changed assignment behavior in the affected hosted-runner pools. The service did not fail as a clean outage. Jobs waited because duplicate job assignments reduced useful capacity.
The impact boundary mattered. GitHub said other hosted runners, self-hosted runners, and private repository workflows were unaffected. That made this a pool-specific capacity incident rather than a platform-wide Actions outage, even though affected public Ubuntu-24 workflows saw delayed starts for five hours.
GitHub fixed the backend cache configuration by 12:45 UTC and scaled the impacted pools to work through the backlog. Queueing impact ended at 14:45 UTC. A failover that leaves cache state inconsistent can pass availability checks while wasting capacity on duplicated work — the kind of failure that shows up as queue drift rather than error rates, and requires proving assignment uniqueness, not just API uptime.
Approximately 19.7% of Ubuntu-24 hosted runner jobs on public repos were delayed.// GitHub availability report, May 2025
Failover left runner assignment cache behavior inconsistent.
The immediate cause was a backend cache misconfiguration after a failover. The cache behavior led to duplicate job assignments in the public Ubuntu-24 standard hosted-runner pools, which reduced effective available capacity and delayed job starts.
The deeper cause was that failover validation did not catch assignment uniqueness and capacity effects in the runner scheduler. The system could keep accepting work, but the pool's useful throughput fell because some assignment work was duplicated instead of consuming distinct queued jobs.
The failure reduced capacity rather than stopping service.
Actions was not completely unavailable. Jobs queued and eventually ran, which can make detection and prioritization harder than a clean error-rate spike.
02
Impact was concentrated in one runner pool.
The incident affected public repositories using Ubuntu-24 standard hosted runners. Other hosted runners, self-hosted runners, and private repository workflows were unaffected, so aggregate platform signals could understate the pool-specific backlog.
03
Backlog needed extra capacity after the fix.
Correcting cache behavior stopped new duplicate assignments, but queued work still had to drain. GitHub scaled up the impacted pools to restore normal queue times.
Test scheduler invariants after failover.Failover validation should include uniqueness, lease ownership, duplicate assignment rate, and useful throughput. A scheduler can be up while wasting capacity.
02
Monitor queue health per runner pool.Hosted CI capacity is segmented by image, visibility, size, and region. Alert on queue time, assignment duplication, and idle capacity at the same granularity users experience.
03
Plan for backlog drain as part of mitigation.Fixing the root cause does not restore user experience until the queue drains. Keep runbooks for temporary pool scaling and admission control during recovery.