FM-011 · Slack · 2022-02-22 · impact ~5h · SEV-2

A Consul agent restart empties Slack's cache.

How a routine 25% Consul rollout — the third in a series — deregistered memcached nodes that the cache control plane then replaced with empty ones, why the cache miss storm landed hardest on client boot requests, and why fixing it took throttling, query rewrites, and a switch to replicas.

cache · service-discovery · deploy

Slack's cache control plane needed to know the difference between "gone" and "restarting." Consul told application servers where memcached nodes lived. Mcrib, Slack's cache control plane, watched that service catalog and promoted spare cache nodes when an active node disappeared. Spare nodes joined empty. If a memcached node left because it had failed, replacement was correct. If it left for a few seconds during an agent restart, Mcrib was swapping a healthy warm cache for an empty one, and Slack's database path was about to find out how much traffic the cache normally absorbed.
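A minimal sketch of that promote-on-disappearance loop, in Python with hypothetical names throughout (Slack has not published Mcrib's internals): the only input is membership in the catalog, so a five-second agent restart and a dead host are indistinguishable from inside this function.

```python
# Hypothetical sketch of a Mcrib-style reconcile loop. Names and structure
# are illustrative; Slack has not published Mcrib's implementation.

def reconcile(catalog: set[str], active: set[str], spares: list[str]) -> None:
    """Replace any active memcached node that is absent from the catalog."""
    for node in active - catalog:
        # The only available signal is "node left the catalog". A permanent
        # loss and a five-second agent restart look identical from here.
        spare = spares.pop()      # spares join with empty memory
        active.discard(node)
        active.add(spare)
        print(f"promoted cold spare {spare} to replace {node}")

# During a rolling agent restart, each memcached host briefly vanishes:
active, spares = {"mc-1", "mc-2"}, ["spare-1", "spare-2"]
reconcile(catalog={"mc-2"}, active=active, spares=spares)  # mc-1 is restarting
```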

Slack was rolling out a new Consul agent across its fleet using a percentage-based rollout. The rollout upgraded the binary on 25% of hosts at a time, then performed sequential agent restarts so each host briefly left the Consul catalog and rejoined it. Two prior 25% steps had completed in earlier weeks without incident. On the morning of February 22, 2022, the third step ran on a slice of the fleet that happened to include a substantial number of memcached nodes.
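As a rough sketch of the mechanics (hosts, roles, and the restart step below are all hypothetical stand-ins), a percentage rollout walks a fixed fraction of the fleet and restarts each agent in turn; nothing in the slicing logic knows or cares which roles land in a given step.

```python
# Illustrative percentage rollout with sequential agent restarts.
from collections import Counter

def slice_for_step(fleet: list[str], step: int, fraction: float = 0.25) -> list[str]:
    n = int(len(fleet) * fraction)
    return fleet[step * n:(step + 1) * n]

def run_step(fleet: list[str], roles: dict[str, str], step: int) -> None:
    hosts = slice_for_step(fleet, step)
    # The slice is defined by count, not by role; a step can happen to be
    # memcached-heavy without anything flagging it in advance.
    print(f"step {step}: {Counter(roles[h] for h in hosts)}")
    for host in hosts:
        ...  # upgrade the binary, then restart the agent: the host briefly
             # leaves the Consul catalog here and rejoins seconds later
```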

As the agent restarted on each affected memcached host, the host left the Consul catalog. Mcrib saw the disappearance, promoted a spare cache node, and put an empty replacement in service. By the time the rollout had walked through enough memcached hosts, a meaningful portion of the cache had been quietly replaced with empty memory. The cache hit rate fell sharply. Client boot operations — which ran a scatter query across a user-sharded keyspace — started to miss in cache and hit the database at full fan-out. Database load climbed to a level that prevented cache fills from succeeding, and the system entered the self-reinforcing state where the cache could not warm up because the database could not serve the queries that would have warmed it.
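The shape of the boot-path read, sketched below with invented names (Slack has not published the query itself): a read-through scatter across user shards, where every cache miss becomes a database read and the fill itself consumes database capacity that is already gone.

```python
# Illustrative read-through scatter read across a user-sharded keyspace.
# Store is a stand-in for either memcached (get/set) or a database shard (query).

class Store:
    def __init__(self):
        self.data = {}
    def get(self, key):
        return self.data.get(key)
    def set(self, key, value):
        self.data[key] = value
    def query(self, key):
        return f"row-for-{key}"  # pretend database read

def boot_scatter_read(user_ids, cache: Store, shards: list[Store]) -> dict:
    results = {}
    for uid in user_ids:
        row = cache.get(uid)
        if row is None:
            # Cold cache: every one of these misses fans out to a shard,
            # and the fill needs database capacity to succeed at all.
            row = shards[hash(uid) % len(shards)].query(uid)
            cache.set(uid, row)
        results[uid] = row
    return results
```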

The user-visible split was sharp. Customers with already-booted clients saw mostly normal service once the team had throttled new boots and quieted the database. Customers booting from cold could not connect, because the scatter query that ran during boot was exactly the path the database could no longer serve under cold-cache load. Mitigation came in three pieces: throttle the operation that was driving the scatter queries, rewrite the offending query to read only the missing keys and to use replicas as well as primaries, and then relax the throttle in small steps so the cache could refill without re-saturating the database.
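A sketch of the mitigation's shape under stated assumptions (the semaphore-based throttle, the replica routing, and every name are illustrative; the writeup describes the what, not this code): read the cache first, query only the missing keys, send those reads to replicas, and bound how many boots run at once.

```python
# Illustrative post-fix boot read: misses only, from replicas, under a throttle.
import threading

boot_throttle = threading.BoundedSemaphore(value=64)  # invented limit

def boot_read_after_fix(user_ids, cache, replicas) -> dict:
    found = {}
    for uid in user_ids:                      # cheap pass: cache only
        row = cache.get(uid)
        if row is not None:
            found[uid] = row
    missing = [uid for uid in user_ids if uid not in found]
    with boot_throttle:                       # bounded concurrent boots
        for uid in missing:                   # narrowed query: misses only
            replica = replicas[hash(uid) % len(replicas)]
            row = replica.query(uid)          # replicas absorb the fill load
            cache.set(uid, row)
            found[uid] = row
    return found
```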

The cache going cold was not the whole failure. Caches go cold. The sharper failure was that service discovery and cache management had no shared state for "this node is restarting briefly", so a planned agent bounce looked identical to a permanent loss. Mcrib did the right thing for the wrong situation, repeatedly, and the database design was implicitly funded by a cache that was being quietly replaced with empty boxes.

"When the agent restart occurs on a memcached node, the node that leaves the service catalog gets replaced by Mcrib. The new cache node will be empty."
(Slack, "Slack's incident on 2-22-22")

From the first signal to all-clear in ~5h.

Earlier weeks
Consul agent upgrade rolling out at 25% per step
Slack uses Consul for service discovery and is upgrading the agent across its fleet using a percentage-based rollout. Two prior 25% steps complete without incident in earlier weeks. The third 25% step is scheduled for February 22.
Feb 22, ~05:30 PT
Third 25% Consul rollout begins
The rollout updates the Consul binary on 25% of the fleet and then performs sequential agent restarts on those hosts. Sequential restarts mean each host briefly leaves the Consul service catalog and rejoins it.
Feb 22, 06:00 PT
Connection problem tickets begin
Users start opening tickets reporting they cannot connect to Slack. The pattern is uneven: users with already-booted clients are fine; users trying to boot fresh clients see failures.
Feb 22, ~06:15 PT
Cache hit rate collapses
The Consul agent restarts deregister memcached nodes from the service catalog as each host goes through the upgrade. Slack's cache control plane (Mcrib) promotes spare cache nodes to replace the missing ones. Spare nodes come up empty, so the cache hit rate drops sharply.
Feb 22, ~06:30 PT
Database overwhelmed by cache-miss scatter queries
With the cache cold, scatter queries across a user-sharded keyspace fall through to the database. Database load climbs to a level that prevents the cache from being refilled — the work that would warm the cache is the same work the database cannot keep up with.
Feb 22, ~07:00 PT
Client boot requests throttled to relieve database
Slack throttles client boot operations to reduce database load. Users with already-booted clients keep working in degraded fashion; users trying to boot are deferred. The team begins working on the offending scatter query.
Feb 22, ~08:30 PT
Scatter query rewritten to read only cache misses
Engineers rewrite the problematic scatter query so that it reads only the data missing from the cache, and update it to read from replicas as well as primaries. The narrower query, spread across replicas, is something the database can absorb.
Feb 22, ~10:00 PT
Throttle gradually relaxed; cache refills
With database load down, Slack relaxes the boot throttle in small increments. The cache refills as boot traffic flows back through. Hit rates recover; database load returns to baseline.
Feb 22, ~11:00 PT
Service returns to normal for booting users
Connection success rates return to baseline. The remaining work focuses on understanding why the cache replacement step came up empty and how to avoid it in future rollouts.

A cache control plane that did exactly what it was told.

Slack used Consul as the service catalog that told its applications where memcached nodes lived, and it used a separate cache control plane called Mcrib to keep the memcached pool the right shape. When a memcached node left the Consul catalog — for any reason — Mcrib's job was to promote a spare cache node into its place. Spare cache nodes joined the catalog with empty memory. The replacement was correct behavior for "a cache node is permanently gone"; it was the wrong behavior for "a cache node is being restarted for a few seconds during a rolling upgrade."

Slack was in the middle of a percentage-based rollout of a new Consul agent. Two prior 25% steps had completed in earlier weeks without incident. On the morning of February 22, the third 25% step ran on a slice of the fleet that happened to include memcached nodes. The rollout deregistered each memcached node from the Consul catalog while restarting its agent. Mcrib treated those deregistrations as cache nodes leaving, promoted empty spares in their place, and the working cache was suddenly cold across a substantial part of the keyspace.

The cache miss storm landed on a scatter query across a user-sharded keyspace, the kind of query that client boot operations run. Cold cache plus scatter query plus production traffic put more load on the database than it could serve while also refilling the cache. Mitigations had to reduce the load (throttle client boots), narrow the query (read only the missing keys, from replicas as well as primaries), and let the cache refill gradually. The incident resolved once enough of the cache was warm again that ordinary traffic could keep it warm.

What turned a 25% rollout into a five-hour boot failure.

01
Cache control plane could not distinguish 'restarting' from 'gone'
Mcrib treated any cache node leaving the Consul catalog as a node that had been lost permanently and promoted a spare in its place. There was no signal — heartbeat grace period, drain marker, or restart flag — that would have let Mcrib wait a few seconds for an agent restart to complete rather than starting from a cold spare.
02
Sequential Consul agent restarts touched memcached nodes
The Consul rollout's sequential agent restarts deregistered each affected host from the service catalog as it went. Because memcached nodes were part of the fleet being upgraded, each one took a turn leaving and rejoining the catalog. The rollout had run twice before without trouble, which gave no warning that this slice would include the cache.
03
Scatter query design amplified cache misses into database load
Slack's client boot path ran a scatter query across a user-sharded keyspace. With a warm cache, most of those reads hit memory. With a cold cache, the same query fanned out to every shard's database. The cost of the query was implicitly funded by the cache, and the database could not absorb the same query without that funding.
04
Cache-fill traffic could not get through while the database was hot
Once the database was saturated, queries that would have refilled the cache were the same queries the database could not serve. The system entered a self-reinforcing state: lower hit rate → higher database load → harder to refill the cache. Breaking the loop required reducing demand (throttling boots) before the cache could repopulate; a toy model of the loop follows this list.
05
Previous successful rollouts created a false sense of safety
Two earlier 25% steps had completed without trouble, so the third was treated as routine. The composition of the fleet under each slice was different, though, and only the third happened to include enough memcached hosts to trigger Mcrib's failure mode at scale. Percentage rollouts hide the question of what is in each slice.
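The loop in 04 can be made concrete with a toy model (every constant below is invented; none of this is Slack's data): at full demand the hit rate stalls because fills fail, and once demand drops below database capacity, fills succeed and the hit rate climbs back.

```python
# Toy model of the cold-cache feedback loop. All numbers are illustrative.

def tick(hit_rate: float, demand: float, db_capacity: float) -> tuple[float, float]:
    db_load = demand * (1.0 - hit_rate)          # misses fall through to the DB
    fill_ok = min(db_load, db_capacity) / db_load if db_load else 1.0
    # Successful fills warm the cache; failed fills leave it cold.
    hit_rate = min(1.0, hit_rate + 0.1 * fill_ok - 0.05 * (1.0 - fill_ok))
    return hit_rate, db_load

hit = 0.2                                        # cache just went cold
for demand in [100, 100, 100, 40, 40, 40, 60, 80]:   # throttle applied at step 4
    hit, load = tick(hit, demand, db_capacity=30.0)
    print(f"demand={demand:5.0f}  db_load={load:6.1f}  hit_rate={hit:.2f}")
```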

What to take from this incident.

01
Service discovery and the cache control plane need a shared grace period.
A node leaving the service catalog for a few seconds during an agent restart should not look the same as a node that has been permanently lost. Heartbeat windows, drain markers, or restart flags can give the cache control plane the information it needs to wait through a planned bounce instead of starting from a cold replacement; a sketch follows this list.
02
Treat percentage rollouts as a sample of the fleet, not the fleet.
A 25% rollout is a 25% sample of the population. If a particular host type is rare, the first few steps may miss it entirely and then the next step hits a concentrated chunk. Plan rollouts so that each step includes a known mix of host roles, or run a separate rollout for the high-risk roles instead of trusting that the random slice will catch them.
03
Cache-funded queries should be measurable as cache-funded queries.
Some queries are cheap because the cache makes them cheap. The system should know which queries those are, so that a cold-cache event triggers protective behavior on those query paths — throttling, narrowing the query, or routing to replicas — before the database becomes the next thing to fail. A sketch of such a meter follows this list.
04
Be able to read from replicas under load, on demand.
When the primary database is saturated, the ability to redirect read traffic to replicas is a working pressure-relief valve. That redirection has to be possible at query level, not just at database level, so that specific hot queries can move without changing every other read in the system.
05
Throttle the cause, not just the symptom.
Throttling client boot operations reduced load on the database long enough for the cache to refill. That worked because boot was the source of the scatter queries, not just a victim of slow responses. Targeting throttles at the originating operation is more effective than throttling generic API traffic in front of it.
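One shape the grace period in lesson 01 could take, sketched under assumptions (the 30-second window and the structure are invented; the writeup does not specify Slack's remediation at this level): remember when a node first went missing, and only promote a spare once it has been gone longer than a planned restart could explain.

```python
# Sketch of a grace-period-aware reconcile loop. The structure and the
# 30-second window are assumptions, not Slack's published fix.
import time

GRACE_SECONDS = 30.0

def reconcile(catalog: set[str], active: set[str], spares: list[str],
              missing_since: dict[str, float]) -> None:
    now = time.monotonic()
    for node in list(active):
        if node in catalog:
            missing_since.pop(node, None)   # node came back; forget the timer
            continue
        missing_since.setdefault(node, now)
        if now - missing_since[node] >= GRACE_SECONDS:
            # Gone longer than a planned agent bounce: now pay the cold-spare cost.
            spare = spares.pop()
            active.discard(node)
            active.add(spare)
            missing_since.pop(node)
```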
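And a sketch of what "measurable as cache-funded" in lesson 03 could mean in practice (the class and thresholds are hypothetical): count, per query path, the fraction of reads the cache absorbs, and expose a cheap check that protective logic can key off when that funding disappears.

```python
# Hypothetical per-query-path cache-funding meter; thresholds are invented.
from collections import defaultdict

class CacheFunding:
    def __init__(self, cold_threshold: float = 0.5, min_samples: int = 100):
        self.hits = defaultdict(int)
        self.total = defaultdict(int)
        self.cold_threshold = cold_threshold
        self.min_samples = min_samples

    def record(self, query_path: str, hit: bool) -> None:
        self.total[query_path] += 1
        self.hits[query_path] += int(hit)

    def is_cold(self, query_path: str) -> bool:
        """True when this path has lost the cache funding it normally gets."""
        n = self.total[query_path]
        return n >= self.min_samples and self.hits[query_path] / n < self.cold_threshold

# A boot path might consult this before fanning out:
#   if funding.is_cold("client_boot_scatter"): throttle, narrow, or use replicas
```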

Read the original.

Slack's incident on 2-22-22
slack.engineering