FM-019 · Slack · 2026-05-08 · impact 3h 21m · SEV-2

The Encryption-Key Request Load That Slowed Slack EKM

How cache hit ratio became a reliability boundary, why a provider investigation was not enough, and how load tests and backfills complicated recovery.

enterprise-key-management kms caching third-party-provider messaging latency

Based on 1 source

case study

Slack's EKM incident was not a plain messaging outage. It exposed how encryption-key request load could slow messaging, channel loading, workflows, notifications, DMs, activity feeds, and file operations for customers using Enterprise Key Management. The second phase began around 5:55 AM PDT on May 11 and was under control at 9:16 AM PDT.

Enterprise Key Management makes encryption-key availability part of the product path for protected Slack workspaces. For those customers, the key path had enough shared fate to affect several visible Slack surfaces at once. Slack did not publish the complete EKGen topology or the identity of the third-party provider.

The public mechanism is still useful because Slack named the pressure points. On May 8, Slack reviewed cache configuration while it investigated a possible upstream-provider issue. It then deployed a fix that improved cache hit ratio and observed health metrics recovering. That made the cache layer a reliability boundary, because miss behavior could increase pressure on the key-service path.
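The link between hit ratio and key-service pressure can be made concrete with simple arithmetic. The sketch below is illustrative only: the function name and every traffic number are assumptions, not figures Slack published.

```python
def miss_rate(requests_per_sec: float, hit_ratio: float) -> float:
    """Requests per second that miss the cache and reach the key service."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return requests_per_sec * (1.0 - hit_ratio)


# Hypothetical numbers: the same 10,000 req/s of traffic at two hit ratios.
healthy = miss_rate(10_000, 0.99)    # roughly 100 req/s reach the key service
degraded = miss_rate(10_000, 0.95)   # roughly 500 req/s reach the key service
amplification = degraded / healthy   # roughly 5x more downstream load
```

Under these assumed numbers, a four-point drop in hit ratio multiplies key-service traffic about fivefold, which is why a small cache regression can look like a provider problem from the outside.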

The May 8 incident appeared resolved after the upstream provider implemented a fix and Slack observed stable monitoring. Three days later, Slack described the May 11 event as a recurrence of a previous incident. That recurrence is the strongest signal that the first recovery had not removed every condition that could reproduce customer impact.

The responder choices on May 11 followed the evidence Slack disclosed publicly. Teams were assessing pod scaling and a high-volume load-test workload that was contributing to EKGen traffic. They paused EKM backfills as a precaution to reduce load. They also re-engaged third-party infrastructure teams while EKM and EKGen engineers worked from an active bridge.

The May 11 mitigation combined load reduction with more cache capacity. Slack reduced the rate of KMS requests by approximately 50% and increased cache capacity. By 9:38 AM PDT, Slack said the incident had been stable for more than an hour with no new customer-facing impact detected.
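Slack did not say how it reduced the KMS request rate. One common way to make such a reduction an operator switch is a token-bucket limiter in front of the outbound calls. The following is a minimal sketch under that assumption; `TokenBucket`, `set_rate`, and the rates used are hypothetical.

```python
class TokenBucket:
    """Caps the rate of outbound key-service calls (hypothetical limiter)."""

    def __init__(self, rate_per_sec: float, burst: float) -> None:
        self.rate_per_sec = rate_per_sec
        self.burst = burst
        self.tokens = burst
        self.last = 0.0

    def set_rate(self, rate_per_sec: float) -> None:
        """Operator switch: change the refill rate during an incident."""
        self.rate_per_sec = rate_per_sec

    def allow(self, now: float) -> bool:
        """Return True if one request may proceed at time `now` (seconds)."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def admitted(bucket: TokenBucket, start: float, seconds: int,
             attempts_per_sec: int) -> int:
    """Count requests admitted when callers attempt at a steady rate."""
    count = 0
    for i in range(seconds * attempts_per_sec):
        if bucket.allow(start + i / attempts_per_sec):
            count += 1
    return count


# Callers attempt 1,000 req/s throughout; only the limit changes.
bucket = TokenBucket(rate_per_sec=100.0, burst=10.0)
before = admitted(bucket, start=0.0, seconds=10, attempts_per_sec=1000)
bucket.set_rate(50.0)  # the mitigation switch: halve the refill rate
after = admitted(bucket, start=10.0, seconds=10, attempts_per_sec=1000)
```

The clock is passed in explicitly so the behavior can be tested without sleeping. In this sketch, halving the refill rate roughly halves the admitted requests even though offered load is unchanged.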

The diagnostic gap matters as much as the mitigation. Slack's public updates show symptoms, provider investigation, cache work, additional load, and KMS request reduction. They do not show the internal metric or alert that separated those hypotheses.

Slack's final public update said fixes had been deployed for the root cause of elevated encryption-key request load. It did not publish a detailed remediation plan, provider identity, or durable architecture changes. The useful takeaway is narrower: key-request load, cache behavior, and provider dependency had to be understood together.

The transferable failure pattern is a security dependency becoming a shared capacity dependency. The warning signs are cache misses that reach scarce operations, background work sharing foreground capacity, and no customer-side workaround. The review question is whether the system can shed noncritical key traffic before customers experience the dependency directly.

timeline · PDT

The 3h 21m impact window covers the May 11 phase, from about 5:55 AM PDT to 9:16 AM PDT.

May 8, 10:55 AM PDT

Initial EKM impact begins

EKM customers may have experienced message-sending and channel-loading issues until about 11:10 AM PDT.

May 8, 12:20 PM PDT

Provider and cache paths enter the investigation

Slack said a possible upstream-provider issue might be affecting services while it reviewed cache configuration.

May 8, 12:54 PM PDT

Cache hit ratio improves

Slack deployed a fix that improved cache hit ratio and saw recovery in health metrics.

May 8, 4:27 PM PDT

May 8 phase resolves

Slack said the upstream provider had implemented a fix and all services had been restored after stable monitoring.

May 11, 7:22 AM PDT

Recurrence and load-test signal

Slack described a recurrence and said teams were assessing a high-volume load-test workload contributing to EKGen traffic.

May 11, 9:16 AM PDT

Impact is under control

Slack said impact was mitigated and the incident was under control at 9:16 AM PDT.

May 11, 2:48 PM PDT

Root-cause fixes are disclosed

Slack said it had deployed fixes addressing the root cause of elevated encryption-key request load.

lessons

What to take away.

Treat cache hit rate as a capacity and dependency-protection signal when misses drive scarce or vendor-backed operations.

The incident tied customer impact to cache configuration, cache hit ratio, KMS request volume, and cache capacity. The portable practice is to monitor and alert on the miss path's downstream pressure, not only cache latency. The tradeoff is that higher cache capacity or stricter miss controls can increase staleness, memory cost, and operational complexity.
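One way to express "alert on the miss path's downstream pressure" is to compare miss traffic against the capacity of the dependency behind the cache. This is a minimal sketch; the field names, the 0.7 threshold, and the capacity figure are assumptions, not values from Slack's systems.

```python
from dataclasses import dataclass


@dataclass
class CacheWindow:
    """Counters for one monitoring window (all field names are hypothetical)."""
    hits: int
    misses: int
    window_seconds: float


def miss_pressure(window: CacheWindow, downstream_capacity_per_sec: float) -> float:
    """Fraction of downstream capacity consumed by cache misses."""
    if window.window_seconds <= 0 or downstream_capacity_per_sec <= 0:
        raise ValueError("window_seconds and capacity must be positive")
    return (window.misses / window.window_seconds) / downstream_capacity_per_sec


def should_alert(window: CacheWindow, downstream_capacity_per_sec: float,
                 threshold: float = 0.7) -> bool:
    """Alert when misses alone use at least `threshold` of capacity."""
    return miss_pressure(window, downstream_capacity_per_sec) >= threshold


# A 97% hit ratio still produces 800 misses/s in this busy window.
busy = CacheWindow(hits=1_552_000, misses=48_000, window_seconds=60)
quiet = CacheWindow(hits=194_000, misses=6_000, window_seconds=60)
busy_alert = should_alert(busy, downstream_capacity_per_sec=1000)
quiet_alert = should_alert(quiet, downstream_capacity_per_sec=1000)
```

Both windows have the same 97% hit ratio, but only the busy one crosses the threshold. A hit-ratio dashboard alone would not distinguish them.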

semantic_correctness_monitoring operational_resilience

Keep high-volume tests and backfills visibly isolated from critical key-service paths, especially during incident recovery.

Slack assessed a high-volume load-test workload contributing to EKGen traffic and paused EKM backfills to reduce load. The lesson applies when synthetic or background work shares capacity with customer-facing security or identity paths. The tradeoff is slower test feedback and delayed maintenance work when isolation or throttling is strict.
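Isolation of this kind is often implemented as a bulkhead: each traffic class gets its own concurrency budget against the shared dependency. The sketch below assumes three hypothetical traffic classes and limits; it is single-threaded for clarity and would need a lock in a concurrent service.

```python
class Bulkhead:
    """Per-class concurrency budgets so background work cannot take foreground slots."""

    def __init__(self, limits: dict[str, int]) -> None:
        self.limits = dict(limits)
        self.in_flight = {name: 0 for name in limits}

    def try_acquire(self, traffic_class: str) -> bool:
        """Reserve one slot for the class, or refuse if its budget is used up."""
        if self.in_flight[traffic_class] >= self.limits[traffic_class]:
            return False
        self.in_flight[traffic_class] += 1
        return True

    def release(self, traffic_class: str) -> None:
        """Return one slot when a call to the dependency finishes."""
        if self.in_flight[traffic_class] > 0:
            self.in_flight[traffic_class] -= 1

    def set_limit(self, traffic_class: str, limit: int) -> None:
        """Operator switch, for example setting a class to 0 during recovery."""
        self.limits[traffic_class] = limit


# Hypothetical budgets: the load test can never hold more than 5 slots.
bulkhead = Bulkhead({"foreground": 80, "backfill": 15, "load_test": 5})
load_test_granted = sum(bulkhead.try_acquire("load_test") for _ in range(50))
foreground_ok = bulkhead.try_acquire("foreground")
bulkhead.set_limit("backfill", 0)   # pause backfills as a precaution
backfill_ok = bulkhead.try_acquire("backfill")
```

Because each class has its own counter, a saturated load test is visible as such and does not consume the foreground budget.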

recovery_testing behavior_level_validation

Predefine which background work can be shed first when a shared dependency starts serving customer-facing traffic poorly.

Slack paused EKM backfills and reduced additional load during the recurrence. That suggests a useful operating pattern: name shed-able work in advance, test the switches, and define who can pull them. The tradeoff is accepting deferred consistency, delayed batch completion, or stale derived state while foreground traffic is protected.
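A predefined shed plan can be as simple as a list of workloads with a shed order and an owner. The sketch below is an assumption about how such a plan might look; the workload names, rates, priorities, and owners are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Workload:
    name: str
    requests_per_sec: float
    shed_priority: Optional[int]  # lower number sheds first; None = never shed
    owner: str                    # who is allowed to pull the switch
    paused: bool = False


def shed(workloads: list[Workload], target_reduction: float) -> list[str]:
    """Pause sheddable workloads in priority order until the target is met."""
    paused_names: list[str] = []
    freed = 0.0
    candidates = sorted(
        (w for w in workloads if w.shed_priority is not None and not w.paused),
        key=lambda w: w.shed_priority,
    )
    for workload in candidates:
        if freed >= target_reduction:
            break
        workload.paused = True
        freed += workload.requests_per_sec
        paused_names.append(workload.name)
    return paused_names


plan = [
    Workload("foreground", 1000.0, None, "on-call lead"),
    Workload("backfill", 400.0, 2, "data platform on-call"),
    Workload("load_test", 300.0, 1, "performance team"),
]
paused_now = shed(plan, target_reduction=500.0)
```

With these assumed rates, a 500 req/s target pauses the load test first and then the backfill, and foreground traffic is never a candidate.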

blast_radius_control recovery_planning

Model third-party dependency trouble as a local capacity problem too, because retries, cache misses, and mitigation traffic can move the bottleneck back inside your system.

Slack investigated with an upstream provider while also reviewing cache configuration, reducing KMS requests, increasing cache capacity, and reducing local load. The lesson applies to systems where external dependency health and internal traffic controls are coupled. The tradeoff is more complex incident models and runbooks, but the payoff is avoiding a single-provider-blame diagnosis.
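Retry amplification is one way a provider problem becomes a local capacity problem. The sketch below is a simplified model that assumes independent failures and immediate retries; the function name and the numbers are illustrative, and nothing here describes Slack's actual retry policy.

```python
def expected_attempts(failure_prob: float, max_retries: int) -> float:
    """Mean attempts per logical request when every failed attempt is retried.

    Attempt k (counting from 0) happens only if the previous k attempts
    all failed, so the mean is the sum of failure_prob**k for k = 0..max_retries.
    """
    if not 0.0 <= failure_prob <= 1.0:
        raise ValueError("failure_prob must be between 0 and 1")
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    return sum(failure_prob ** k for k in range(max_retries + 1))


# Hypothetical: 200 logical requests/s, up to 3 retries each.
healthy_load = 200 * expected_attempts(0.0, 3)    # no failures, no extra load
degraded_load = 200 * expected_attempts(0.5, 3)   # half of attempts fail
```

In this model, a dependency that fails half of its attempts receives 1.875 times the healthy request rate, so the degraded provider sees more traffic at exactly the moment it can handle less.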

vendor_blast_radius_audit dependency_modeling

For security-critical dependencies, design what customer-visible degraded mode means before the dependency fails.

Slack reported no known workaround while EKM customers saw messaging, channel loading, workflow, file, and notification impact. The lesson is not to bypass encryption controls; it is to decide in advance which operations can queue, read stale metadata, fail closed, or present clearer customer status. The tradeoff is product complexity and the risk of creating unsafe fallback paths.
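Deciding degraded behavior in advance can take the form of an explicit policy table that defaults to failing closed. The sketch below is a placeholder for that decision, not Slack's design: the operation names and the mapping are hypothetical, and each product would need its own security review of every entry.

```python
from enum import Enum


class DegradedBehavior(Enum):
    QUEUE = "queue"                          # accept now, process when keys return
    SERVE_STALE_METADATA = "stale_metadata"  # show cached non-sensitive metadata
    FAIL_CLOSED = "fail_closed"              # refuse and show clear status


# Hypothetical policy: every entry here is an assumption for illustration.
DEGRADED_POLICY = {
    "send_notification": DegradedBehavior.QUEUE,
    "load_channel_list": DegradedBehavior.SERVE_STALE_METADATA,
    "read_message_body": DegradedBehavior.FAIL_CLOSED,
    "download_file": DegradedBehavior.FAIL_CLOSED,
}


def degraded_behavior(operation: str) -> DegradedBehavior:
    """Look up the agreed behavior; anything unlisted fails closed."""
    return DEGRADED_POLICY.get(operation, DegradedBehavior.FAIL_CLOSED)
```

Defaulting unlisted operations to `FAIL_CLOSED` keeps a newly added feature from silently gaining an unsafe fallback path.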

customer_visibility blast_radius_control

sources

Read the sources.

EKM customers were previously experiencing issues with channel loading and message delivery

Slack Status ↗
