The Encryption-Key Request Load That Slowed Slack EKM
How cache hit ratio became a reliability boundary, why a provider investigation was not enough, and how load tests and backfills complicated recovery.
Slack's EKM incident was not a plain messaging outage. It exposed how encryption-key request load could slow messaging, channel loading, workflows, notifications, DMs, activity feeds, and file operations for customers using Enterprise Key Management. The second phase began around 5:55 AM PDT on May 11 and was under control at 9:16 AM PDT.
Enterprise Key Management makes encryption-key availability part of the product path for protected Slack workspaces. For those customers, the key path had enough shared fate to affect several visible Slack surfaces at once. Slack did not publish the complete EKGen topology or the identity of the third-party provider.
The public mechanism is still useful because Slack named the pressure points. On May 8, Slack reviewed cache configuration while it investigated a possible upstream-provider issue. It then deployed a fix that improved cache hit ratio and observed health metrics recovering. That made the cache layer a reliability boundary, because miss behavior could increase pressure on the key-service path.
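The link between hit ratio and key-service pressure can be sketched with simple arithmetic: only cache misses reach the key service, so upstream load scales with one minus the hit ratio. The request rate and hit ratios below are illustrative assumptions, not figures Slack published, and `upstream_rate` is a hypothetical helper.

```python
def upstream_rate(request_rate: float, hit_ratio: float) -> float:
    """Requests per second that miss the cache and reach the key service."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return request_rate * (1.0 - hit_ratio)


# Illustrative numbers only: 10,000 key lookups per second at three hit ratios.
for hit_ratio in (0.99, 0.95, 0.90):
    rate = upstream_rate(10_000, hit_ratio)
    print(f"hit ratio {hit_ratio:.2f}: about {rate:.0f} req/s reach the key service")
```

Under these assumed numbers, a drop from a 99% to a 95% hit ratio looks small on a dashboard but multiplies key-service traffic by five, which is why a modest cache regression can behave like a capacity event.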
The May 8 incident appeared resolved after the upstream provider implemented a fix and Slack observed stable monitoring. Three days later, Slack described the May 11 event as a recurrence of a previous incident. That recurrence is the strongest signal that the first recovery had not removed every condition that could reproduce customer impact.
The responder choices on May 11 followed the evidence Slack disclosed publicly. Teams were assessing pod scaling and a high-volume load-test workload that was contributing to EKGen traffic. They paused EKM backfills as a precaution to reduce load. They also re-engaged third-party infrastructure teams while EKM and EKGen engineers worked from an active bridge.
The May 11 mitigation combined load reduction with more cache capacity. Slack reduced the rate of KMS requests by approximately 50% and increased cache capacity. By 9:38 AM PDT, Slack said the incident had been stable for more than an hour with no new customer-facing impact detected.
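Slack did not publish how it cut the KMS request rate. One common way to cap a request rate is a token bucket, sketched below as a stand-in technique, not Slack's implementation. The class, the helper `admitted`, and all rates are hypothetical; the clock is passed in as integer milliseconds so the behavior is deterministic.

```python
class TokenBucket:
    """Token bucket driven by an injected millisecond clock (hypothetical sketch)."""

    def __init__(self, rate_per_s: float, burst: float = 1.0, now_ms: int = 0):
        self.rate_per_s = rate_per_s
        self.burst = burst
        self.tokens = burst
        self.last_ms = now_ms

    def allow(self, now_ms: int) -> bool:
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed_ms = max(0, now_ms - self.last_ms)
        self.tokens = min(self.burst, self.tokens + elapsed_ms * self.rate_per_s / 1000.0)
        self.last_ms = now_ms
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def admitted(limit_per_s: float, offered_per_s: int, seconds: int) -> int:
    """Count requests let through when offered_per_s arrive evenly spaced."""
    if 1000 % offered_per_s != 0:
        raise ValueError("offered_per_s must divide 1000 in this sketch")
    interval_ms = 1000 // offered_per_s
    bucket = TokenBucket(limit_per_s)
    return sum(bucket.allow(i * interval_ms) for i in range(offered_per_s * seconds))


print(admitted(100, 100, 10))  # limit equal to the offered rate
print(admitted(50, 100, 10))   # limit cut by half
```

In this idealized run, halving the limit lets through half as many requests. The rejected half still has to go somewhere (queued, retried later, or served from cache), which is one reason a rate cut pairs naturally with more cache capacity.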
The diagnostic gap matters as much as the mitigation. Slack's public updates show symptoms, provider investigation, cache work, additional load, and KMS request reduction. They do not show the internal metric or alert that separated those hypotheses.
Slack's final public update said fixes had been deployed for the root cause of elevated encryption-key request load. It did not publish a detailed remediation plan, provider identity, or durable architecture changes. The useful takeaway is narrower: key-request load, cache behavior, and provider dependency had to be understood together.
The transferable failure pattern is a security dependency becoming a shared capacity dependency. The warning signs are cache misses that reach scarce operations, background work sharing foreground capacity, and no customer-side workaround. The review question is whether the system can shed noncritical key traffic before customers experience the dependency directly.
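The review question above can be made concrete with a priority-aware admission check. This is a minimal sketch under stated assumptions: the traffic classes, thresholds, and capacity figure are hypothetical and are not drawn from Slack's architecture.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    FOREGROUND = 0  # e.g. a user sending or loading messages
    BACKFILL = 1    # background key work that can wait
    LOAD_TEST = 2   # synthetic traffic


@dataclass
class Shedder:
    capacity_per_s: float           # assumed safe key-request rate
    shed_load_test_at: float = 0.6  # utilization at which load tests are dropped
    shed_backfill_at: float = 0.8   # utilization at which backfills are dropped

    def admit(self, priority: Priority, current_rate_per_s: float) -> bool:
        """Decide whether a key request of this class may proceed."""
        utilization = current_rate_per_s / self.capacity_per_s
        if priority == Priority.LOAD_TEST:
            return utilization < self.shed_load_test_at
        if priority == Priority.BACKFILL:
            return utilization < self.shed_backfill_at
        # Foreground traffic is always admitted in this sketch.
        return True


shedder = Shedder(capacity_per_s=1000)
for prio in Priority:
    print(prio.name, shedder.admit(prio, current_rate_per_s=850))
```

The design choice is that noncritical classes give up capacity first, automatically, instead of waiting for a responder to pause them during an incident. The sketch does not protect the key service from foreground overload on its own; that would need a separate limit.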
May 11 recurrence: 3h 21m from the first signal (around 5:55 AM PDT) to under control (9:16 AM PDT).
EKM customers may have experienced message-sending and channel-loading issues until about 11:10 AM PDT.
Slack said a possible upstream-provider issue might be affecting services while it reviewed cache configuration.
Slack deployed a fix that improved cache hit ratio and saw recovery in health metrics.
Slack said the upstream provider had implemented a fix and all services had been restored after stable monitoring.
Slack described a recurrence and said teams were assessing a high-volume load-test workload contributing to EKGen traffic.
Slack said impact was mitigated and the incident was under control at 9:16 AM PDT.
Slack said it had deployed fixes addressing the root cause of elevated encryption-key request load.