The Encryption Path Under Slack Messages.
The Slack EKM incident was not a general chat outage. It exposed a narrower dependency: encryption-key traffic, cache pressure, background work, and a third-party provider all sat on the path customers needed to send messages and load channels.
Most users experience encryption as a promise about data protection. In production, it can also be a dependency under the send button. Slack's May 2026 EKM incident made that dependency visible: customers using Enterprise Key Management could have trouble sending messages, loading channels, receiving notifications, using workflows, opening DMs, and working with files because the encryption-key path was under pressure.
The first phase looked small. On May 8, Slack reported that EKM customers saw message sending and channel-loading issues between about 10:55 AM and 11:10 AM PDT. Error rates returned to normal, but Slack kept investigating. Less than an hour later, at 11:44 AM PDT, the problem returned as elevated error rates and latency for the same customer segment.
Slack did not point to one clean internal bug. Its updates show two response tracks moving in parallel: engineers suspected an upstream provider issue while reviewing Slack's own cache configuration. A provider can be part of the failure, but the application still controls how much traffic reaches that provider, how much work the cache absorbs, and which classes of work keep running when the path gets hot.
The cache became one of the main controls. Slack deployed a change that improved cache hit ratio and later reported recovery in health metrics. For a normal read path, a cache miss is often a performance cost. For an EKM path, a miss can mean another request into key-management infrastructure or a third-party dependency. Enough misses can turn a dependency that usually stays hidden into a user-visible bottleneck.
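The relationship between cache capacity and upstream traffic can be sketched in a few lines. This is a minimal illustration, not Slack's implementation: the `KeyCache` class, the `fetch_key` callback, and the `upstream_calls` helper are hypothetical names, and an LRU eviction policy is assumed only because it is simple to show.

```python
from collections import OrderedDict
from typing import Callable


class KeyCache:
    """LRU cache in front of a key-management call; counts hits and misses."""

    def __init__(self, fetch_key: Callable[[str], bytes], capacity: int) -> None:
        self._fetch_key = fetch_key   # hypothetical call into the key-management path
        self._capacity = capacity
        self._entries = OrderedDict()  # key_id -> key material, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, key_id: str) -> bytes:
        if key_id in self._entries:
            self._entries.move_to_end(key_id)  # mark as most recently used
            self.hits += 1
            return self._entries[key_id]
        self.misses += 1                       # each miss is one upstream request
        value = self._fetch_key(key_id)
        self._entries[key_id] = value
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used key
        return value

    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def upstream_calls(capacity: int, requests: list[str]) -> int:
    """Replay a request sequence and return how many requests reached upstream."""
    cache = KeyCache(lambda key_id: key_id.encode(), capacity)
    for key_id in requests:
        cache.get(key_id)
    return cache.misses
```

Replaying the same request sequence with a cache smaller than the working set sends far more requests upstream than replaying it with a cache that fits, which is the effect a capacity increase is meant to reverse.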
The May 11 recurrence made the pressure source clearer. Slack later summarized the renewed impact as elevated latency and errors caused by elevated encryption-key request load. The mitigation list tells the story: Slack paused EKM backfills, evaluated pod scaling, assessed a high-volume load test workload contributing to EKGen traffic, re-engaged third-party infrastructure teams, reduced KMS request rate by approximately 50%, and increased cache capacity.
Those actions show what was actually competing for capacity. Live customer actions needed key-management headroom. Backfills consumed it. A load test contributed traffic. Cache misses passed more work through to the constrained dependency. And the provider boundary added another place where Slack needed coordination instead of direct control.
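One way to keep those classes of work from starving each other is to reserve part of the upstream request budget for foreground traffic. The sketch below is an assumption-laden illustration, not a description of Slack's systems: `KeyRequestBudget`, the work-class names, and the per-interval counting model are all hypothetical.

```python
from dataclasses import dataclass, field

FOREGROUND = "foreground"  # live customer actions such as message sends
BACKFILL = "backfill"      # background re-encryption or migration work
LOAD_TEST = "load_test"    # synthetic test traffic


@dataclass
class KeyRequestBudget:
    """Per-interval budget of upstream key requests with a foreground reserve."""

    total: int               # hypothetical upstream requests allowed per interval
    foreground_reserve: int  # portion that only foreground work may consume
    used: dict = field(
        default_factory=lambda: {FOREGROUND: 0, BACKFILL: 0, LOAD_TEST: 0}
    )

    def try_admit(self, work_class: str) -> bool:
        spent = sum(self.used.values())
        if spent >= self.total:
            return False  # the whole budget is gone; shed everything
        if work_class != FOREGROUND:
            # Background and test work share only the unreserved portion.
            background_spent = spent - self.used[FOREGROUND]
            if background_spent >= self.total - self.foreground_reserve:
                return False
        self.used[work_class] += 1
        return True
```

With `total=10` and `foreground_reserve=6`, backfill and load-test work together can take at most four slots per interval, so foreground requests keep six even when background demand arrives first. Pausing backfills during an incident is the manual version of the same idea.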
Slack said customer impact stopped around 8:00 AM PDT on May 11, and the incident was under control at 9:16 AM PDT. The important recovery pattern was demand reduction, not only provider repair. Slack lowered the rate of KMS requests and added cache capacity, which gave the key-management path room to recover while teams continued investigating the underlying cause.
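The arithmetic behind demand reduction is worth making explicit. If demand stays constant, the upstream request rate is the incoming rate multiplied by the miss ratio, so small gains in hit ratio produce large cuts upstream. The functions and numbers below are illustrative assumptions; Slack did not publish its hit ratios, and its 50% reduction also involved pausing work rather than caching alone.

```python
def upstream_rate(request_rate: float, hit_ratio: float) -> float:
    """Requests per second that miss the cache and reach the key service."""
    return request_rate * (1.0 - hit_ratio)


def hit_ratio_for_reduction(hit_ratio: float, reduction: float) -> float:
    """Hit ratio needed to cut the upstream rate by `reduction` at constant demand."""
    return 1.0 - (1.0 - hit_ratio) * (1.0 - reduction)
```

Under these assumptions, a cache already at a 90% hit ratio needs to reach 95% to halve upstream traffic: the miss ratio drops from 10% to 5%, even though the hit ratio moved only five points.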
EKM placed the encryption-key path under the same workflows customers used every day: message sending, channel loading, and file operations. When that path became a bottleneck, the security boundary held, but the product boundary did not. Key retrieval, backfills, load tests, and cache misses all drew from the same shared headroom, and the public updates describe no priority or rate limit separating them before the mitigations were applied.
"reducing the rate of KMS requests by approximately 50% and increasing cache capacity" (Slack Status, May 8, 2026)
From the first signal to all-clear in 3h 21m.
The key-management path became part of Slack's serving path.
Enterprise Key Management ties Slack workspace data handling to customer-controlled encryption keys. For affected customers, that made the key-management path part of ordinary product workflows (message sends, channel loads, file operations), not only a security control running in the background. Elevated encryption-key request load was the pressure point the incident exposed.
The first phase on May 8 involved elevated error rates and latency for EKM customers. Slack investigated an upstream provider while also changing cache behavior, then reported improved cache hit ratio and later restoration after the provider implemented a fix. The public update does not name the provider or specify the exact provider-side fault.
The recurrence made the load path clearer. Slack paused EKM backfills, evaluated pod scaling, reviewed a high-volume load test workload contributing to EKGen traffic, reduced KMS request rate by approximately 50%, and increased cache capacity. Those mitigations point to a system where foreground work, background work, test traffic, cache misses, and provider calls could all compete for the same key-management headroom.