FM-019 · Slack · 2026-05-08 · impact 3h 21m · SEV-2

The Encryption Path Under Slack Messages.

The Slack EKM incident was not a general chat outage. It exposed a narrower dependency: encryption-key traffic, cache pressure, background work, and a third-party provider all sat on the path customers needed to send messages and load channels.

enterprise-key-management · kms · caching · third-party-provider · messaging · latency

Most users experience encryption as a promise about data protection. In production, it can also be a dependency under the send button. Slack's May 2026 EKM incident made that dependency visible: customers using Enterprise Key Management could have trouble sending messages, loading channels, receiving notifications, using workflows, opening DMs, and working with files because the encryption-key path was under pressure.

The first phase looked small. On May 8, Slack reported that EKM customers saw message sending and channel-loading issues between about 10:55 AM and 11:10 AM PDT. Error rates returned to normal, but Slack kept investigating. Less than an hour later, at 11:44 AM PDT, the problem returned as elevated error rates and latency for the same customer segment.

Slack did not point to one clean internal bug. Its updates show two response tracks moving in parallel: engineers suspected an upstream provider issue while reviewing Slack's own cache configuration. A provider can be part of the failure, but the application still controls how much traffic reaches that provider, how much work the cache absorbs, and which classes of work keep running when the path gets hot.

The cache became one of the main controls. Slack deployed a change that improved cache hit ratio and later reported recovery in health metrics. For a normal read path, a cache miss is often a performance cost. For an EKM path, a miss can mean another request into key-management infrastructure or a third-party dependency. Enough misses can turn a dependency that usually stays hidden into a user-visible bottleneck.
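The relationship between hit ratio and dependency load can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name `downstream_rps` and the traffic figures are assumptions, not numbers from Slack's report. It shows why a one-point drop in hit ratio can double the traffic that reaches a key service.

```python
def downstream_rps(lookup_rps: float, hit_ratio: float) -> float:
    """Requests per second that miss the cache and reach the key service."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return lookup_rps * (1.0 - hit_ratio)


# Hypothetical figures chosen for illustration, not taken from the incident.
lookups_per_second = 100_000.0
for ratio in (0.99, 0.98, 0.95):
    load = downstream_rps(lookups_per_second, ratio)
    print(f"hit ratio {ratio:.2f} -> {load:,.0f} key requests/s")
```

Under these assumed numbers, moving from a 99% to a 98% hit ratio doubles the key-service request rate even though the cache still looks healthy on a dashboard.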

The May 11 recurrence made the pressure source clearer. Slack later summarized the renewed impact as elevated latency and errors caused by elevated encryption-key request load. The mitigation list tells the story: Slack paused EKM backfills, evaluated pod scaling, assessed a high-volume load test workload contributing to EKGen traffic, re-engaged third-party infrastructure teams, reduced KMS request rate by approximately 50%, and increased cache capacity.

Those actions show what was actually competing for capacity. Live customer actions needed key-management headroom. Backfills consumed it. A load test contributed traffic. Cache misses passed more work through to the constrained dependency. And the provider boundary added another place where Slack needed coordination instead of direct control.

Slack said customer impact stopped around 8:00 AM PDT on May 11, and the incident was under control at 9:16 AM PDT. The important recovery pattern was demand reduction, not only provider repair. Slack lowered the rate of KMS requests and added cache capacity, which gave the key-management path room to recover while teams continued investigating the underlying cause.

EKM placed the encryption-key path under the same workflows customers used every day: message sending, channel loading, and file operations. When that path became a bottleneck, the security boundary held, but the product boundary did not. Key retrieval, backfills, load tests, and cache misses all drew from the same shared headroom, with no priority or rate limits separating them.

"reducing the rate of KMS requests by approximately 50% and increasing cache capacity" // Slack Status, May 8, 2026

From the first signal to all-clear in 3h 21m.

10:55 PDT May 8
EKM users see message and channel failures
Slack reports that between approximately 10:55 AM and 11:10 AM PDT, Enterprise Key Management customers may have experienced issues sending messages and loading some channels. Error rates returned to normal, but Slack continued investigating the cause.
11:44 PDT May 8
Elevated errors return
Slack reports renewed elevated error rates and latency for EKM customers. The affected experience includes sending messages and loading some channels, with no known workaround.
12:20 PDT May 8
Upstream provider and cache paths investigated
Slack says it believes an upstream provider issue may be affecting some services and is reviewing cache configuration in parallel to reduce impact.
12:54 PDT May 8
Slack improves cache hit ratio
Slack deploys a fix that improves cache hit ratio and observes recovery in health metrics while continuing to monitor the upstream provider.
16:27 PDT May 8
Provider fix restores service
Slack reports that its upstream provider implemented a fix, all services were restored after stable monitoring, and EKM customers should no longer experience access issues.
05:55 PDT May 11
EKM issue recurs
Slack later summarizes a recurrence beginning around 5:55 AM PDT on May 11, with elevated latency and errors across messaging, channel loading, workflows, and file operations for customers relying on EKM.
07:22 PDT May 11
Backfills paused and load test reviewed
Slack investigates a recurrence of the previous incident, pauses EKM backfills to reduce load, evaluates pod scaling, and assesses a high-volume load test workload contributing to EKGen traffic.
09:38 PDT May 11
KMS request rate reduced
Slack says mitigating actions reduced the rate of KMS requests by approximately 50% and increased cache capacity. The incident has been stable for over an hour with no new customer-facing impact detected.

The key-management path became part of Slack's serving path.

Enterprise Key Management tied Slack workspace data handling to customer-controlled encryption keys. For affected customers, that made the key-management path part of ordinary product workflows (message sends, channel loads, file operations), not only a security control running in the background. Elevated encryption-key request load was the pressure point the incident exposed.

The first phase on May 8 involved elevated error rates and latency for EKM customers. Slack investigated an upstream provider while also changing cache behavior, then reported improved cache hit ratio and later restoration after the provider implemented a fix. The public update does not name the provider or specify the exact provider-side fault.

The recurrence made the load path clearer. Slack paused EKM backfills, evaluated pod scaling, reviewed a high-volume load test workload contributing to EKGen traffic, reduced KMS request rate by approximately 50%, and increased cache capacity. Those mitigations point to a system where foreground work, background work, test traffic, cache misses, and provider calls could all compete for the same key-management headroom.

What made an encryption-key issue user-visible.

01
EKM was on interactive product paths
The impact was not limited to an administrative encryption console. Slack listed message sending, channel loading, notifications, workflows, DMs, activity feeds, and file operations. For EKM customers, key-management latency showed up as Slack product latency.
02
Caching was a primary control
Slack reviewed cache configuration during the first phase, deployed a fix that improved cache hit ratio, and later increased cache capacity. When cache misses call into a constrained key-management path, the cache becomes a reliability boundary, not just an optimization.
03
Background work competed with foreground traffic
During the recurrence, Slack paused EKM backfills to reduce load. Backfills are usually operationally necessary, but they can become harmful when they draw from the same request budget needed by live user actions.
04
Synthetic load can look like production demand
Slack said teams were assessing a high-volume load test workload contributing to EKGen traffic. Load tests are useful only when they are isolated, rate-limited, and visible enough that responders can separate test pressure from customer traffic during an incident.
05
The provider boundary complicated diagnosis
Slack engaged a third-party provider in both phases of the incident. Provider involvement can be the real root cause, a partial cause, or only a constrained dependency exposed by internal traffic. The response had to move on both fronts: provider coordination and Slack-side load reduction.
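Several of the factors above come down to one question during an incident: which class of traffic is actually reaching the key service? A minimal sketch of that attribution follows, assuming every key request carries a traffic-class label. The `KeyRequest` type, the class names, and the sample counts are all hypothetical; nothing here describes Slack's real telemetry.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyRequest:
    traffic_class: str  # assumed labels: "interactive", "backfill", "load_test"
    cache_hit: bool


def provider_load_by_class(requests):
    """Count cache misses per traffic class; only misses reach the key service."""
    misses = Counter()
    for request in requests:
        if not request.cache_hit:
            misses[request.traffic_class] += 1
    return misses


def non_interactive_share(misses):
    """Fraction of key-service calls that did not come from live user actions."""
    total = sum(misses.values())
    if total == 0:
        return 0.0
    return (total - misses["interactive"]) / total


# Hypothetical sample: most lookups are interactive, but most misses are not.
sample = (
    [KeyRequest("interactive", True)] * 90
    + [KeyRequest("interactive", False)] * 10
    + [KeyRequest("backfill", False)] * 25
    + [KeyRequest("load_test", False)] * 15
)
misses = provider_load_by_class(sample)
print(dict(misses), f"non-interactive share: {non_interactive_share(misses):.0%}")
```

With labels like these in place, responders could see at a glance whether pausing a backfill or stopping a test would free meaningful headroom, instead of inferring it from aggregate request rates.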

What to take from this incident.

01
Map security dependencies onto availability paths.
Security features often become runtime dependencies. For EKM, the key-management path was close enough to messaging and channel loading that latency and errors there showed up as Slack product failures. Treat key retrieval, encryption metadata, cache fill, and provider calls as part of the serving path, not only compliance infrastructure.
02
Give key-management traffic explicit budgets.
Separate interactive requests, backfills, retries, and load tests with different rate limits and priorities. When the key path is under pressure, the system should shed or pause background work before customer actions start timing out.
03
Measure cache misses as dependency load.
A cache hit ratio is not just a performance number when cache misses call a constrained key service or provider. Alert on miss rate, provider request rate, request fanout per user action, and cache capacity pressure together so responders can see whether the cache is protecting the dependency or amplifying it.
04
Make load tests incident-aware.
High-volume tests should carry clear labels, owners, kill switches, and automatic stop conditions. During an incident, responders need to answer quickly whether a test is adding material load and stop it without searching through unrelated deployment or job systems.
05
Keep third-party mitigations and local mitigations independent.
Waiting for a provider fix may be necessary, but it should not be the only recovery path. Slack also reduced request volume and increased cache capacity. Systems with provider dependencies need local controls that reduce demand while the provider-side investigation continues.
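The budgeting and kill-switch lessons can be sketched as a small admission controller. This is one possible shape under stated assumptions, not a description of Slack's system: the `KeyRequestBudget` class, the `Priority` names, and the per-window counting scheme are all hypothetical. Background classes are admitted only while total usage is below a lower ceiling, so the remaining slots in each window stay reserved for interactive requests, and any class can be paused outright.

```python
from enum import IntEnum


class Priority(IntEnum):
    INTERACTIVE = 0
    BACKFILL = 1
    LOAD_TEST = 2


class KeyRequestBudget:
    """Per-window request budget with headroom reserved for interactive work."""

    def __init__(self, per_window: int, background_ceiling: int):
        if not 0 <= background_ceiling <= per_window:
            raise ValueError("background_ceiling must be within the window budget")
        self.per_window = per_window
        self.background_ceiling = background_ceiling
        self.used = 0
        self.paused = set()

    def pause(self, priority: Priority) -> None:
        """Kill switch: reject every request of this class until resumed."""
        self.paused.add(priority)

    def resume(self, priority: Priority) -> None:
        self.paused.discard(priority)

    def start_new_window(self) -> None:
        """Reset usage; a real system would call this on a timer."""
        self.used = 0

    def try_acquire(self, priority: Priority) -> bool:
        if priority in self.paused:
            return False
        # Interactive work may use the whole window; other classes stop earlier.
        limit = (
            self.per_window
            if priority == Priority.INTERACTIVE
            else self.background_ceiling
        )
        if self.used >= limit:
            return False
        self.used += 1
        return True


budget = KeyRequestBudget(per_window=10, background_ceiling=6)
backfill_admitted = sum(budget.try_acquire(Priority.BACKFILL) for _ in range(20))
interactive_admitted = sum(budget.try_acquire(Priority.INTERACTIVE) for _ in range(20))
print(backfill_admitted, interactive_admitted)
```

In this sketch a burst of backfill requests can take at most six of ten slots, leaving four for interactive traffic in the same window. Calling `budget.pause(Priority.LOAD_TEST)` plays the role of the kill switch: the test traffic is rejected immediately while interactive requests continue to be admitted.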

Read the original.

EKM customers were previously experiencing issues with channel loading and message delivery
slack-status.com