Slack's first day back: a Transit Gateway runs out of room.
How a holiday return surge plus a saturated AWS Transit Gateway pushed Slack's web tier into a panic, why autoscaling spent the morning trying and failing to add servers it could not provision, and why a load-balancer feature called 'panic mode' kept the service from going fully dark.
Slack's web tier could absorb a traffic spike only if two other systems kept up. Cross-Availability-Zone traffic in US-East had to move through AWS-managed Transit Gateways fast enough that backend responses returned before Apache worker threads piled up. The internal provisioning service also had to configure and test new instances quickly enough for autoscaling to turn capacity requests into healthy servers. If the network slowed and provisioning stalled at the same time, autoscaling would keep asking for more servers while the platform could not finish adding them.
On the morning of January 4, 2021 — the first Monday back from the holiday break for much of the working world — the message send success rate dipped to about 99% just before 7 AM Pacific. Returning users hitting Slack with cold client caches piled load on the web tier at the same moment AWS networking inside the Transit Gateway began dropping packets. Web servers waited longer on backend responses. Apache worker threads piled up. CPU climbed. Autoscaling read the signals and started spinning up about 1,200 new instances.
The provisioning service had to bring those 1,200 instances up at once across a network that was already losing packets. It blew through the Linux open-file-descriptor limit on its host. It hit AWS API rate limits. Calls to its dependencies timed out. Instances that began the provisioning process could not finish it, and so they did not join the load balancer. Autoscaling kept trying to add capacity that nothing could turn into healthy servers. Slack's own monitoring became unreliable during this window because the dashboard and alerting services depended on the same Transit Gateways as the application.
What kept the service from going fully dark was a feature in Slack's load balancers called panic mode. When more than a threshold fraction of registered instances were failing health checks, the load balancer stopped respecting health checks and spread requests across every instance, on the assumption that some service on a partly broken fleet was better than none on a "healthy" empty pool. Panic mode plus retries and circuit breaking carried the customer experience while Slack disabled downscaling, cleaned broken instances out of the autoscaling groups, raised the file-descriptor limit, worked around AWS quota throttling, and got the provisioning service back into a state where new instances could finish their setup.
By 9:15 PT the web tier had enough healthy capacity that it was no longer saturated. Error rates dropped through 10:15. AWS completed a manual capacity increase on the affected Transit Gateways across all Availability Zones at 10:40, and packet loss cleared.
The traffic that caused this was not unprecedented. The first weekday after New Year's is one of Slack's heaviest concurrent-login windows every year. What the holiday spike exposed was the shape of the system: the autoscaler's response to a slow network was to flood that same network with provisioning traffic, the provisioning service had limits no one had hit under those conditions, and the load balancer's panic mode was the safety net that kept the service alive while the rest of the system caught up.
Luckily, our load balancers have a feature called 'panic mode' which balances requests across all instances when many are failing health checks.
Slack, "Slack's outage on January 4th, 2021"
Slack's message send success rate falls to around 99% — the normal baseline is above 99.999%. The drop coincides with an unusual pattern of returning users hitting Slack from the holiday break.
07:00 PT
Web tier saturates as packet loss climbs
AWS reports widespread packet loss in its networking infrastructure. A holiday-return mini-peak plus cold client caches multiply the load on Slack's web tier. Web servers spend longer waiting on backend responses and run out of Apache worker threads.
07:01 PT
Autoscaling fires for 1,200 new instances
Slack's autoscaling system reads the load signals and starts spinning up about 1,200 web servers over the next fifteen minutes. The provisioning service that configures and tests each new instance has to talk across a network that is dropping packets.
07:15 PT
Provisioning service collapses under its own limits
Provisioning 1,200 instances at once over a degraded network pushes the provisioning service past the Linux open-file-descriptor limit on its host. Calls to dependencies time out, and AWS API rate limits kick in as well. New instances cannot pass their configuration step and so never join the load balancer.
07:20 PT
Monitoring goes dark
Slack's dashboards and alerting run in a VPC that depends on the same Transit Gateways as the application. Monitoring becomes unreliable during the incident, slowing diagnosis.
07:30 PT
Load balancer panic mode keeps the service partially up
Slack's load balancers detect that a high fraction of registered instances are failing health checks and engage 'panic mode', which spreads requests across all instances regardless of health. The service is degraded but not down.
08:15 PT
Provisioning service repaired; new capacity begins joining
Slack disables downscaling to preserve any capacity that did make it in, clears broken instances out of the autoscaling groups, raises the file-descriptor limit, and works around AWS quota throttling. New, healthy instances start passing provisioning and joining the load balancer.
09:15 PT
Web tier has enough capacity; service degraded but improving
Enough healthy instances are now in service that the web tier is no longer saturated. Error rates fall and latency improves. The service is still degraded for some users while the system shakes out.
10:15 PT
Error rates back to low levels
User-visible error rates have dropped to low levels. Slack continues to work with AWS on the underlying network capacity.
10:40 PT
AWS finishes increasing TGW capacity
AWS completes a manual capacity increase on the affected Transit Gateways across all Availability Zones. Packet loss clears and Slack's network behavior returns to normal.
A network saturation event, met by the wrong autoscaling response.
Slack ran its US-East infrastructure in AWS, with traffic between Availability Zones flowing through AWS-managed Transit Gateways. The Transit Gateways are intended to scale transparently. On the first Monday back after the holiday break, returning users created a traffic shape one of the Transit Gateways could not absorb quickly enough. AWS networking began dropping packets, which raised latency between Slack's web tier and its backends and caused Apache worker threads to pile up waiting on slow requests.
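A back-of-the-envelope way to see why slower backends exhaust a bounded worker pool is Little's law: the number of busy workers is roughly the request rate times the mean response time. The sketch below uses illustrative numbers, not Slack's, to show how a tenfold latency increase turns a comfortable pool into a saturated one.

```python
# Back-of-the-envelope sketch of worker-pool saturation (Little's law):
# concurrent busy workers ~= request rate x mean response time.
# All numbers are illustrative, not Slack's.

def busy_workers(request_rate_per_s: float, mean_response_time_s: float) -> float:
    """Approximate number of workers tied up at a steady request rate."""
    return request_rate_per_s * mean_response_time_s

POOL_SIZE = 400  # hypothetical Apache MaxRequestWorkers on one web server

for label, latency_s in [("healthy backend", 0.1), ("lossy network", 1.0)]:
    needed = busy_workers(600, latency_s)
    state = "ok" if needed < POOL_SIZE else "saturated: requests queue"
    print(f"{label}: ~{needed:.0f} busy workers of {POOL_SIZE} -> {state}")
```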
Slack's autoscaling system saw the symptom — saturated workers, climbing CPU — and did the thing autoscaling does. It tried to add about 1,200 new web servers. Each one had to be configured and tested by Slack's internal provisioning service, which now had to do twelve hundred setups in parallel across a degraded network. The provisioning service blew past its open-file-descriptor limit on the host, hit AWS API rate limits, and could not finish bringing instances up. Autoscaling kept trying to add capacity that nothing could turn into healthy servers.
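The file-descriptor ceiling the provisioning service hit is a host limit that can be inspected and raised at runtime. Here is a minimal sketch using Python's resource module; it is illustrative only, since Slack has not published how its provisioning service is built, and lifting the hard limit still requires privileges or a change to the host's ulimit/systemd configuration.

```python
# Minimal sketch: inspect and raise this process's open-file-descriptor limit.
# A process may raise its soft limit up to the hard limit; raising the hard
# limit needs root or a host-level (ulimit / systemd LimitNOFILE) change.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-fd limit: soft={soft} hard={hard}")

# Provisioning hundreds of instances in parallel holds sockets, log files, and
# subprocess pipes open at once, so lift the soft limit before the burst.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
print("soft limit raised to", resource.getrlimit(resource.RLIMIT_NOFILE)[0])
```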
What kept the service from going fully dark was a feature in Slack's load balancers called panic mode. When more than a threshold fraction of instances are failing health checks, the load balancer stops respecting health checks and spreads traffic across every instance, on the assumption that some traffic surviving on a partly-broken fleet is better than no traffic on a "healthy" empty pool. Panic mode plus retries and circuit breaking held the service together while Slack repaired the provisioning service, AWS manually increased Transit Gateway capacity, and new instances finally joined the pool.
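The panic-mode decision itself is small. The sketch below shows the idea with a simple random-pick balancer and an assumed 50% threshold; it is not Slack's implementation, though Envoy exposes comparable behavior as a healthy-panic threshold.

```python
# Minimal sketch of a panic-mode host-selection decision. Not Slack's code;
# the threshold and random-pick strategy are assumptions for illustration.
import random
from dataclasses import dataclass

PANIC_THRESHOLD = 0.5  # assumed fraction of healthy hosts below which we panic

@dataclass
class Host:
    name: str
    healthy: bool

def pick_host(hosts: list[Host]) -> Host:
    healthy = [h for h in hosts if h.healthy]
    # If too few hosts pass health checks, assume the checks are lying (or the
    # failure is external) and spread requests across every registered host.
    if len(healthy) / len(hosts) < PANIC_THRESHOLD:
        return random.choice(hosts)      # panic mode: ignore health checks
    return random.choice(healthy)        # normal mode: healthy hosts only

fleet = [Host(f"web-{i}", healthy=(i % 5 == 0)) for i in range(20)]
print("routing to:", pick_host(fleet).name)
```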
What turned a holiday return into a 3.5-hour degradation.
01
Transit Gateway capacity did not keep up with the spike
AWS Transit Gateways are designed to scale transparently, but the holiday-return traffic shape grew faster than the gateway adjusted. Slack had no advance signal of how close the gateway was to saturation, and no obvious way to pre-warm it from outside AWS.
02
Autoscaling response made the underlying problem worse
When the network slowed, autoscaling tried to provision twelve hundred new servers across the same degraded network. The corrective action competed with the workload for the same network capacity, and neither completed cleanly. The autoscaler had no signal that 'add more servers' was the wrong move when the network was the limit.
03
Provisioning service had host-level limits no one had hit before
The provisioning service was responsible for configuring and testing new instances, and had never been asked to handle 1,200 in parallel before. It ran into a Linux open-file-descriptor limit and an AWS API rate limit. Both were well-known kinds of limits; neither had a runbook for raising it on the fly during an incident.
04
Monitoring depended on the same VPC and Transit Gateways
Slack's dashboarding and alerting infrastructure ran in a VPC that depended on the same Transit Gateways as the application. Monitoring became unreliable during the incident, slowing diagnosis. The tooling responders needed to understand the problem rode the same network that was broken.
05
Predictable annual surge had not been pre-provisioned
The first weekday after New Year's is consistently one of Slack's heaviest concurrent-login windows. The traffic pattern was known. Pre-warming web tier capacity, pre-warming the provisioning service, and coordinating with AWS on Transit Gateway capacity ahead of the surge would have removed the conditions the incident needed.
01
Pre-provision for the calendar events you can see coming.
Annual return-from-holiday traffic and other predictable surges are an opportunity to convert an incident into a non-event. Scale web tier capacity, warm the provisioning path, and coordinate cloud-side capacity (Transit Gateway, load balancer, NAT) in the days before the spike rather than fighting it on the morning.
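On AWS, one concrete way to act on this is a scheduled scaling action that raises the web tier's floor the day before a known surge. The sketch below uses boto3; the Auto Scaling group name, sizes, and timestamp are illustrative assumptions, not Slack's configuration.

```python
# Sketch: schedule extra web-tier capacity ahead of a known surge with an
# EC2 Auto Scaling scheduled action. Group name, sizes, and the timestamp
# are illustrative placeholders.
from datetime import datetime, timezone
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="webapp",                 # hypothetical ASG name
    ScheduledActionName="new-year-return-prewarm",
    StartTime=datetime(2022, 1, 2, 12, 0, tzinfo=timezone.utc),  # day before
    MinSize=900,                                   # hold a higher floor
    DesiredCapacity=1200,
    MaxSize=2000,
)
```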
02
Make autoscaling aware of the limit it is hitting.
Autoscaling that responds to slow backends by adding more frontends will compound a network problem instead of resolving it. Augment autoscaling signals with network and dependency health, so the autoscaler stops adding capacity when adding capacity is not the answer, and consider rate-limiting how quickly it can grow during volatile periods.
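A minimal sketch of such a guard: before honoring a scale-out request, check network and provisioning health and cap the growth rate. The two fetch_* helpers are hypothetical stand-ins for whatever telemetry is actually available.

```python
# Sketch of a guard in front of scale-out decisions: veto or slow growth when
# the bottleneck is the network or the provisioning path, not raw capacity.
# Thresholds and the fetch_* helpers are illustrative assumptions.

MAX_GROWTH_PER_CYCLE = 0.10      # grow the fleet at most 10% per pass
PACKET_LOSS_CEILING = 0.02       # "the network is the problem" threshold
PROVISION_FAILURE_CEILING = 0.20

def fetch_cross_az_packet_loss() -> float:
    return 0.0   # placeholder; wire to real network telemetry

def fetch_provisioning_failure_rate() -> float:
    return 0.0   # placeholder; wire to the provisioning service's own metrics

def approved_scale_out(current: int, requested: int) -> int:
    if fetch_cross_az_packet_loss() > PACKET_LOSS_CEILING:
        return current               # adding servers will not fix the network
    if fetch_provisioning_failure_rate() > PROVISION_FAILURE_CEILING:
        return current               # new instances would not come up healthy
    cap = int(current * (1 + MAX_GROWTH_PER_CYCLE))
    return min(requested, cap)       # rate-limit growth during volatile periods

print(approved_scale_out(current=3000, requested=4200))
```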
03
Test the provisioning path at the rates incidents will demand.
An autoscaler that asks the provisioning service to bring up 1,200 instances at once is a load test of that service. Run that test deliberately, find the file-descriptor and rate-limit ceilings, and either raise them ahead of time or document the runbook for raising them under pressure.
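The test can be as simple as driving the provisioning path at incident-scale concurrency and recording where the first ceiling appears. In the sketch below, provision_instance is a hypothetical stand-in for the real provisioning entry point.

```python
# Sketch: exercise the provisioning path at the concurrency an incident would
# demand and report how many setups complete. provision_instance() is a
# hypothetical placeholder for the real provisioning entry point.
from concurrent.futures import ThreadPoolExecutor, as_completed

def provision_instance(instance_id: str) -> bool:
    # placeholder: configure, test, and register one instance
    return True

TARGET_PARALLELISM = 1200   # the burst the January 4th autoscaler asked for

results = []
with ThreadPoolExecutor(max_workers=TARGET_PARALLELISM) as pool:
    futures = [pool.submit(provision_instance, f"i-{n:05d}")
               for n in range(TARGET_PARALLELISM)]
    for fut in as_completed(futures):
        try:
            results.append(fut.result())
        except OSError as exc:           # e.g. "Too many open files"
            results.append(False)
            print("ceiling hit:", exc)

print(f"{sum(results)}/{len(results)} instances provisioned successfully")
```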
04
Build a load balancer 'panic mode' for the days health checks lie.
When a high fraction of instances are failing health checks because of an external problem, sending traffic only to the 'healthy' pool sends it to almost nothing. A panic mode that spreads traffic across every instance during widespread health-check failures keeps the service partly available while the underlying issue is investigated.
05
Keep monitoring on a separate path from the system it monitors.
If your dashboards, alerting, and on-call paging share VPC, network, or auth with the system that is down, you will lose them at exactly the worst time. Host the observability stack so it can keep working when the application's network does not.
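At its simplest, the separate path can be an external watchdog that runs outside the application's account and network, probes a public health endpoint, and pages through an independent channel. Both URLs in the sketch below are hypothetical placeholders.

```python
# Sketch of an out-of-band watchdog: runs outside the application's VPC and
# cloud account, probes a public health endpoint, and pages through an
# independent channel. Both URLs are hypothetical placeholders.
import json
import urllib.request

HEALTH_URL = "https://status-probe.example.com/api/health"   # hypothetical
PAGE_WEBHOOK = "https://pager.example.com/hooks/oncall"       # hypothetical

def probe() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

if not probe():
    body = json.dumps({"alert": "external probe failed"}).encode()
    req = urllib.request.Request(PAGE_WEBHOOK, data=body,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=5)
```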