Library — Failure Modes — Failure Modes

/

sort

newsletter

Get the next incident.

One production failure case study every week. Read the pattern before it shows up in your own system.

2 results · filtered

topic: aws

id

incident

org

date

duration

severity

tags

When Slack's Autoscaler Made the Network Outage WorseOn the first Monday after the holiday break, an AWS Transit Gateway saturated under Slack's return-to-work traffic. Packet loss hit the web tier just as autoscaling tried to add 1,200 instances, and the provisioning service collapsed under its own quota and file-descriptor limits.

awstransit-gatewayautoscaling

When Heroku's Whole Platform Shared One AWS Failure DomainWhen AWS's US-East EBS storage entered a re-mirroring storm, Heroku's dynos, Heroku Postgres databases, and management API all failed together. The platform had no other region to run in and no path to recover without AWS.