The Data Hall Cooling Failure Linked to 150-Plus Cloud Service Disruptions
How a chiller failure in one data hall shut down affected racks, why Coinbase's matching engine lost quorum after a single-building placement choice, and why recovery paths diverged between workloads, zonal data, and facility hardware.
A cooling failure in Northern Virginia took Coinbase's core exchange dark for more than five hours and disrupted CME Group and FanDuel. StatusGator linked the event to impairments across more than 150 downstream cloud services. Multiple chiller units failed simultaneously inside a single data hall. Temperatures climbed past operating thresholds, and servers shut down automatically to protect hardware. Yet the physical incident was confined to use1-az4, one of six Availability Zones in US-EAST-1.
Inside that zone, EC2 instances and EBS volumes on the affected hardware were impaired. Workloads on the affected racks could not restart until cooling was restored, which took roughly 20 hours and progressed more slowly than AWS anticipated.
The dependency reach became visible quickly: AWS warned that services relying on EC2 or EBS resources in use1-az4 were also impaired. At 01:47 UTC, AWS shifted traffic away from the zone for most services. AWS later advised customers to restore from EBS snapshots or launch resources in unaffected zones. EBS volumes are scoped to one Availability Zone, so an instance in another zone cannot mount a volume from the failed zone. Customers with data on degraded volumes in use1-az4 could restore from a pre-existing EBS snapshot or wait for in-place hardware recovery.
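The zonal constraint described above can be sketched as a small decision function. This is an illustrative model only, not an AWS API: the `Volume` type, its fields, the `recovery_options` function, and the example zone `use1-az1` are all hypothetical names chosen for the sketch. It also simplifies by treating every volume in the failed zone as impaired.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Volume:
    """Hypothetical model of an EBS volume; not an AWS SDK type."""
    volume_id: str
    zone: str            # an EBS volume lives in exactly one Availability Zone
    has_snapshot: bool   # True only if a snapshot existed before the failure


def recovery_options(volume: Volume, failed_zone: str, target_zone: str) -> List[str]:
    """List the recovery paths the article describes for a single volume."""
    if volume.zone != failed_zone:
        # Simplification: volumes outside the failed zone are treated as healthy.
        return ["no action needed"]
    options = []
    if volume.has_snapshot and target_zone != failed_zone:
        # A snapshot can be restored as a new volume in a different zone.
        options.append(f"restore snapshot into {target_zone}")
    # The volume itself cannot be attached from another zone, so the only
    # remaining path is waiting for the hardware to come back.
    options.append("wait for in-place hardware recovery")
    return options


with_snapshot = Volume("vol-example-a", "use1-az4", has_snapshot=True)
without_snapshot = Volume("vol-example-b", "use1-az4", has_snapshot=False)

print(recovery_options(with_snapshot, "use1-az4", "use1-az1"))
print(recovery_options(without_snapshot, "use1-az4", "use1-az1"))
```

The second call returns only the wait option, which is the position of a customer with no snapshot taken before the failure.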
Coinbase's incident report shows what that dependency reach looked like inside one customer's architecture. Coinbase used a Cluster Placement Group to keep its matching-engine nodes close together for low latency. That choice pinned the matching engine to a single building without automated failover to another Availability Zone. The matching engine used Raft consensus, a protocol that requires a majority of nodes to be reachable before the cluster can accept writes. When three of five Raft nodes lost power, the two survivors fell one short of the three-node majority, so the cluster lost quorum, and the single-building placement made cross-zone failover impossible by design. A second failure compounded the first. Amazon MSK is AWS's managed Kafka service; Kafka distributes messages across topic partitions, and each partition is led by one broker. A defect in the MSK control plane blocked automatic partition-leader reelection, which extended Coinbase's outage to fees, quotations, ledger, payments, and data pipelines.
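The quorum arithmetic behind that failure is short enough to check directly. The sketch below is a minimal illustration of the majority rule, not Coinbase's implementation; the function names are invented for the example, and only the five-node, three-lost figures come from the incident report as summarized above.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def has_quorum(cluster_size: int, reachable: int) -> bool:
    """A Raft cluster can accept writes only while a majority is reachable."""
    return reachable >= majority(cluster_size)


cluster_size = 5
nodes_lost = 3
reachable = cluster_size - nodes_lost

print(majority(cluster_size))               # 3
print(has_quorum(cluster_size, reachable))  # False: 2 reachable, 3 required
print(has_quorum(cluster_size, 3))          # True: losing only 2 nodes is survivable
```

A five-node cluster tolerates the loss of two nodes, so the third lost node is what stopped writes.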
While those failures accumulated, the AWS Health Dashboard described only symptoms: "Increased Error Rate and Latency." AWS did not publicly explain the physical mechanism (servers had shut down and could not restart until cooling was restored) until more than 13 hours after impairments began. During the first half of the incident, customers could not know whether recovery depended on software, traffic routing, or facility infrastructure.
AWS also warned of longer-than-usual regional provisioning times, a signal that recovery capacity was constrained and new launches would not behave normally. Recovery moved in physical order: cooling first, then power to racks, then EC2 instances and EBS volumes, then dependent services. AWS's advisory to launch in unaffected zones was valid for stateless workloads with no zonal data anchors. Affected EBS volumes closed that escape hatch unless teams had already tested, permissioned, and rehearsed snapshot recovery.
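The physical recovery order can be expressed as a dependency chain, where each stage is blocked until the stage beneath it is healthy. The sketch below uses Python's standard `graphlib` module (Python 3.9+) purely as an illustration; the stage names paraphrase the paragraph above, and the graph is an assumption about how one might model the sequence, not something AWS published.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of stages that must recover before it can.
depends_on = {
    "cooling": set(),
    "rack power": {"cooling"},
    "EC2 instances and EBS volumes": {"rack power"},
    "dependent services": {"EC2 instances and EBS volumes"},
}

# static_order() yields every stage after all of its prerequisites.
recovery_order = list(TopologicalSorter(depends_on).static_order())

for step, stage in enumerate(recovery_order, start=1):
    print(step, stage)
```

Because the chain is linear, nothing above cooling could be accelerated by software or routing changes alone.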
Coinbase encountered that distinction directly: restoring its matching engine required engineers to build a completely new node group by hand. AWS engineers separately performed manual MSK partition reassignments. Coinbase trading fully resumed at 03:49 ET. Cooling returned to pre-event levels at 20:50 UTC on May 8. By 03:04 UTC on May 9, AWS had restored the majority of impaired EC2 instances and EBS volumes, while a small number remained impaired.
The public record ends unevenly because AWS disclosed the cooling failure but not its cause or any corrective actions. Coinbase supplied the concrete public follow-through: a warm cross-zone standby for the matching engine and regular production failover exercises. Coinbase also committed to migrating its 2-AZ Kafka cluster to 3-AZ and building custom MSK tooling and runbooks to handle future control-plane failures. The dependency graph around use1-az4 ultimately mattered more than the regional boundary. This is why Multi-AZ cannot be treated as a badge. It is a claim about every required dependency in the request path, including vendor APIs, managed service control planes, shared databases, and third-party platforms.
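One way to treat Multi-AZ as a claim rather than a badge is to test it against every dependency in the request path. The sketch below is a hypothetical audit, not a tool from AWS or Coinbase: the service names, their zone placements, and both function names are invented for illustration.

```python
from typing import Dict, List, Set


def single_zone_dependencies(deps: Dict[str, Set[str]]) -> List[str]:
    """Return the dependencies that run in fewer than two zones."""
    return sorted(name for name, zones in deps.items() if len(zones) < 2)


def survives_zone_loss(deps: Dict[str, Set[str]], failed_zone: str) -> bool:
    """True only if every dependency still has a zone left after the failure."""
    return all(zones - {failed_zone} for zones in deps.values())


# Hypothetical request path: every entry here is made up for the example.
request_path = {
    "api": {"use1-az1", "use1-az2", "use1-az4"},
    "order-matcher": {"use1-az4"},
    "message-bus": {"use1-az2", "use1-az4"},
}

print(single_zone_dependencies(request_path))        # ['order-matcher']
print(survives_zone_loss(request_path, "use1-az4"))  # False
print(survives_zone_loss(request_path, "use1-az1"))  # True
```

A zone count is only a first filter. A dependency that spans two zones still fails the claim if its failover path does not work, as the MSK leader-reelection defect showed.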
Timeline: from the first signal to restored cooling in 20h 30m, with hardware recovery concluding roughly 27 hours after impairments began.
EC2 instances and EBS volumes in use1-az4 began degrading at 00:20 UTC on May 8 as chiller units failed and temperatures climbed inside a single data hall.
AWS identified issues in use1-az4 by 00:25 UTC. The public health dashboard described the event as increased error rate and latency for EC2 in Northern Virginia.
AWS confirmed temperatures had risen within a single data center, causing instance impairments in use1-az4.
At 01:47 UTC, AWS shifted traffic away from the affected zone for most services and recommended that customers use other Availability Zones, noting that existing instances elsewhere remained unaffected.
AWS reported that restoring temperatures was progressing more slowly than anticipated and that cooling had to be restored incrementally before affected services could recover. AWS directed customers to restore from EBS snapshots or launch replacement resources in other zones. That option required prior preparation: customers with data on affected volumes and no pre-existing snapshot had no path other than waiting for in-place hardware recovery.
AWS restored power to a subset of the affected infrastructure and observed stable signs of recovery.
Additional cooling capacity allowed recovery of some affected racks, with controlled recovery of the remaining racks underway.
More than 13 hours after impairments began, AWS explained that servers had automatically shut down when temperatures exceeded operating thresholds. This protective mechanism meant no workload on the affected racks could restart until cooling recovered.
IoT Core, ELB, NAT Gateway, and Redshift showed significant recovery while some EC2 instances and EBS volumes remained impaired.
Cooling capacity returned to pre-event levels at 20:50 UTC on May 8. This completed the physical prerequisite for full hardware recovery, more than 20 hours after impairments began.
By 03:04 UTC on May 9, the majority of impaired EC2 instances and EBS volumes had been restored. A small number remained impaired as end-to-end recovery concluded roughly 27 hours after impairments began.
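The two durations quoted in this timeline can be checked from the three timestamps the article gives: 00:20 UTC and 20:50 UTC on May 8, and 03:04 UTC on May 9. The sketch below expresses each as an offset from midnight UTC on May 8 so that no year has to be assumed; the variable and function names are illustrative.

```python
from datetime import timedelta

# Offsets from 00:00 UTC on May 8, using the timestamps quoted in the article.
first_signal = timedelta(hours=0, minutes=20)              # 00:20 UTC, May 8
cooling_restored = timedelta(hours=20, minutes=50)         # 20:50 UTC, May 8
majority_restored = timedelta(days=1, hours=3, minutes=4)  # 03:04 UTC, May 9


def hours_minutes(delta: timedelta) -> tuple:
    """Convert a duration into whole (hours, minutes)."""
    total_minutes = int(delta.total_seconds()) // 60
    return divmod(total_minutes, 60)


to_cooling = hours_minutes(cooling_restored - first_signal)
to_recovery = hours_minutes(majority_restored - first_signal)

print(to_cooling)   # (20, 30)
print(to_recovery)  # (26, 44)
```

The 20h 30m figure therefore measures the time to restored cooling, while the roughly 27-hour figure measures the time until most EC2 instances and EBS volumes were back.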