FM-018 · AWS · 2026-05-08 · impact 20h 30m · SEV-2

The Data Hall Cooling Failure Linked to 150-Plus Cloud Service Disruptions

How a chiller failure in one data hall shut down affected racks, why Coinbase's matching engine lost quorum after a single-building placement choice, and why recovery paths diverged between workloads, zonal data, and facility hardware.

us-east-1 use1-az4 thermal-event ec2 ebs availability-zones cooling

case study

A cooling failure in Northern Virginia took Coinbase's core exchange offline for more than five hours and disrupted CME Group and FanDuel. StatusGator linked the event to impairments across more than 150 downstream cloud services. Multiple chiller units failed simultaneously inside a single data hall, temperatures climbed past operating thresholds, and servers shut down automatically to protect hardware. Yet the physical incident was confined to use1-az4, one of six Availability Zones in US-EAST-1.

Inside that zone, EC2 instances and EBS volumes on the affected hardware were impaired. Workloads on the affected racks could not restart until cooling was restored. That restoration took roughly 20 hours and ran slower than AWS anticipated.

The dependency reach became visible quickly: AWS warned that services relying on EC2 or EBS resources in use1-az4 were also impaired. At 01:47 UTC, AWS shifted traffic away from the zone for most services. AWS later advised customers to restore from EBS snapshots or launch resources in unaffected zones. EBS volumes are scoped to one Availability Zone, so an instance in another zone cannot mount a volume from the failed zone. Customers with data on degraded volumes in use1-az4 could restore from a pre-existing EBS snapshot or wait for in-place hardware recovery.
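The zonal constraint on EBS volumes can be sketched as a small decision function. This is an illustrative model only, not an AWS API: `Volume`, `can_attach`, `recovery_options`, and the option strings are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Volume:
    """Hypothetical model of an EBS volume; not an AWS SDK type."""
    volume_id: str
    zone: str                              # an EBS volume lives in exactly one AZ
    latest_snapshot: Optional[str] = None  # snapshot ID, if one already exists


def can_attach(instance_zone: str, volume: Volume) -> bool:
    """An instance can only attach a volume that is in its own Availability Zone."""
    return instance_zone == volume.zone


def recovery_options(volume: Volume, impaired_zone: str) -> List[str]:
    """List the recovery paths described in the text for a single volume."""
    if volume.zone != impaired_zone:
        return ["no action needed: volume is outside the impaired zone"]
    options = []
    if volume.latest_snapshot is not None:
        # A snapshot can seed a new volume in any other zone of the same region.
        options.append(f"restore {volume.latest_snapshot} into an unaffected zone")
    options.append("wait for in-place hardware recovery")
    return options


with_snapshot = Volume("vol-a", "use1-az4", latest_snapshot="snap-1")
without_snapshot = Volume("vol-b", "use1-az4")
print(can_attach("use1-az6", with_snapshot))           # False
print(recovery_options(with_snapshot, "use1-az4"))     # two options
print(recovery_options(without_snapshot, "use1-az4"))  # waiting is the only option
```

The asymmetry between the two volumes is the point: the snapshot has to exist before the incident for the faster path to be available.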

Coinbase's incident report shows what that dependency reach looked like inside one customer's architecture. Coinbase used a Cluster Placement Group to keep its matching-engine nodes close together for low latency. That choice pinned the matching engine to a single building without automated failover to another Availability Zone. The matching engine used Raft consensus, a protocol that requires a majority of nodes to be reachable before the cluster can accept writes. When three of five Raft nodes lost power, only two remained, one short of the three-node majority; the cluster lost quorum, and cross-zone failover was impossible by design. A second failure compounded the first. Amazon MSK is AWS's managed Kafka service; Kafka distributes messages across topic partitions, and each partition is led by one broker. A defect in the MSK control plane blocked automatic partition-leader re-election, which extended Coinbase's outage to fees, quotations, ledger, payments, and data pipelines.
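The quorum arithmetic behind that failure is simple enough to check by hand. The sketch below assumes only the standard majority rule for Raft; the function names are hypothetical and nothing here reflects Coinbase's actual implementation.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority of the cluster."""
    return cluster_size // 2 + 1


def has_quorum(cluster_size: int, reachable: int) -> bool:
    """A Raft cluster can accept writes only while a majority is reachable."""
    return reachable >= majority(cluster_size)


cluster_size = 5
print(majority(cluster_size))                      # 3
print(has_quorum(cluster_size, cluster_size - 3))  # False: 2 of 5 is not a majority
print(has_quorum(cluster_size, cluster_size - 2))  # True: 5 nodes tolerate 2 failures
```

A five-node cluster tolerates two failures, so placing all five nodes in one building means a single facility event can take out more nodes than the protocol is able to absorb.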

While those failures accumulated, the AWS Health Dashboard described only symptoms: "Increased Error Rate and Latency." AWS did not publicly explain the physical mechanism (server shutdowns gated on cooling restoration) until more than 13 hours after impairments began. During the first half of the incident, customers could not know whether recovery depended on software, traffic routing, or facility infrastructure.

AWS also warned of longer-than-usual regional provisioning times, a signal that recovery capacity was constrained and new launches would not behave normally. Recovery moved in physical order: cooling first, then power to racks, then EC2 instances and EBS volumes, then dependent services. AWS's advisory to launch in unaffected zones was valid for stateless workloads with no zonal data anchors. Affected EBS volumes closed that escape hatch unless teams had already tested, permissioned, and rehearsed snapshot recovery.
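The physical recovery order can be expressed as a dependency graph and sorted topologically. This is a minimal sketch using Python's standard `graphlib`; the layer names are hypothetical labels for the stages named in the text, not AWS terminology.

```python
from graphlib import TopologicalSorter

# Each key depends on the layers in its set (hypothetical layer names).
DEPENDS_ON = {
    "rack_power": {"cooling"},
    "ec2_instances": {"rack_power"},
    "ebs_volumes": {"rack_power"},
    "dependent_services": {"ec2_instances", "ebs_volumes"},
}


def recovery_order(graph):
    """Return one valid order in which the layers can be brought back."""
    return list(TopologicalSorter(graph).static_order())


order = recovery_order(DEPENDS_ON)
print(order)
```

Whatever order the sorter picks for EC2 and EBS relative to each other, cooling always comes first and dependent services always come last, which is why no amount of software work could shorten the early hours of the incident.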

Coinbase encountered that distinction directly: restoring its matching engine required engineers to build a completely new node group by hand. AWS engineers separately performed manual MSK partition reassignments. Coinbase trading fully resumed at 03:49 ET. Cooling returned to pre-event levels at 20:50 UTC on May 8. By 03:04 UTC on May 9, AWS had restored the majority of impaired EC2 instances and EBS volumes, while a small number remained impaired.

The public record ends unevenly because AWS disclosed the cooling failure but not its cause or any corrective actions. Coinbase supplied the concrete public follow-through: a warm cross-zone standby for the matching engine and regular production failover exercises. Coinbase also committed to migrating its 2-AZ Kafka cluster to 3-AZ and building custom MSK tooling and runbooks to handle future control-plane failures. The dependency graph around use1-az4 ultimately mattered more than the regional boundary. This is why Multi-AZ cannot be treated as a badge. It is a claim about every required dependency in the request path, including vendor APIs, managed service control planes, shared databases, and third-party platforms.
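One way to test a Multi-AZ claim is to walk the request path and flag every component that runs in exactly one zone. The sketch below assumes a hand-maintained inventory; the service names, the `ZONES` and `CALLS` tables, and the function name are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical inventory: the zones each component runs in, and what it calls.
ZONES: Dict[str, Set[str]] = {
    "checkout-api": {"use1-az1", "use1-az2", "use1-az4"},
    "orders-db": {"use1-az1", "use1-az2"},
    "fraud-vendor": {"use1-az4"},          # third-party platform, single zone
    "queue-control-plane": {"use1-az4"},   # managed-service dependency
}
CALLS: Dict[str, List[str]] = {
    "checkout-api": ["orders-db", "fraud-vendor"],
    "orders-db": ["queue-control-plane"],
}


def single_zone_dependencies(service: str) -> Dict[str, str]:
    """Walk the request path and return components pinned to exactly one zone.

    Components missing from ZONES are skipped here; a real audit would
    probably want to report them as unknown instead.
    """
    pinned: Dict[str, str] = {}
    seen: Set[str] = set()
    stack = [service]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        zones = ZONES.get(name, set())
        if len(zones) == 1:
            pinned[name] = next(iter(zones))
        stack.extend(CALLS.get(name, []))
    return pinned


print(single_zone_dependencies("checkout-api"))
```

In this invented inventory, `checkout-api` itself spans three zones, yet two components on its request path sit only in use1-az4, one of them two hops away.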

timeline · UTC

From the first signal to all-clear in 20h 30m.

00:20 UTC

Instance and volume impairments begin

EC2 instances and EBS volumes in use1-az4 began degrading at 00:20 UTC on May 8 as chiller units failed and temperatures climbed inside a single data hall.

00:25 UTC

AWS identifies zonal impairments

AWS identified issues in use1-az4 by 00:25 UTC. The public health dashboard described the event as increased error rate and latency for EC2 in Northern Virginia.

00:53 UTC

Rising temperatures confirmed as cause

AWS confirmed temperatures had risen within a single data center, causing instance impairments in use1-az4.

01:47 UTC

Traffic shifts away from the zone

AWS shifted traffic away from the affected zone for most services and recommended customers use other Availability Zones, noting existing instances elsewhere remained unaffected.

03:06 UTC

Cooling recovery slower than anticipated

AWS reported that restoring temperatures was progressing more slowly than anticipated and that cooling had to be restored incrementally before affected services could recover. AWS directed customers to restore from EBS snapshots or launch replacement resources in other zones. The snapshot option required prior preparation: customers without a pre-existing snapshot had no immediate path for data on degraded volumes and had to wait for in-place hardware recovery.

04:12 UTC

Power returns to some infrastructure

AWS restored power to a subset of the affected infrastructure and observed stable signs of recovery.

05:11 UTC

Additional cooling enables rack recovery

Additional cooling capacity allowed recovery of some affected racks, with controlled recovery of the remaining racks underway.

13:51 UTC

Protective shutdown mechanism explained

More than 13 hours after impairments began, AWS explained that servers had automatically shut down when temperatures exceeded operating thresholds. This protective mechanism meant no workload could restart until cooling recovered.

Afternoon, May 8

Managed services recover ahead of raw compute

IoT Core, ELB, NAT Gateway, and Redshift showed significant recovery in their workflows while some EC2 instances and EBS volumes remained impaired.

20:50 UTC

Cooling capacity returns to pre-event levels

Cooling capacity returned to pre-event levels. This completed the physical prerequisite for full hardware recovery, more than 20 hours after impairments began.

03:04 UTC · May 9

Most impaired resources restored

The majority of impaired EC2 instances and EBS volumes were restored. A small number remained impaired as end-to-end recovery concluded roughly 27 hours after impairments began.
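The elapsed times quoted in the timeline can be checked with a few lines of date arithmetic. The timestamps below are taken from the entries above and the year from the incident header; the helper names are hypothetical.

```python
from datetime import datetime, timezone


def utc(day: int, hour: int, minute: int) -> datetime:
    """Build a UTC timestamp in May 2026 (the year comes from the incident header)."""
    return datetime(2026, 5, day, hour, minute, tzinfo=timezone.utc)


def elapsed(since: datetime, until: datetime) -> str:
    """Format the gap between two timestamps as hours and minutes."""
    total_minutes = int((until - since).total_seconds()) // 60
    return f"{total_minutes // 60}h {total_minutes % 60:02d}m"


start = utc(8, 0, 20)  # impairments begin
milestones = {
    "mechanism explained": utc(8, 13, 51),
    "cooling at pre-event levels": utc(8, 20, 50),
    "most resources restored": utc(9, 3, 4),
}

for label, when in milestones.items():
    print(f"{label}: {elapsed(start, when)}")
# mechanism explained: 13h 31m
# cooling at pre-event levels: 20h 30m
# most resources restored: 26h 44m
```

The 20h 30m impact figure therefore measures the time to restored cooling, while hardware recovery ran about six hours longer.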

lessons

What to take away.

Performance optimizations that pin components to a physical location, such as Cluster Placement Groups, must be explicitly accounted for in resilience design, because the co-location constraint that enables low latency also removes zone-loss tolerance.

Coinbase's post-mortem confirmed its matching engine ran in a Cluster Placement Group to minimize latency for trade consensus. When the data hall failed, the cluster lost quorum and automatic failover was impossible by design. The latency optimization was valid for a trading system, but its resilience implications had not been operationalized into a warm standby or failover procedure. The tension between latency and resilience should be a named architectural constraint, not a silent assumption.

multi_az_resilience

Periodically audit which of your service dependencies and vendor relationships are silently tied to a specific Availability Zone.

StatusGator correlated more than 150 cloud services affected by a single-zone event. Many of those services likely considered themselves resilient, but their supply chains included components with zonal concentration they had not mapped. Dependency audits must extend to vendor chains, not only first-party infrastructure.

vendor_blast_radius_audit

Maintain a rehearsed ability to evacuate traffic from an impaired Availability Zone.

AWS shifted traffic away from the affected zone for most services during mitigation. The public evidence does not quantify that action's effectiveness, so this is a grounded recommendation rather than a proven positive example.

zonal_traffic_evacuation
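A zonal evacuation amounts to removing one zone from the routing weights and rescaling the rest. The sketch below is a generic illustration with made-up weights; it does not represent how AWS performed its traffic shift, and `evacuate_zone` is a hypothetical helper.

```python
from typing import Dict


def evacuate_zone(weights: Dict[str, float], impaired_zone: str) -> Dict[str, float]:
    """Drop the impaired zone and rescale the remaining weights to sum to 1."""
    remaining = {zone: w for zone, w in weights.items() if zone != impaired_zone}
    total = sum(remaining.values())
    if total <= 0:
        raise ValueError("no healthy zone left to receive traffic")
    return {zone: w / total for zone, w in remaining.items()}


weights = {"use1-az1": 0.25, "use1-az2": 0.25, "use1-az4": 0.5}
print(evacuate_zone(weights, "use1-az4"))  # {'use1-az1': 0.5, 'use1-az2': 0.5}
```

In this example each surviving zone carries twice its previous share, so the rehearsal also has to confirm that the remaining zones have the capacity headroom to absorb it.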

Keep snapshot restoration and unaffected-zone launch paths operational for urgent recovery.

AWS advised customers needing immediate recovery to restore from EBS snapshots or launch resources in unaffected zones while the impaired hardware recovered. A recovery path is useful only if it is tested, permissioned, and fast enough to execute during an incident.

snapshot_based_recovery

Plan facility restoration and workload restoration as separate recovery stages.

Cooling systems improved incrementally and eventually returned to pre-incident capacity, while some EC2 instances and EBS volumes remained impaired and dependent workflows recovered unevenly. Recovery plans should track each layer independently.

layered_recovery_planning

sources