FM-018 · AWS · 2026-05-08 · impact 20h 30m · SEV-2

The Overheated AWS Zone.

The outage was not region-wide, and that is what made it instructive. A thermal event in one Availability Zone still disrupted EC2, EBS, and downstream services whose architectures could not cleanly leave it.

us-east-1 · use1-az4 · thermal-event · ec2 · ebs · availability-zones · cooling

The cloud failure was not a region disappearing. It was narrower and more revealing: one AWS Availability Zone got too hot, servers protected themselves by shutting down, and workloads that still depended on that zone lost the ground underneath them. For teams that treat a region as resilient by default, this is the sharper lesson. A zone can fail cleanly on the provider side while an application still fails messily because one critical dependency never left it.

AWS reported the event in use1-az4, one Availability Zone inside US-EAST-1. The affected hardware hosted EC2 instances and EBS volumes, so the blast radius included both compute and block storage. That combination matters. If an instance disappears, a replacement can often launch elsewhere. If the attached volume is impaired too, recovery depends on replicas, snapshots, or another storage path that already exists.
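One practical wrinkle: use1-az4 is an Availability Zone ID, while most tooling shows zone names such as us-east-1a, and AWS maps names to IDs independently per account. The sketch below shows one way to translate the ID from an advisory into the name your own account uses. The helper names and the sample response are illustrative assumptions; only the EC2 DescribeAvailabilityZones call and its `ZoneName` and `ZoneId` fields come from the AWS API.

```python
from typing import Optional


def zone_name_for_id(response: dict, zone_id: str) -> Optional[str]:
    """Return this account's zone name for a zone ID, or None if not present."""
    for zone in response.get("AvailabilityZones", []):
        if zone.get("ZoneId") == zone_id:
            return zone.get("ZoneName")
    return None


def fetch_zones(region: str = "us-east-1") -> dict:
    """Call EC2 DescribeAvailabilityZones. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the pure helper above works without it

    return boto3.client("ec2", region_name=region).describe_availability_zones()


# Hypothetical response fragment. The name-to-ID pairing differs per account,
# so us-east-1a here is an example, not a statement about any real account.
sample = {
    "AvailabilityZones": [
        {"ZoneName": "us-east-1a", "ZoneId": "use1-az4"},
        {"ZoneName": "us-east-1b", "ZoneId": "use1-az6"},
    ]
}
print(zone_name_for_id(sample, "use1-az4"))
```

In a real account, `zone_name_for_id(fetch_zones(), "use1-az4")` would return whichever name that account maps to the affected zone.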

The event began on May 7, 2026, at 16:20 PDT, according to AWS updates later summarized by Network World. AWS identified impairments in use1-az4 at 00:25 UTC on May 8 and said other Availability Zones were not affected. As temperatures stayed high, AWS shifted traffic away from the impacted zone for most services, but recovery of affected EC2 instances and EBS volumes took longer because the physical environment had to become safe again before hardware could return.

That is what made the incident different from a bad deploy or a control-plane bug. There was no config rollback that could make overheated hardware safe. AWS had to bring additional cooling capacity online, wait for the environment to stabilize, restore power where it was safe, and recover infrastructure in phases. By 20:50 UTC on May 8, AWS reported that cooling capacity had returned to pre-event levels, which removed the physical blocker before the remaining recovery work finished.

Downstream impact depended less on whether a company used AWS and more on whether its critical path could survive the loss of that zone. ITPro reported Coinbase trading disruption lasting more than five hours, and other reports named CME Group and FanDuel. StatusGator correlated status pages and user reports across more than 150 cloud services with confirmed or likely AWS-related impact. The common pattern was not a region-wide outage. It was dependency concentration: compute, storage, brokers, databases, vendors, or recovery paths tied too tightly to the affected zone.

The useful takeaway is not that AWS customers should avoid US-EAST-1. It is that an Availability Zone is a failure boundary only if the application has no load-bearing dependency trapped inside it. The thermal event made that boundary visible: provider containment kept the physical problem zonal, but customer architectures still had to prove they could leave the failed zone without the failed zone's help.
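The "trapped dependency" idea can be checked mechanically against an inventory. What follows is a minimal sketch, assuming you can list each component alongside the zones where a usable copy already runs; the `Component` type, the `trapped_in` helper, and the inventory itself are hypothetical, not taken from any affected company.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    zones: frozenset  # zone IDs where a healthy, usable copy already runs


def trapped_in(components, failed_zone):
    """Names of components that have no usable copy outside failed_zone."""
    return [c.name for c in components if not (c.zones - {failed_zone})]


# Hypothetical inventory for illustration only.
inventory = [
    Component("web", frozenset({"use1-az1", "use1-az4", "use1-az6"})),
    Component("db-primary", frozenset({"use1-az4"})),
    Component("broker", frozenset({"use1-az4"})),
    Component("cache", frozenset({"use1-az1", "use1-az6"})),
]

print(trapped_in(inventory, "use1-az4"))  # ['db-primary', 'broker']
```

A non-empty result means the application cannot leave the zone cleanly, however many zones the web tier spans.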

"EC2 instances and EBS volumes hosted on impacted hardware are affected by the loss of power during the thermal event."
// AWS Health Dashboard update, quoted by ITPro and Network World

From the first signal to all-clear in 20h 30m.

00:20 UTC May 8
Thermal event begins affecting instances
AWS later said instance impairments in the affected zone began at 16:20 PDT on May 7. The issue was tied to loss of power during a thermal event in a single data center.
00:25 UTC
AWS identifies use1-az4 impairments
AWS reports instance impairments in use1-az4, one Availability Zone in US-EAST-1, and says other Availability Zones are not affected by the event.
01:47 UTC
Dependent services may also fail
AWS warns that services depending on the affected EC2 instances and EBS volumes in the zone may also experience impairments, and recommends using other Availability Zones where possible.
03:06 UTC
Recovery is slower than expected
AWS says work to bring additional cooling capacity online is taking longer than anticipated. Some services are improving, but remaining EC2 instances and EBS volumes still need controlled recovery.
05:11 UTC
Cooling systems show incremental recovery
AWS reports incremental progress restoring cooling systems while users continue seeing elevated error rates and latency for some workflows.
13:51 UTC
AWS explains the power loss
AWS says servers automatically shut down when temperatures exceeded operating thresholds to protect hardware. The power loss was the result of those protective shutdowns.
20:50 UTC
Cooling capacity restored
AWS reports that cooling system capacity returned to pre-event levels. This removed the physical blocker, but hardware and service recovery continued in phases.
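The 20h 30m figure can be reproduced from the timeline itself. This is a small arithmetic sketch, assuming the headline duration runs from the first timeline entry (00:20 UTC) to the last (20:50 UTC), both on May 8; the variable names are illustrative.

```python
from datetime import datetime, timezone

# First and last timeline entries, both in UTC on May 8, 2026.
first_signal = datetime(2026, 5, 8, 0, 20, tzinfo=timezone.utc)
cooling_restored = datetime(2026, 5, 8, 20, 50, tzinfo=timezone.utc)

elapsed = cooling_restored - first_signal
hours, remainder = divmod(int(elapsed.total_seconds()), 3600)
minutes = remainder // 60

print(f"{hours}h {minutes}m")  # 20h 30m
```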

A physical cooling failure exposed zonal concentration.

The failure started below the software stack. AWS reported increased temperatures inside a single data center in US-EAST-1, affecting resources in the use1-az4 Availability Zone. As temperatures exceeded operating thresholds, servers automatically shut down to protect hardware. That left EC2 instances and EBS volumes on impacted hardware impaired by power loss.

AWS contained the event at the Availability Zone level. Other Availability Zones were not affected by the thermal event, and AWS shifted traffic away from use1-az4 for most services. That containment limited the AWS-side blast radius, but containment did not automatically recover customer workloads that kept compute, storage, databases, brokers, or recovery paths tied to the affected zone.

Recovery depended on the physical environment before orchestration. AWS could not simply restart the affected hardware while temperatures remained above safe thresholds. Engineers had to bring additional cooling capacity online, restore the environment to safe operating conditions, and then recover EC2 instances, EBS volumes, and dependent service workflows in a controlled sequence.

What made a single-zone event visible across the internet.

01
The failed layer was physical infrastructure
A thermal event cannot be rolled back like a bad deployment. Until cooling capacity returned, AWS had to keep affected hardware powered down or impaired to protect it. That put a lower bound on recovery time.
02
EBS tied instance recovery to storage recovery
The event affected both EC2 instances and EBS volumes hosted on impacted hardware. Replacing compute alone is not enough when the attached block storage is degraded, unavailable, or recovering. Workloads that need both healthy instances and healthy volumes stay impaired until both layers recover or fail over.
03
Some service traffic could move faster than customer resources
AWS shifted traffic away from the affected zone for most services, and services such as IoT Core, Elastic Load Balancing, NAT Gateway, and Redshift improved earlier. Customers with EC2 instances and EBS volumes on affected hardware could still see resources as impaired until AWS completed hardware recovery.
04
Single-AZ assumptions persisted inside multi-AZ regions
The issue was zonal, not region-wide, but downstream impact was broad. Applications fail during a single-AZ event when a database primary, message broker, matching engine, cache, queue, or recovery process still depends on the failed zone.
05
US-EAST-1 carried heavy dependency weight
US-EAST-1 is one of AWS's most heavily used regions. A failure in one zone can show up across payment systems, data platforms, developer tools, analytics products, and consumer applications because many providers run critical pieces of their stack there.
06
Downstream visibility was fragmented
StatusGator found more than 150 cloud services with confirmed or likely AWS-related impact, but each provider described the problem differently. Some named AWS, EC2, EBS, US-EAST-1, or use1-az4; others used generic upstream-provider language. Because of that fragmentation, the customer-visible blast radius became clear only after many status pages were correlated.
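The fragmentation in point 06 is easy to picture as a text-classification problem. The sketch below is a hypothetical keyword classifier, not StatusGator's method: it separates status updates that name AWS or the affected resources from updates that only mention a generic upstream provider. The keyword lists and sample messages are assumptions for illustration.

```python
import re

# Terms that explicitly name the provider or the affected resources.
SPECIFIC = re.compile(
    r"\b(aws|amazon web services|ec2|ebs|us-east-1|use1-az4)\b", re.IGNORECASE
)
# Generic wording that points upstream without naming anyone.
GENERIC = re.compile(
    r"\b(upstream provider|cloud provider|hosting provider|third-party provider)\b",
    re.IGNORECASE,
)


def classify(status_text: str) -> str:
    """Label a status update by how explicitly it attributes the incident."""
    if SPECIFIC.search(status_text):
        return "names-aws"
    if GENERIC.search(status_text):
        return "generic-upstream"
    return "unrelated"


# Hypothetical status messages for illustration only.
updates = [
    "Elevated errors caused by an AWS incident in US-EAST-1.",
    "Our upstream provider is experiencing an outage.",
    "Scheduled maintenance completed.",
]
for text in updates:
    print(classify(text), "|", text)
```

Only the first category is attributable on its face; the second needs timing or other corroboration before it can be counted as "likely AWS-related".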

What to take from this incident.

01
Treat a single Availability Zone as a failure domain, not a capacity pool.
Multi-AZ architecture only works if the entire workload can continue when one zone is unavailable. Check compute, storage, databases, caches, brokers, queues, load balancers, secrets, deployment automation, and manual recovery steps. One pinned dependency can turn a zonal event into an application outage.
02
Test failover while the original storage remains impaired.
It is easier to recover when only compute disappears. Thermal and power events can impair block storage at the same time. Disaster recovery tests should assume the instance and its attached volume are both unavailable, then verify whether the service can restart from replicas, snapshots, or another zone without manual reconstruction.
03
Do not count traffic shifting as workload recovery.
A provider can route managed-service traffic away from a zone while customer-owned resources inside that zone remain impaired. Your runbooks should distinguish between provider-level mitigation, service API recovery, and recovery of the resources your application actually depends on.
04
Include physical-layer failures in cloud resilience planning.
Cloud outages are not only software bugs, control-plane failures, or networking events. Cooling, power, and facility constraints can dictate recovery time. Resilience plans should include scenarios where the provider cannot safely bring hardware back until the physical environment stabilizes.
05
Track dependency concentration outside your own architecture diagram.
Your service may be multi-AZ while a vendor, payment processor, data pipeline, or hosted database dependency is concentrated in one zone or one region. Keep an external dependency map that includes third-party status signals and known regional concentration, then test what happens when those dependencies fail.
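The second lesson, testing failover while storage stays impaired, can be rehearsed on paper before it is rehearsed in production. This is a minimal sketch, assuming both the instance and its attached volume in the failed zone are unusable; the `Workload` fields and the `recovery_path` helper are hypothetical names, and the example workloads are invented.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    home_zone: str                    # zone holding the instance and its volume
    replica_zones: tuple = ()         # zones with an up-to-date replica
    snapshot_available: bool = False  # assumed restorable in another zone


def recovery_path(w: Workload, failed_zone: str) -> str:
    """Pick a recovery path, assuming compute AND storage in failed_zone are gone."""
    if w.home_zone != failed_zone:
        return "unaffected"
    if any(zone != failed_zone for zone in w.replica_zones):
        return "fail over to replica"
    if w.snapshot_available:
        return "restore from snapshot in another zone"
    return "blocked until provider recovers hardware"


# Hypothetical workloads for illustration only.
fleet = [
    Workload("orders-db", "use1-az4", replica_zones=("use1-az6",)),
    Workload("reports", "use1-az4", snapshot_available=True),
    Workload("legacy-queue", "use1-az4"),
    Workload("search", "use1-az1"),
]
for w in fleet:
    print(f"{w.name}: {recovery_path(w, 'use1-az4')}")
```

Any workload that lands in the last branch is waiting on the provider's physical recovery, which in this incident meant waiting on cooling.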

Read the original.

The AWS outage explained: What happened, who was impacted, and what services are back online?
itpro.com
AWS hit by US-East-1 outage after data center thermal event
networkworld.com
AWS outage takes down more than 150 cloud services
statusgator.com