FM-009 · OpenAI · 2024-12-11 · impact 4h 22m · SEV-1

A telemetry rollout takes down ChatGPT for four hours.

How a telemetry service intended to improve observability instead made every node in every large cluster hammer the Kubernetes API server, why service discovery failed when the control plane saturated, and why the rollback path required the same API server that the rollout had just overwhelmed.

deploy · kubernetes · cascade

The Kubernetes API server was not in front of ChatGPT users, but it kept the user-facing path alive. OpenAI's production clusters used it to schedule pods, restart unhealthy workloads, and serve DNS queries that applications needed to find their dependencies. As long as that control plane answered quickly enough, the data plane could keep moving traffic. When it saturated, scheduling stopped, health checks timed out, and service discovery broke at the same time. Existing pods could serve for a while, but only while the routes and dependencies they already knew remained valid.

On the afternoon of December 11, 2024, OpenAI deployed a new telemetry service across its production clusters. The intent was better observability of the Kubernetes control plane itself. The service was configured so that every node in each cluster executed a set of resource-intensive Kubernetes API operations on a regular interval. The cost of those operations scaled with the size of the cluster. On staging clusters, the cost was small. On the largest production clusters, with thousands of nodes, the aggregate load was enough to saturate the API servers within minutes of the rollout.
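The postmortem does not include the telemetry service's code, but the pattern it describes, every node issuing an expensive cluster-wide query on a timer, can be sketched with the standard Go client. Everything specific below (the 30-second interval, the choice to list pods, the unpaginated request) is an illustrative assumption; the point is that each node's request returns every object in the cluster, so total API-server work grows with nodes times objects instead of staying flat.

```go
// Hypothetical sketch of the per-node collection pattern described above.
// Each node runs this loop; the List call returns every pod in the cluster,
// so aggregate API-server load grows with nodes × pods rather than staying flat.
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster config: the agent authenticates with its service account.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Every interval, every node issues the same cluster-wide, unpaginated List.
	// On a small staging cluster this is cheap; multiplied across thousands of
	// production nodes, the aggregate load can saturate the API servers.
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(
			context.Background(), metav1.ListOptions{}) // no Limit, no pagination
		if err != nil {
			log.Printf("list failed: %v", err)
			continue
		}
		log.Printf("collected metadata for %d pods", len(pods.Items))
	}
}
```

On a ten-node staging cluster this loop is invisible. On a cluster with thousands of nodes and tens of thousands of pods, the same code multiplies into the saturation described above.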

The cascade was straightforward once the control plane saturated. Pod scheduling stalled. Health checks took too long to complete. DNS-based service discovery, which depended on the same Kubernetes control plane, broke for the rest of the workloads in the cluster. Applications could not find their dependencies. ChatGPT began returning errors at 23:16 UTC. The API returned 503s. Sora stopped accepting requests. Engineering identified the deployment as the trigger within minutes and decided to roll it back.

The rollback was where the incident stopped being short. Removing a Kubernetes workload normally means asking the API server to do it. The API server was the thing that had failed. Standard kubectl commands could not get through, or got through and timed out. The team had to find direct access paths that bypassed the API server to remove the telemetry workload from the largest clusters by hand. As the offending operations stopped, the API server load came down and service discovery began to return. Pods started scheduling again, and the backlog of pending restarts drained gradually.

By 03:00 UTC on December 12, ChatGPT and the API were serving traffic again. Full resolution came at 03:38 UTC, four hours and twenty-two minutes after the first errors. A telemetry service was the trigger, but the durable lesson is the coupling: the workload's cost scaled with a production-only property of cluster size, the API server had no per-workload rate limit to protect itself, and the rollback path ran through the system that had failed. Any one of those conditions might have been recoverable in minutes. Together they turned a monitoring deployment into a four-hour outage.

A new telemetry service configuration that unexpectedly generated massive Kubernetes API load across large clusters, overwhelming the control plane and breaking DNS-based service discovery.
// OpenAI, December 11, 2024 incident report

From the first signal to all-clear in 4h 22m.

Dec 11, 23:12 UTC
New telemetry service deployed to production
OpenAI deploys a new telemetry service intended to collect detailed Kubernetes control plane metrics. The service is rolled out across production clusters, not staged to a small slice first.
Dec 11, 23:16 UTC
Outage begins; API servers overwhelmed
Within four minutes of deployment, every node in each cluster begins executing resource-intensive Kubernetes API operations whose cost scales with the size of the cluster. On the largest clusters, the cumulative load saturates the Kubernetes API servers.
Dec 11, 23:20 UTC
ChatGPT, API, and Sora go down
Pod scheduling stalls, health checks fail, and DNS-based service discovery — which relies on the same Kubernetes control plane — breaks. ChatGPT begins returning errors, the API returns 503s, and Sora stops accepting requests.
Dec 11, 23:25 UTC
Full outage identified; rollback decided
Engineering identifies the telemetry deployment as the trigger and decides to roll it back. The rollback requires reaching the Kubernetes control plane to remove the offending service, but the control plane is the thing that has failed.
Dec 12, 00:30 UTC
Break-glass removal of telemetry workload
Engineers use direct access paths that bypass the normal Kubernetes API to remove the telemetry workload from the largest clusters. API server load begins to drop as the offending operations stop.
Dec 12, 01:30 UTC
Service discovery returns on the largest clusters
With API server load coming down, DNS-based service discovery starts to recover on the largest clusters. Pods begin scheduling again, but the backlog of pending restarts means recovery is gradual.
Dec 12, 03:00 UTC
ChatGPT and API serving traffic again
ChatGPT begins serving traffic and the API stops returning 503s in most regions. Some customers continue to see elevated errors as the rest of the fleet drains its backlog.
Dec 12, 03:38 UTC
Incident fully resolved
All services return to normal operation. Total duration from first errors to full recovery: 4 hours 22 minutes.

A telemetry workload that treated the control plane as a data plane.

OpenAI deployed a new telemetry service across its production Kubernetes clusters to collect detailed control plane metrics. The service's configuration caused every node in each cluster to execute resource-intensive Kubernetes API operations whose cost scaled with the size of the cluster. On OpenAI's largest clusters — thousands of nodes — the aggregate load overwhelmed the API servers. The control plane went from healthy to saturated within minutes of the deployment.

Once the API servers were saturated, the rest of the system unraveled in expected ways. Pod scheduling stalled. Health checks could not complete. DNS-based service discovery, which depended on the same control plane, broke. The application layer hollowed out faster than the control plane issue could be diagnosed: pods failing health checks did not get restarted, and services that needed to resolve new endpoints could not. ChatGPT, the API, and Sora went offline.

The rollback path went through the same control plane that had just failed. Removing a Kubernetes workload normally means asking the API server to do it; with the API server saturated, the normal path was not available. The team had to reach for direct access paths that bypassed the API server to remove the telemetry workload from the largest clusters before the control plane could recover. The time spent finding and using those break-glass paths is most of why an incident detected in minutes took more than four hours to clear.

What turned a monitoring deployment into a four-hour outage.

01
Cluster-wide deployment with no canary
The telemetry service was deployed to every production cluster at once, without a small-cluster or single-cluster canary. A canary against one cluster — even one of the largest — would have surfaced the API server saturation while confining the blast radius to that single cluster.
02
Workload cost scaled with cluster size, untested at full scale
The telemetry workload's per-node operations were inexpensive on small staging clusters and expensive on large production clusters. Behavior in staging did not predict behavior in production because cost scaled with the thing that was different between them. Load testing the workload on a production-shaped cluster would have caught the cost curve before the rollout.
03
No rate limiting on adversarial Kubernetes API load
The Kubernetes API server had enough capacity for normal cluster operations but not for an unthrottled telemetry workload generating cluster-scaled requests. There were no resource quotas or rate limits on the new service's API access that would have capped its impact regardless of its configuration; a sketch of one such client-side cap follows this list.
04
DNS-based service discovery rode the same control plane
Pods inside the cluster relied on DNS resolution that ultimately depended on the Kubernetes control plane. When the API servers saturated, DNS resolution broke, and applications lost the ability to find their dependencies even when those dependencies were still running. The control plane was not just slow to schedule new work; it was also slow to answer basic resolution questions.
05
Rollback gated on the failing infrastructure
Removing a Kubernetes workload requires Kubernetes to be functional. When the control plane is the thing that is failing, the standard remediation path is unavailable. A pre-built break-glass procedure for direct workload removal would have shortened the time between identifying the trigger and stopping it.
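Factor 03 is partly addressable on the consumer side. As a minimal sketch under illustrative settings, client-go lets a workload cap its own request rate and page its list calls, so a misconfigured loop throttles itself instead of the API server. The QPS, burst, and page-size values below are assumptions, not recommendations, and server-side protection such as API Priority and Fairness is still needed to contain clients that do not opt in.

```go
// Minimal sketch: cap a single consumer's API-server load with client-go's
// built-in client-side rate limiter and paginated list requests.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}

	// Cap this client at 5 requests/second with a burst of 10 (illustrative
	// values). Requests beyond the budget block in the client instead of
	// piling onto the API server, so a runaway loop degrades only itself.
	cfg.QPS = 5
	cfg.Burst = 10

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Paginated list: bound how much work each request asks the API server
	// to do, independent of cluster size.
	nodes, err := client.CoreV1().Nodes().List(context.Background(),
		metav1.ListOptions{Limit: 500})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("first page: %d nodes", len(nodes.Items))
}
```

Client-side limits only help for consumers that set them, which is why the takeaways below also call for server-side throttling.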

What to take from this incident.

01
Deploy observability tooling with the same caution as application code.
Monitoring and telemetry services interact directly with cluster infrastructure and can shape its load profile in ways application code cannot. They should go through the same canary process as production services, with explicit load testing against production-shaped clusters before they reach every cluster.
02
Rate-limit and quota every new Kubernetes API consumer.
The Kubernetes API server can throttle expensive request flows through API Priority and Fairness, and clients built on client-go can cap their own request rate with per-client QPS and burst limits. Configuring those limits — especially on new services — prevents any single workload from saturating the control plane regardless of its configuration.
03
Test workloads at the cluster size you actually run.
Workloads whose cost scales with cluster size, node count, or namespace count behave differently in staging than in production. Load testing the workload on a clone of the largest production cluster, or with a synthetic load that matches it, is the only way to learn the real cost before customers do.
04
Maintain a break-glass path that bypasses the failing control plane.
When Kubernetes itself is the failure mode, kubectl operations do not work. Direct node access, pre-scripted workload removal procedures, or scripted etcd interventions can restore control plane headroom faster than waiting for the API server to come back. Build the procedure and rehearse it before the incident.
05
Treat service discovery as a separate failure domain from scheduling.
When DNS-based service discovery rides the same control plane that schedules pods, a single saturation event takes both at once. Designing the discovery layer to keep answering known endpoints during control plane stress — or to have a fallback — reduces the surface area that goes dark when the API server slows down. A sketch of one such last-known-good fallback follows below.
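Lesson 05's fallback can be sketched generically: remember the last successful resolution for each dependency and serve it when a fresh lookup fails. The service name, timeout, and cache policy below are placeholder assumptions rather than a description of OpenAI's stack; a production version would also need TTLs and staleness limits.

```go
// Minimal sketch of a "last known good" DNS fallback: cache every successful
// lookup and serve the cached addresses when a fresh resolution fails because
// the cluster DNS path is degraded.
package main

import (
	"context"
	"log"
	"net"
	"sync"
	"time"
)

type fallbackResolver struct {
	mu       sync.RWMutex
	resolver *net.Resolver
	cache    map[string][]string // hostname -> last successfully resolved addresses
}

func newFallbackResolver() *fallbackResolver {
	return &fallbackResolver{
		resolver: net.DefaultResolver,
		cache:    make(map[string][]string),
	}
}

// Lookup tries a live DNS query first; on failure it falls back to the most
// recent successful answer, so callers can keep reaching dependencies that
// are still running even while cluster DNS is slow or unavailable.
func (r *fallbackResolver) Lookup(ctx context.Context, host string) ([]string, error) {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()

	addrs, err := r.resolver.LookupHost(ctx, host)
	if err == nil {
		r.mu.Lock()
		r.cache[host] = addrs
		r.mu.Unlock()
		return addrs, nil
	}

	r.mu.RLock()
	cached, ok := r.cache[host]
	r.mu.RUnlock()
	if ok {
		log.Printf("dns for %s failed (%v); using %d cached addresses", host, err, len(cached))
		return cached, nil
	}
	return nil, err // nothing cached yet: surface the failure
}

func main() {
	r := newFallbackResolver()
	// "payments.internal.svc.cluster.local" is a placeholder service name.
	addrs, err := r.Lookup(context.Background(), "payments.internal.svc.cluster.local")
	if err != nil {
		log.Fatal(err)
	}
	log.Println(addrs)
}
```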

Read the original.

OpenAI API, ChatGPT & Sora outage — December 2024
status.openai.com