A telemetry rollout takes down ChatGPT for four hours.
How a telemetry service intended to improve observability instead made every node in every large cluster hammer the Kubernetes API server, why service discovery failed when the control plane saturated, and why the rollback path required the same API server that the rollout had just overwhelmed.
The Kubernetes API server was not in front of ChatGPT users, but it kept the user-facing path alive. OpenAI's production clusters used it to schedule pods, restart unhealthy workloads, and feed the DNS records that applications resolved to find their dependencies. As long as that control plane answered quickly enough, the data plane could keep moving traffic. When it saturated, scheduling stopped, health checks timed out, and service discovery broke at the same time. Existing pods could serve for a while, but only while the routes and dependencies they already knew remained valid.
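The dependency chain is easy to see from inside a pod. Below is a minimal sketch in Go, with a hypothetical Service name: the lookup is answered by cluster DNS (typically CoreDNS), which builds its records from the Service and Endpoints objects it watches on the API server, so the name stays resolvable only as long as that watch stays fresh or cached records remain valid.

```go
// Sketch: how a workload finds a dependency inside the cluster.
// The service name is hypothetical.
package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	// Cluster DNS answers this from records it derives by watching
	// Service/Endpoints objects on the Kubernetes API server. When the
	// API server saturates, the records can no longer be refreshed and
	// lookups survive only as long as caches do.
	addrs, err := net.LookupHost("inference-backend.prod.svc.cluster.local")
	if err != nil {
		log.Fatalf("service discovery failed: %v", err)
	}
	fmt.Println("resolved to:", addrs)
}
```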
On the afternoon of December 11, 2024, OpenAI deployed a new telemetry service across its production clusters. The intent was better observability of the Kubernetes control plane itself. The service was configured so that every node in each cluster executed a set of resource-intensive Kubernetes API operations on a regular interval. The cost of those operations scaled with the size of the cluster. On staging clusters, the cost was small. On the largest production clusters, with thousands of nodes, the aggregate load was enough to saturate the API servers within minutes of the rollout.
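The report does not name the exact API operations, so the sketch below is an assumption for illustration only: a per-node agent (all names hypothetical) that issues a cluster-wide pod LIST on a timer. The shape is what matters. Each node's request costs roughly as much as the cluster is large, so the aggregate load on the API server grows roughly quadratically with node count, which is why staging stayed quiet and production did not.

```go
// Hypothetical per-node telemetry agent, deployed to every node (e.g. as
// a DaemonSet). Not OpenAI's actual code; an illustration of the failure
// shape: per-node work whose cost scales with cluster size.
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Every tick, every node asks the API server to serialize every pod
	// in the cluster. With N nodes and P pods that is O(N*P) work per
	// interval across the fleet: negligible on a small staging cluster,
	// saturating at thousands of nodes.
	for range time.Tick(30 * time.Second) {
		pods, err := client.CoreV1().Pods("").List(context.Background(), metav1.ListOptions{})
		if err != nil {
			log.Printf("list failed: %v", err)
			continue
		}
		log.Printf("observed %d pods", len(pods.Items))
	}
}
```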
The cascade was straightforward once the control plane saturated. Pod scheduling stalled. Health checks took too long to complete. DNS-based service discovery, which depended on the same Kubernetes control plane, broke for the rest of the workloads in the cluster. Applications could not find their dependencies. ChatGPT began returning errors at 23:16 UTC. The API returned 503s. Sora stopped accepting requests. Engineering identified the deployment as the trigger within minutes and decided to roll it back.
The rollback was where the incident stopped being short. Removing a Kubernetes workload normally means asking the API server to do it. The API server was the thing that had failed. Standard kubectl commands could not get through, or got through and timed out. The team had to find direct access paths that bypassed the API server to remove the telemetry workload from the largest clusters by hand. As the offending operations stopped, the API server load came down and service discovery began to return. Pods started scheduling again, and the backlog of pending restarts drained gradually.
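OpenAI has not described the exact break-glass mechanism, so the sketch below shows just one plausible shape for a direct path: deleting the workload's stored objects in etcd, the datastore the API server sits in front of. Every specific here is hypothetical (the endpoint, the monitoring namespace, the telemetry-agent name), and a real invocation would also need etcd's TLS client certificates.

```go
// Hypothetical break-glass path: delete a workload's objects directly in
// etcd, bypassing a saturated API server. All names are illustrative.
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// TLS configuration omitted for brevity; production etcd requires it.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-0.internal:2379"}, // hypothetical
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Kubernetes stores objects under /registry/<resource>/<namespace>/<name>.
	// Deleting the DaemonSet and its pods removes the source of truth; the
	// deletions propagate to kubelets as the API server comes back.
	for _, key := range []string{
		"/registry/daemonsets/monitoring/telemetry-agent",
		"/registry/pods/monitoring/telemetry-agent-", // its pods, by name prefix
	} {
		resp, err := cli.Delete(ctx, key, clientv3.WithPrefix())
		if err != nil {
			log.Fatalf("delete %s: %v", key, err)
		}
		log.Printf("deleted %d key(s) under %s", resp.Deleted, key)
	}
}
```

A caveat worth stating: editing etcd behind the API server's back is a last resort, which is exactly the point. The normal, safe removal path was the one that had failed.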
By 03:00 UTC on December 12, ChatGPT and the API were serving traffic again. Full resolution came at 03:38 UTC, four hours and twenty-two minutes after the first errors. A telemetry service was the trigger, but the durable lesson is the coupling: the workload's cost scaled with cluster size, a property staging could not reproduce; the API server had no per-workload rate limit to protect itself; and the rollback path ran through the system that had failed. Any one of those conditions might have been recoverable in minutes. Together they turned a monitoring deployment into a four-hour outage.
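The rate-limit gap has concrete fixes on both sides of the connection. Server-side, Kubernetes API Priority and Fairness can cap a single workload's share of the API server. Client-side, client-go ships a token-bucket limiter configured through rest.Config; the sketch below, with numbers chosen purely for illustration, shows the guard an agent like this could have carried with it.

```go
// Sketch: a self-throttling client. client-go enforces a token bucket on
// every request made through this config; the values are illustrative.
package main

import (
	"log"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func newThrottledClient() (*kubernetes.Clientset, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	cfg.QPS = 2   // steady-state requests per second from this process
	cfg.Burst = 4 // brief bursts allowed above the steady rate
	return kubernetes.NewForConfig(cfg)
}

func main() {
	if _, err := newThrottledClient(); err != nil {
		log.Fatal(err)
	}
	log.Println("client capped at 2 QPS, burst 4")
}
```

A client-side cap would not have saved a cluster on its own, but it bounds the blast radius of any single misconfigured agent, which is the failure class this incident belongs to.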
A new telemetry service configuration that unexpectedly generated massive Kubernetes API load across large clusters, overwhelming the control plane and breaking DNS-based service discovery.
// OpenAI, December 11, 2024 incident report
From the first signal to all-clear in 4h 22m.
A telemetry workload that treated the control plane as a data plane.
OpenAI deployed a new telemetry service across its production Kubernetes clusters to collect detailed control plane metrics. The service's configuration caused every node in each cluster to execute resource-intensive Kubernetes API operations whose cost scaled with the size of the cluster. On OpenAI's largest clusters, with thousands of nodes, the aggregate load overwhelmed the API servers. The control plane went from healthy to saturated within minutes of the deployment.
Once the API servers were saturated, the rest of the system unraveled in predictable ways. Pod scheduling stalled. Health checks could not complete. DNS-based service discovery, which depended on the same control plane, broke. The application layer hollowed out faster than the control plane issue could be diagnosed: pods failing health checks did not get restarted, and services that needed to resolve new endpoints could not. ChatGPT, the API, and Sora went offline.
The rollback path went through the same control plane that had just failed. Removing a Kubernetes workload normally means asking the API server to do it; with the API server saturated, the normal path was not available. The team had to reach for direct access paths that bypassed the API server to remove the telemetry workload from the largest clusters; only then could the control plane recover. The time spent finding and using those break-glass paths is most of why an incident detected in minutes took more than four hours to clear.