Vishal Chandra
Mar 28, 2023

Key lessons from scaling Kubernetes to 7500 nodes for AI training

Training LLMs with billions of parameters requires large infrastructure and coordination of code across thousands of machines. OpenAI recently shared how they scaled a Kubernetes cluster to 7,500 nodes to address this challenge.


Twelve key lessons from their experience, specific to AI training, are as follows:

1. Single pod on entire Node

A large machine learning job spans many nodes and runs most efficiently with access to all of the hardware resources on each node. This allows GPUs to cross-communicate directly using NVLink, or GPUs to directly communicate with the NIC using GPUDirect. So for most of the workloads, a single pod occupies the entire node.
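As a sketch of this pattern, the pod below requests every GPU on an 8-GPU node, so nothing else can be scheduled alongside it. The names, image, and resource numbers are illustrative, not OpenAI's actual configuration:

```yaml
# Hypothetical sketch: a training pod that claims all 8 GPUs on a node,
# effectively reserving the entire node for this workload.
apiVersion: v1
kind: Pod
metadata:
  name: training-worker            # illustrative name
spec:
  containers:
    - name: trainer
      image: example.com/trainer:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 8        # all GPUs; assumes the NVIDIA device plugin is installed
          cpu: "90"                # near-whole-node CPU, leaving headroom for system daemons
          memory: 700Gi
```

Because no other GPU pod can fit once all GPUs are claimed, this request pattern gives the job exclusive use of NVLink and the NICs on that node.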

2. Pods should be considered as semi-stateful

The biggest jobs run MPI, and all pods within the job are participating via a single MPI communicator. If any of the participating pods die, the entire job halts and needs to be restarted. The job does checkpoint regularly and when restarted it resumes from the last checkpoint. So pods should be considered semi-stateful and pod replacement should be minimised since it requires restart from the last checkpoint.
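A minimal sketch of the restart-from-checkpoint pattern is below. The script name, flag, and checkpoint path are hypothetical; the point is that the pod is never restarted in place, and a fresh pod resumes from shared checkpoint storage:

```yaml
# Hypothetical sketch: a worker that resumes from the latest checkpoint.
# Script, flags, and paths are illustrative, not OpenAI's.
apiVersion: batch/v1
kind: Job
metadata:
  name: mpi-worker
spec:
  backoffLimit: 0                  # one pod failure fails the job; a controller resubmits it
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: example.com/trainer:latest
          command: ["python", "train.py", "--resume-from", "/checkpoints/latest"]
          volumeMounts:
            - name: checkpoints
              mountPath: /checkpoints
      volumes:
        - name: checkpoints
          persistentVolumeClaim:
            claimName: checkpoints-pvc   # hypothetical shared checkpoint store
```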

3. IP based networking seems to work better

Flannel, a layer 3 network fabric for Kubernetes, does not scale throughput well as the number of pods increases. Native pod networking technologies work better and allow you to get host-level network throughput on the pods. Rather than using routes, direct IP-based networking seems to work better. Another way to look at this is that if there is a single pod on most nodes, networking is effectively between the nodes.
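One way to get host-level throughput in the one-pod-per-node setup is host networking, sketched below (a simplification; native CNI plugins that assign routable pod IPs are another route):

```yaml
# Hypothetical sketch: with one pod per node, host networking gives the pod
# the node's own IP and full NIC throughput, bypassing any overlay network.
apiVersion: v1
kind: Pod
metadata:
  name: training-worker
spec:
  hostNetwork: true                    # pod shares the node's network namespace
  dnsPolicy: ClusterFirstWithHostNet   # keeps cluster DNS resolution working with hostNetwork
  containers:
    - name: trainer
      image: example.com/trainer:latest   # placeholder image
```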

4. API errors indicate bigger problems

Kubernetes API Servers and etcd are critical components of a healthy working cluster. It is useful to set alerts on the rate of HTTP status 429 (Too Many Requests) and 5xx (Server Error) responses from the API Servers, as they are a high-level signal of problems.
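Such alerts might look like the following Prometheus rules, assuming the standard `apiserver_request_total` metric is being scraped (thresholds are illustrative and should be tuned per cluster):

```yaml
# Hypothetical alerting rules on API server error rates.
groups:
  - name: apiserver-health
    rules:
      - alert: APIServerThrottling
        expr: sum(rate(apiserver_request_total{code="429"}[5m])) > 1
        for: 5m
        annotations:
          summary: "API server is returning 429s (requests are being throttled)"
      - alert: APIServerErrors
        expr: sum(rate(apiserver_request_total{code=~"5.."}[5m])) > 1
        for: 5m
        annotations:
          summary: "API server is returning 5xx errors"
```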

5. Run API servers outside the cluster

Both etcd and the API servers run on their own dedicated nodes, outside the cluster they serve. OpenAI's largest clusters run 5 API servers and 5 etcd nodes to spread the load and minimize impact if one were to ever go down. There has been no notable trouble with etcd since splitting out Kubernetes Events into their own etcd cluster.

6. Observability events can add up quickly

One big strain on the API Servers was WATCHes on Endpoints. Whenever a node was added to or removed from the cluster, this WATCH would fire. And because each node was typically itself watching the kubelet service via kube-proxy, the number and bandwidth of these responses scaled as N² and became enormous, occasionally 1 GB/s or more. EndpointSlices, launched in Kubernetes 1.17, were a huge benefit that brought this load down 1000x.

7. Caching can help avoid cluster-wide bottlenecks

On occasion there may be API Server requests that scale with the size of the cluster, but these should be minimised. For example, avoid having any DaemonSets interact with the API Server. In cases where you do need each node to watch for changes, introducing an intermediary caching service, such as the Datadog Cluster Agent, seems to be a good pattern to avoid cluster-wide bottlenecks.

8. Autoscaling should be more gradual

Once cluster size is stable there are usually not many issues, but sometimes it is possible to autoscale too much at once. There are many requests generated when a new node joins a cluster, and adding hundreds of nodes at once can overload API server capacity. Smoothing this out, even just by a few seconds, will help avoid outages.

9. Monitoring doesn't scale effectively

For a large cluster, metrics data gets huge, and Prometheus could be made to crash by heavy Grafana API requests. Patching Prometheus to wrap the API request handling in a Context with a timeout fixes this.

With that patch Prometheus crashes much less, but when it does and a restart is needed, WAL replay remains an issue. It can take many hours to replay all the write-ahead log entries before Prometheus starts collecting new metrics and servicing queries again. With help from Robust Perception, OpenAI found that setting GOMAXPROCS=24 brought a big improvement: Prometheus tries to use all cores during WAL replay, and on servers with a large number of cores, the contention kills performance.
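The GOMAXPROCS setting described above can be applied via the standard Go runtime environment variable, for example in the Prometheus Deployment (a sketch; the surrounding manifest is simplified):

```yaml
# Sketch: capping Go's scheduler parallelism for Prometheus via GOMAXPROCS,
# which the Go runtime honors, limiting the cores contended during WAL replay.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:latest   # tag is illustrative
          env:
            - name: GOMAXPROCS
              value: "24"
```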

10. Automated healthchecks to detect issues with GPU in Nodes

GPUs exhibit problems a number of different ways, but an easy common one is an “Uncorrectable ECC error.”  Nvidia’s Data Center GPU Manager (DCGM) tools make it easy to query for this and a number of other “Xid” errors. But this is not sufficient, since not all GPU problems manifest as error codes visible through DCGM.

OpenAI solved this by building its own library of tests that exercise GPUs to catch additional problems and ensure that the hardware and driver are behaving as expected. The downside is that these tests can't be run in the background: they require exclusive use of a GPU for several seconds or minutes.
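One way to run such exclusive-GPU checks is as a Kubernetes Job that claims a GPU for the duration of the diagnostic. The sketch below uses Nvidia's `dcgmi diag` as the test; the image tag and diagnostic level are illustrative, and OpenAI's own test library is not public:

```yaml
# Hypothetical sketch: a GPU diagnostic Job that claims one GPU exclusively,
# so the test never shares hardware with training work.
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-healthcheck
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: dcgm-diag
          image: nvcr.io/nvidia/cloud-native/dcgm:latest   # placeholder tag
          command: ["dcgmi", "diag", "-r", "3"]            # medium-depth DCGM diagnostic run
          resources:
            limits:
              nvidia.com/gpu: 1    # exclusive use of one GPU for the test's duration
```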

11. Resource sharing for workloads by different teams

OpenAI runs a service in each cluster, “team-resource-manager” that has multiple functions. Its data source is a ConfigMap that specifies tuples of (node selector, team label to apply, allocation amount) for different teams that have capacity in a given cluster.

Using taints allows them to constrain the Kubernetes pod scheduler flexibly, such as allowing an "any" toleration on lower-priority pods, which lets teams borrow each other's capacity without requiring heavyweight coordination.
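The pattern can be sketched as follows: each team's nodes carry a team taint, and a low-priority "borrower" pod carries a toleration with `operator: Exists` and no key, which matches every taint. Key names, the PriorityClass, and the image are hypothetical:

```yaml
# Hypothetical sketch of the borrow-capacity pattern.
# A team's nodes would be tainted with something like:
#   kubectl taint nodes <node> example.com/team=team-a:NoSchedule   # key is illustrative
apiVersion: v1
kind: Pod
metadata:
  name: borrower
spec:
  priorityClassName: low-priority   # assumes such a PriorityClass exists
  tolerations:
    - operator: Exists              # the "any" toleration: tolerates every taint
  containers:
    - name: work
      image: example.com/trainer:latest
```

Because the borrower pod is low priority, it can be preempted as soon as the owning team needs its capacity back.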

A different issue arises if two experiments each request 100% of the cluster's capacity: instead of scheduling all of one experiment or the other, Kubernetes might schedule only half of each experiment's pods, leading to a deadlock where neither experiment can make progress. OpenAI solved this with the help of the Coscheduling plugin.
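With the coscheduling plugin from the Kubernetes scheduler-plugins project, pods are grouped into a PodGroup and scheduled all-or-nothing. A sketch, with illustrative names and a member count chosen for the example:

```yaml
# Hypothetical sketch using the scheduler-plugins coscheduling API:
# the experiment's pods are only placed if all minMember of them fit at once.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: experiment-a
spec:
  minMember: 128          # all-or-nothing: schedule only if 128 pods can be placed
---
apiVersion: v1
kind: Pod
metadata:
  name: experiment-a-worker-0
  labels:
    scheduling.x-k8s.io/pod-group: experiment-a   # group membership label used by the plugin
spec:
  schedulerName: scheduler-plugins-scheduler      # name depends on how the plugin is deployed
  containers:
    - name: trainer
      image: example.com/trainer:latest
```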

12. Cluster auto-scaling needs buffer workloads

The challenge with cluster-autoscaler is that if it sees idle nodes, it will attempt to scale down to only needed capacity. For multiple reasons (VM spin up latency, pre-allocated costs, the API server impacts mentioned above) this idle-scaling isn’t ideal.

OpenAI solved this by creating a Deployment that contains a ReplicaSet with “max size” number of low-priority pods. These pods occupy resources within a node, so the autoscaler doesn’t consider them as idle. However since they’re low priority, the scheduler can evict them immediately to make room for actual work.
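This "balloon" pattern can be sketched as a negative-priority PriorityClass plus a Deployment of pause pods sized to roughly fill a node. Replica counts and resource numbers are illustrative:

```yaml
# Hypothetical sketch of the balloon pattern: negative-priority pause pods keep
# nodes looking busy to the autoscaler, yet are evicted instantly for real work.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: balloon-priority
value: -1000                  # below every real workload
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: balloon
spec:
  replicas: 100               # roughly the "max size" of the node pool
  selector:
    matchLabels:
      app: balloon
  template:
    metadata:
      labels:
        app: balloon
    spec:
      priorityClassName: balloon-priority
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "80"       # sized to roughly fill one node; numbers are illustrative
              memory: 600Gi
```

When a real job's pods arrive at higher priority, the scheduler preempts the balloon pods, and the displaced replicas going Pending is what triggers the autoscaler to add nodes.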

A related point is the use of pod anti-affinity to ensure these pods distribute evenly across the nodes; its scheduler performance improved significantly in Kubernetes 1.18.
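The spreading can be expressed as a fragment inside the balloon Deployment's pod template (label names are illustrative):

```yaml
# Hypothetical sketch: preferred pod anti-affinity spreads balloon pods across
# nodes instead of piling them onto a few.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: balloon
          topologyKey: kubernetes.io/hostname   # at most one per node, when possible
```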

Can we get to 100,000 Nodes?

Not all problems are fully solved, however: in particular, metrics collection needs to scale better, and pod network traffic needs better shaping and efficiency.

Kubernetes itself needs to be improved so that it can scale efficiently toward 100,000 nodes, without having to switch to lightweight clients like those supported by some edge-oriented versions of Kubernetes. The strong focus on scaling AI creates the necessity for improving this technology.

You can read the whole article for a deep dive into the details here: