Training LLMs with billions of parameters requires large infrastructure and coordination of code across thousands of machines. OpenAI recently shared how they scaled a Kubernetes cluster to 7,500 nodes to address this challenge.
Twelve key lessons from their experience, specific to AI training, are as follows:
A large machine learning job spans many nodes and runs most efficiently with access to all of the hardware resources on each node. This allows GPUs to cross-communicate directly using NVLink, or GPUs to communicate directly with the NIC using GPUDirect. So for most workloads, a single pod occupies the entire node.
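A minimal sketch of such a pod spec, assuming an 8-GPU node (the pod name, image, and resource counts are illustrative, not from the original article):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-worker-0            # hypothetical name
spec:
  hostNetwork: true                  # share the node's network stack for full NIC throughput
  containers:
  - name: trainer
    image: example.com/trainer:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 8            # claim every GPU so no other pod lands on this node
```

Because the pod requests all GPUs on the node, the scheduler cannot place a second GPU workload there, giving the job exclusive use of NVLink and the NIC.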
The biggest jobs run MPI, and all pods within a job participate via a single MPI communicator. If any of the participating pods die, the entire job halts and needs to be restarted. The job checkpoints regularly, and when restarted it resumes from the last checkpoint. Pods should therefore be considered semi-stateful, and pod replacement should be minimized, since each replacement forces a restart from the last checkpoint.
Flannel, a layer 3 network fabric for Kubernetes, does not scale throughput as the number of pods grows. Native pod networking technologies work better and deliver host-level network throughput to the pods. Direct IP-based networking works better than route-based networking. Another way to look at this: with a single pod on most nodes, networking is effectively between the nodes.
Kubernetes API servers and etcd are critical components of a healthy working cluster. It is useful to set alerts on the rate of HTTP status 429 (Too Many Requests) and 5xx (Server Error) responses on the API servers, as they are a high-level signal of problems.
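One way to express such alerts as Prometheus alerting rules (a sketch: the thresholds, group name, and severities are assumptions; `apiserver_request_total` is the standard API server request metric, labeled by response `code`):

```yaml
groups:
- name: apiserver-health            # illustrative rule group
  rules:
  - alert: APIServerTooManyRequests
    expr: sum(rate(apiserver_request_total{code="429"}[5m])) > 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: API servers are returning HTTP 429 (clients are being throttled)
  - alert: APIServerErrors
    expr: sum(rate(apiserver_request_total{code=~"5.."}[5m])) > 1
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: API servers are returning 5xx errors
```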
Both etcd and the API servers run on their own dedicated nodes. OpenAI's largest clusters run 5 API servers and 5 etcd nodes to spread the load and minimize impact if one were ever to go down. There has been no notable trouble with etcd since splitting Kubernetes Events out into their own etcd cluster.
One big strain on the API servers was WATCHes on Endpoints. Whenever a node was added to or removed from the cluster, this WATCH would fire. And because each node was itself typically watching the kubelet service via kube-proxy, the number and bandwidth of these responses scaled as N² and became enormous, occasionally 1 GB/s or more. EndpointSlices, launched in Kubernetes 1.17, were a huge benefit that brought this load down 1000x.
On occasion there may be API server requests that scale with the size of the cluster; these should be minimized. For example, avoid having any DaemonSets interact with the API server. In cases where you do need each node to watch for changes, introducing an intermediary caching service, such as the Datadog Cluster Agent, seems to be a good pattern to avoid cluster-wide bottlenecks.
Once cluster size is stable there are usually not many issues, but it is possible to autoscale too much at once. A new node joining the cluster generates many requests, and adding hundreds of nodes at once can overload API server capacity. Smoothing this out, even just by a few seconds, helps avoid outages.
For a large cluster, metrics data gets huge, and heavy Grafana API queries can cause Prometheus to crash. Patching Prometheus to wrap API requests in a Context with an enforced timeout fixes this.
With that fix, Prometheus crashes much less, but when it does crash and needs a restart, write-ahead-log (WAL) replay remains an issue. It can take many hours to replay all WAL logs before Prometheus starts collecting new metrics and servicing queries. With help from Robust Perception, OpenAI found that applying GOMAXPROCS=24 brought a big improvement. Prometheus tries to use all cores during WAL replay, and for servers with a large number of cores, the contention kills performance.
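The GOMAXPROCS cap can be applied as an environment variable on the Prometheus container; a minimal fragment of such a pod template (the image tag and container name are illustrative):

```yaml
# Fragment of a Prometheus StatefulSet/Deployment pod template
containers:
- name: prometheus
  image: prom/prometheus:latest   # placeholder tag
  env:
  - name: GOMAXPROCS              # cap Go worker threads to reduce contention during WAL replay
    value: "24"
```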
GPUs exhibit problems a number of different ways, but an easy common one is an “Uncorrectable ECC error.” Nvidia’s Data Center GPU Manager (DCGM) tools make it easy to query for this and a number of other “Xid” errors. But this is not sufficient, since not all GPU problems manifest as error codes visible through DCGM.
OpenAI solved this by building its own library of tests that exercise GPUs to catch additional problems and ensure that the hardware and driver are behaving as expected. The downside is that these tests can't run in the background: they require exclusive use of a GPU for several seconds or minutes.
OpenAI runs a service in each cluster, “team-resource-manager” that has multiple functions. Its data source is a ConfigMap that specifies tuples of (node selector, team label to apply, allocation amount) for different teams that have capacity in a given cluster.
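The exact schema of that ConfigMap is not public; a hedged sketch of what such a tuple list might look like (all names, selectors, and amounts below are hypothetical):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: team-resource-manager-config   # hypothetical name
data:
  # Each entry: (node selector, team label to apply, allocation amount).
  teams.yaml: |
    - nodeSelector: "node.example.com/gpu-pool=a100"
      teamLabel: "team=research-alpha"
      allocation: 128
    - nodeSelector: "node.example.com/gpu-pool=a100"
      teamLabel: "team=research-beta"
      allocation: 64
```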
Using taints allows them to constrain the Kubernetes pod scheduler flexibly, such as allowing an "any" toleration for lower-priority pods, which lets teams borrow each other's capacity without requiring heavyweight coordination.
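For example, if a team's nodes carry a taint such as `team=research-alpha:NoSchedule` (the taint key and value here are illustrative), a lower-priority pod can carry a wildcard toleration on that key and schedule onto any team's nodes:

```yaml
# On the node: kubectl taint nodes <node> team=research-alpha:NoSchedule
# In the borrowing pod's spec:
tolerations:
- key: "team"
  operator: "Exists"    # tolerate any value of the team taint
  effect: "NoSchedule"
```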
A different issue arises when two experiments each request 100% of the cluster's capacity: instead of scheduling all of one experiment or the other, Kubernetes might schedule only half of each experiment's pods, leading to a deadlock where neither experiment can make progress. OpenAI solved this with the help of the Coscheduling plugin: https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/coscheduling/README.md
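With the coscheduling plugin, an all-or-nothing gang can be sketched roughly as follows (the experiment name and member count are illustrative, and the PodGroup API group/version and label key can differ between plugin releases):

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: experiment-a          # hypothetical experiment name
spec:
  minMember: 128              # schedule only if all 128 pods can be placed
---
# Each pod of the experiment joins the group via a label on its metadata:
#   labels:
#     scheduling.x-k8s.io/pod-group: experiment-a
```

The scheduler then holds the whole group back until `minMember` pods fit, so two competing experiments can no longer each grab half the cluster.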
The challenge with cluster-autoscaler is that if it sees idle nodes, it will attempt to scale down to only the needed capacity. For multiple reasons (VM spin-up latency, pre-allocated costs, the API server impacts mentioned above), this idle-scaling isn't ideal.
OpenAI solved this by creating a Deployment whose ReplicaSet holds a "max size" number of low-priority pods. These pods occupy resources within a node, so the autoscaler doesn't consider the node idle. But since they're low priority, the scheduler can evict them immediately to make room for actual work.
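A hedged sketch of such a "balloon" Deployment (the class name, priority value, replica count, and resource request are all assumptions, and would need tuning against the autoscaler's expendable-pods priority cutoff):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: balloon-priority      # hypothetical name
value: -1                     # low, so any real workload preempts these pods
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: balloon
spec:
  replicas: 100               # the "max size" of the cluster; illustrative count
  selector:
    matchLabels: {app: balloon}
  template:
    metadata:
      labels: {app: balloon}
    spec:
      priorityClassName: balloon-priority
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9   # does nothing; just holds resources
        resources:
          requests:
            cpu: "1"          # sized so the node does not look idle
```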
A related point is the use of pod anti-affinity to ensure that pods distribute evenly across nodes; its scheduling performance improved significantly in Kubernetes 1.18.
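Such spreading can be expressed in a pod template roughly as follows (the job label is hypothetical):

```yaml
# Pod template fragment: at most one pod of this job per node
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          job: experiment-a              # hypothetical job label
      topologyKey: kubernetes.io/hostname
```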
Not all problems are fully solved, however. In particular, metrics collection needs to scale better, and pod network traffic needs better shaping and efficiency.
Kubernetes needs further improvement so that it can scale efficiently to even 100,000 nodes, without having to switch to lightweight clients or similar mechanisms supported by some edge distributions of Kubernetes. The strong focus on scaling AI creates the necessity for improving this technology.
You can read the whole article for a deep dive into the details here: https://openai.com/research/scaling-kubernetes-to-7500-nodes
Hierarchical Consensus by Protocol Labs
Hierarchical consensus is a novel approach to blockchain scaling that centers on the concept of subnets, which are organized hierarchically and can be spawned on demand to manage new state.
Handy terminal commands
1. To approve opening an app on macOS that Apple is not allowing you to open. The error you will see: "App name" is damaged and cannot be opened. You should move it to the bin. Command to allow the app to be opened:

xattr -d com.apple.quarantine /path/to/app.app

IMP: do this only for trusted apps.