Distributed Training with NUMA-aligned GPUs and NICs

This example runs a PyTorch DistributedDataParallel (DDP) benchmark over NCCL and reports Model FLOPs Utilization (MFU). It follows the DRA + MPIJob pattern and compares NUMA-aligned vs NUMA-unaligned NIC placement for a 4 GPU / 4 NIC run on H100 nodes.

The full source, including the launcher scripts, the benchmark, and both MPIJob manifests, lives at examples/distributed_training.

Why NUMA alignment matters

On a multi-socket GPU node, both the GPUs and the RDMA NICs attach to specific NUMA nodes via the PCIe topology. When a GPU and the NIC it uses for inter-node communication sit on the same NUMA node, traffic stays local to that socket (NODE path in nvidia-smi topo -m). When they sit on different NUMA nodes, traffic crosses the inter-socket interconnect (SYS path), adding latency and reducing bandwidth.

This example demonstrates the effect by allocating the same four GPUs in both runs and varying only which four NICs are bound to them. DRANET makes this controllable because it publishes per-device attributes (including rdma and numaNode) that you select against in a ResourceClaimTemplate using CEL.

DeviceClass

apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: dranet.net
spec:
  selectors:
  - cel:
      expression: device.driver == "dra.net"

ResourceClaimTemplates

Both templates request the same four GPUs (selected by pciBusID) and four NICs requiring rdma == true, differing only in the NIC numaNode:

apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: h100-4gpu-4nic-numa-aligned
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
          count: 4
          selectors:
          - cel:
              expression: >-
                device.attributes["resource.kubernetes.io"]["pciBusID"] in
                ["0001:00:00.0", "0002:00:00.0", "0003:00:00.0", "0008:00:00.0"]                
      - name: nic
        exactly:
          deviceClassName: dranet.net
          count: 4
          selectors:
          - cel:
              expression: >-
                device.attributes["dra.net"]["rdma"] == true &&
                device.attributes["dra.net"]["numaNode"] == 0                
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: h100-4gpu-4nic-numa-unaligned
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
          count: 4
          selectors:
          - cel:
              expression: >-
                device.attributes["resource.kubernetes.io"]["pciBusID"] in
                ["0001:00:00.0", "0002:00:00.0", "0003:00:00.0", "0008:00:00.0"]                
      - name: nic
        exactly:
          deviceClassName: dranet.net
          count: 4
          selectors:
          - cel:
              expression: >-
                device.attributes["dra.net"]["rdma"] == true &&
                device.attributes["dra.net"]["numaNode"] == 1

The two MPIJob manifests (mpi-job-4gpu-4nic-numa-aligned.yaml / -unaligned.yaml) and the launcher, worker, and benchmark scripts are in the example directory. The aligned and unaligned manifests are otherwise identical; they differ only in which ResourceClaimTemplate they reference.

Running the example

Install the MPI Operator if the cluster does not already have the MPIJob CRD:

kubectl apply --server-side -k "https://github.com/kubeflow/mpi-operator/manifests/overlays/standalone?ref=v0.7.0"

kubectl apply -k . builds the benchmark ConfigMap and applies the device class and templates. Then apply one of the MPIJobs:

kubectl apply -k .
kubectl apply -f mpi-job-4gpu-4nic-numa-aligned.yaml

Repeat with mpi-job-4gpu-4nic-numa-unaligned.yaml for the unaligned run. Each launcher prints a final MFU_RESULT line.

Results

Cluster: 2 x Standard_ND96isr_H100_v5 workers, 4 H100 GPUs per worker, 4 RDMA NICs per worker, BF16 DDP over NCCL. Full methodology and per-step output are in the example README.

Case	Avg step	TFLOP/s per GPU	MFU
NUMA aligned	55.496 ms	356.62	36.06%
NUMA unaligned	60.951 ms	324.70	32.83%

The aligned run was ~8.95% faster per step and ~9.84% higher MFU, with no change except the NIC NUMA placement chosen by the ResourceClaimTemplate. That gap is the cost of crossing the inter-socket interconnect that NUMA-aware device selection avoids.

Last modified June 25, 2026: docs: add AWS EFA, Azure AKS, distributed training, and NIXL examples to the website (7a13a99)