NIXL KV Transfer

This example demonstrates topologically-aware GPU + RDMA NIC allocation using Kubernetes Dynamic Resource Allocation (DRA). The workload uses NIXL over UCX/RDMA to copy a GPU-resident buffer between two pods running on two GPU nodes. The buffer is sized like an inference KV-cache handoff, so the result isolates the RDMA transfer path used by disaggregated prefill/decode serving without requiring a full vLLM router/model stack.

Both GPUs and NICs are allocated through DRA: GPUs via the gpu.nvidia.com DeviceClass and RDMA NICs via DRANET’s dranet.net DeviceClass. By keeping the same 4-GPU set across runs and changing only the NIC NUMA placement, the example measures how same-NUMA versus cross-NUMA GPU/NIC alignment affects KV transfer bandwidth and latency.

Full source: examples/nixl-kv-transfer.

Tested topology

The included templates were tested on two GPU nodes with:

Resource	Count	Detail
GPU	8 x NVIDIA H100	80 GB HBM3 each
NIC	8 x Mellanox ConnectX VF	RDMA-capable
NUMA nodes	2	4 GPU + 4 NIC per NUMA node

The manifests are intentionally cloud-provider agnostic. The checked-in ResourceClaimTemplate values match the tested 8-GPU H100 topology; adapt the GPU pciBusID and NIC numaNode selectors to your hardware before running on another SKU. Some GPU DRA drivers publish GPU pciBusID but not GPU numaNode, so this example selects GPUs by pciBusID; DRANET publishes NIC numaNode directly.

ResourceClaimTemplates

The core DRA manifests are two ResourceClaimTemplates. Both keep compute fixed (the same 4 GPUs on NUMA 0) and keep the aggregate NIC count fixed (4 NICs). The only intended difference is whether each visible GPU reaches a same-NUMA or remote-NUMA NIC.

Template	GPU selection	NIC selection	Purpose
`h100-4gpu-4nic-numa-aligned`	4 GPUs on NUMA 0	4 NICs on NUMA 0	Same-NUMA GPU/NIC path
`h100-4gpu-4nic-numa-unaligned`	Same 4 GPUs on NUMA 0	4 NICs on NUMA 1	Cross-NUMA GPU/NIC path

resource-claim-template-aligned.yaml selects NUMA-aligned GPUs and NICs:

apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: h100-4gpu-4nic-numa-aligned
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
          count: 4
          selectors:
          - cel:
              expression: 'device.attributes["resource.kubernetes.io"]["pciBusID"] in ["0001:00:00.0","0002:00:00.0","0003:00:00.0","0008:00:00.0"]'
      - name: nic
        exactly:
          deviceClassName: dranet.net
          count: 4
          selectors:
          - cel:
              expression: 'device.attributes["dra.net"]["rdma"] == true && device.attributes["dra.net"]["numaNode"] == 0'

resource-claim-template-unaligned.yaml uses the same GPUs but cross-NUMA NICs (note numaNode == 1):

apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: h100-4gpu-4nic-numa-unaligned
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
          count: 4
          selectors:
          - cel:
              expression: 'device.attributes["resource.kubernetes.io"]["pciBusID"] in ["0001:00:00.0","0002:00:00.0","0003:00:00.0","0008:00:00.0"]'
      - name: nic
        exactly:
          deviceClassName: dranet.net
          count: 4
          selectors:
          - cel:
              expression: 'device.attributes["dra.net"]["rdma"] == true && device.attributes["dra.net"]["numaNode"] == 1'

Run

Apply everything via kustomize. This creates both ResourceClaimTemplates, the nixl-benchmark ConfigMap (generated from nixl_benchmark.py and run_bench.sh), the headless Service, and both pods:

kubectl apply -k .

The pods default to the h100-4gpu-4nic-numa-aligned template. The initiator and target manifests, the headless Service, the benchmark, and the run script are in the example directory. The initiator log contains one RESULT JSON object with avg_GBps, avg_seconds, and the p50/p95/p99 latencies.

Verify allocation

Confirm that only the allocated RDMA devices are visible inside each pod:

kubectl get resourceclaims -o yaml | grep -E 'name:|device:|driver:|request:'
kubectl exec nixl-kv-initiator -- ls /dev/infiniband
kubectl exec nixl-kv-target -- ls /dev/infiniband

In the aligned case the visible NICs are NODE-local to the selected GPUs; in the unaligned case they are SYS/cross-NUMA.

Benchmark results

Observed on the tested 8 x H100 node topology with 1 GiB NIXL WRITE transfers, 20 warmup iterations, and 100 timed iterations per run (three-run mean). Full results are in the example README.

Template	NICs	Avg bandwidth	Avg latency
`h100-4gpu-4nic-numa-aligned`	`mlx5_0`..`mlx5_3`	39.07 GB/s	27.49 ms
`h100-4gpu-4nic-numa-unaligned`	`mlx5_4`..`mlx5_7`	27.54 GB/s	38.99 ms

Same GPUs, same NIC count, same transfer size: same-NUMA GPU/NIC allocation delivers about 1.42x higher bandwidth and about 29.5% lower latency for this transfer. This is the inference KV-cache handoff path, so the bandwidth gap surfaces as decode tail latency under concurrency.

Last modified June 25, 2026: docs: add AWS EFA, Azure AKS, distributed training, and NIXL examples to the website (7a13a99)