<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>User Guides on DRANET</title><link>https://dranet.sigs.k8s.io/docs/user/</link><description>Recent content in User Guides on DRANET</description><generator>Hugo</generator><language>en-us</language><atom:link href="https://dranet.sigs.k8s.io/docs/user/index.xml" rel="self" type="application/rss+xml"/><item><title>Ray on GKE using DRANET</title><link>https://dranet.sigs.k8s.io/docs/user/kuberay/</link><pubDate>Mon, 14 Jul 2025 10:10:40 +0000</pubDate><guid>https://dranet.sigs.k8s.io/docs/user/kuberay/</guid><description>&lt;p>To get started, follow the instructions to create a &lt;a href="https://dranet.sigs.k8s.io/docs/user/gke-rdma">GKE cluster with DRA
support and using DRANET&lt;/a>. It is important to follow those
instructions closely, since there are multiple dependencies on the Kubernetes API
version, the RDMA NCCL installer, and the DRANET component.&lt;/p>
&lt;p>The worker nodes in this configuration are a4-highgpu-8g instances, each equipped with eight NVIDIA B200 GPUs and eight RDMA-capable RoCE NICs.&lt;/p>
&lt;h3 id="deploy-raycluster">Deploy RayCluster&lt;/h3>
&lt;p>Install Ray CRDs and the KubeRay operator:&lt;/p></description></item><item><title>GKE with NVIDIA DRA and DRANET</title><link>https://dranet.sigs.k8s.io/docs/user/nvidia-dranet/</link><pubDate>Fri, 20 Jun 2025 10:10:40 +0000</pubDate><guid>https://dranet.sigs.k8s.io/docs/user/nvidia-dranet/</guid><description>&lt;p>To get started, create a &lt;a href="https://cloud.google.com/kubernetes-engine/docs/how-to/set-up-dra">GKE cluster with DRA
support&lt;/a> and
the corresponding &lt;a href="https://cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute-custom#create-vpcs-and-subnets">VPC and
subnets&lt;/a>.&lt;/p>
&lt;p>The commands should look like the following:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sh" data-lang="sh">&lt;span class="line">&lt;span class="cl">&lt;span class="nv">PROJECT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;gke-dranet&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">CLUSTER&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;dranet-dranet&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">REGION&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;us-west8&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">ZONE&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;us-west8-c&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">GVNIC_NETWORK_PREFIX&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;dranet-gvnic&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;dranet-rdma&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">VERSION&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;1.34&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">gcloud container clusters create &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">CLUSTER&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --cluster-version&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">VERSION&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --enable-multi-networking &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --enable-dataplane-v2 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --no-enable-autorepair &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --no-enable-autoupgrade &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --zone&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">ZONE&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --project&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">PROJECT&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Create a VPC for the additional Google Titanium CPU NIC&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">gcloud compute --project&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">PROJECT&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> networks create &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> &lt;span class="si">${&lt;/span>&lt;span class="nv">GVNIC_NETWORK_PREFIX&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span>-net &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --subnet-mode&lt;span class="o">=&lt;/span>custom
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">gcloud compute --project&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">PROJECT&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> networks subnets create &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> &lt;span class="si">${&lt;/span>&lt;span class="nv">GVNIC_NETWORK_PREFIX&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span>-sub &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --network&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">GVNIC_NETWORK_PREFIX&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span>-net &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --region&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">REGION&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --range&lt;span class="o">=&lt;/span>192.168.0.0/24
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">gcloud compute --project&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">PROJECT&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> firewall-rules create &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> &lt;span class="si">${&lt;/span>&lt;span class="nv">GVNIC_NETWORK_PREFIX&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span>-internal &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --network&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">GVNIC_NETWORK_PREFIX&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span>-net &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --action&lt;span class="o">=&lt;/span>ALLOW &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --rules&lt;span class="o">=&lt;/span>tcp:0-65535,udp:0-65535,icmp &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --source-ranges&lt;span class="o">=&lt;/span>192.168.0.0/16
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Create HPC VPC for the RDMA NICs with 8 subnets.&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">gcloud compute --project&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">PROJECT&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> networks create &lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span>-net &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --network-profile&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">ZONE&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span>-vpc-roce &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --subnet-mode&lt;span class="o">=&lt;/span>custom
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Create subnets for the HPC VPC.&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> N in &lt;span class="k">$(&lt;/span>seq &lt;span class="m">0&lt;/span> 7&lt;span class="k">)&lt;/span>&lt;span class="p">;&lt;/span> &lt;span class="k">do&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> gcloud compute --project&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">PROJECT&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> networks subnets create &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> &lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span>-sub-&lt;span class="nv">$N&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --network&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span>-net &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --region&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">REGION&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --range&lt;span class="o">=&lt;/span>192.168.&lt;span class="k">$((&lt;/span>N+1&lt;span class="k">))&lt;/span>.0/24 &lt;span class="p">&amp;amp;&lt;/span> &lt;span class="c1"># offset to avoid overlap with gvnics&lt;/span>
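&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">done&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">wait&lt;/span> &lt;span class="c1"># let the backgrounded subnet creations finish before creating the node pool&lt;/span>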
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">gcloud container node-pools create dranet-a4 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --cluster &lt;span class="si">${&lt;/span>&lt;span class="nv">CLUSTER&lt;/span>&lt;span class="si">}&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --project &lt;span class="si">${&lt;/span>&lt;span class="nv">PROJECT&lt;/span>&lt;span class="si">}&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --zone &lt;span class="si">${&lt;/span>&lt;span class="nv">ZONE&lt;/span>&lt;span class="si">}&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --node-locations &lt;span class="si">${&lt;/span>&lt;span class="nv">ZONE&lt;/span>&lt;span class="si">}&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --machine-type a4-highgpu-8g &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --accelerator &lt;span class="s2">&amp;#34;type=nvidia-b200,count=8,gpu-driver-version=default&amp;#34;&lt;/span> --num-nodes &lt;span class="s2">&amp;#34;2&amp;#34;&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --additional-node-network &lt;span class="nv">network&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">GVNIC_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-net,subnetwork&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">GVNIC_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-sub &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --additional-node-network &lt;span class="nv">network&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-net,subnetwork&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-sub-0 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --additional-node-network &lt;span class="nv">network&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-net,subnetwork&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-sub-1 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --additional-node-network &lt;span class="nv">network&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-net,subnetwork&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-sub-2 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --additional-node-network &lt;span class="nv">network&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-net,subnetwork&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-sub-3 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --additional-node-network &lt;span class="nv">network&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-net,subnetwork&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-sub-4 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --additional-node-network &lt;span class="nv">network&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-net,subnetwork&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-sub-5 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --additional-node-network &lt;span class="nv">network&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-net,subnetwork&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-sub-6 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --additional-node-network &lt;span class="nv">network&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-net,subnetwork&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-sub-7
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Apply the following DaemonSet to install the RDMA binaries and the NCCL library
on the node. The RDMA binaries are stored in the &lt;code>/home/kubernetes/bin/gib&lt;/code>
directory and the NCCL library is stored in the &lt;code>/home/kubernetes/bin/nvidia/lib64&lt;/code>
directory on the VM:&lt;/p></description></item><item><title>GKE and Cloud TPU v6e (Trillium)</title><link>https://dranet.sigs.k8s.io/docs/user/gke-tpu-performance/</link><pubDate>Tue, 27 May 2025 11:30:40 +0000</pubDate><guid>https://dranet.sigs.k8s.io/docs/user/gke-tpu-performance/</guid><description>&lt;p>If you use TPU Trillium and want to improve the network performance of your Pods, you can balance your network traffic across the VM NICs.&lt;/p>
&lt;p>The &lt;code>ct6e-standard-4t&lt;/code> machine type is backed by two physical NICs. Since the main interface of the VM is shared by all the applications and Pods on the host, you can create two additional vNICs on the VM, each attached to one of the physical NICs, and pass them directly to the Pod. This lets you spread your traffic across both vNICs and use the total capacity of the physical NICs.&lt;/p></description></item><item><title>GKE and GPUDirect RDMA with DRA</title><link>https://dranet.sigs.k8s.io/docs/user/gke-rdma/</link><pubDate>Tue, 27 May 2025 11:30:40 +0000</pubDate><guid>https://dranet.sigs.k8s.io/docs/user/gke-rdma/</guid><description>&lt;p>On Google Cloud A3 Ultra and A4 machine types, you can use GPUDirect RDMA to run distributed AI workloads that require high-performance networking. To get started, create a &lt;a href="https://cloud.google.com/kubernetes-engine/docs/how-to/set-up-dra">GKE cluster with DRA support&lt;/a> and the corresponding &lt;a href="https://cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute-custom#create-vpcs-and-subnets">VPC and subnets&lt;/a> for the RDMA network of the A3 Ultra or A4 node pools. The &lt;code>gcloud&lt;/code> commands should look something like:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sh" data-lang="sh">&lt;span class="line">&lt;span class="cl">&lt;span class="nv">PROJECT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;gke-dranet&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">CLUSTER&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;dranet-dranet&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">REGION&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;us-west8&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">ZONE&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;us-west8-c&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">GVNIC_NETWORK_PREFIX&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;dranet-gvnic&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;dranet-rdma&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">VERSION&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;1.34&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">gcloud container clusters create &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">CLUSTER&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --cluster-version&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">VERSION&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --enable-multi-networking &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --enable-dataplane-v2 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --no-enable-autorepair &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --no-enable-autoupgrade &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --zone&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">ZONE&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --project&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">PROJECT&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Create a VPC for the additional Google Titanium CPU NIC&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">gcloud compute --project&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">PROJECT&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> networks create &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> &lt;span class="si">${&lt;/span>&lt;span class="nv">GVNIC_NETWORK_PREFIX&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span>-net &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --subnet-mode&lt;span class="o">=&lt;/span>custom
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">gcloud compute --project&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">PROJECT&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> networks subnets create &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> &lt;span class="si">${&lt;/span>&lt;span class="nv">GVNIC_NETWORK_PREFIX&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span>-sub &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --network&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">GVNIC_NETWORK_PREFIX&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span>-net &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --region&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">REGION&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --range&lt;span class="o">=&lt;/span>192.168.0.0/24
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">gcloud compute --project&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">PROJECT&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> firewall-rules create &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> &lt;span class="si">${&lt;/span>&lt;span class="nv">GVNIC_NETWORK_PREFIX&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span>-internal &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --network&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">GVNIC_NETWORK_PREFIX&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span>-net &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --action&lt;span class="o">=&lt;/span>ALLOW &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --rules&lt;span class="o">=&lt;/span>tcp:0-65535,udp:0-65535,icmp &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --source-ranges&lt;span class="o">=&lt;/span>192.168.0.0/16
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Create HPC VPC for the RDMA NICs with 8 subnets.&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">gcloud compute --project&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">PROJECT&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> networks create &lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span>-net &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --network-profile&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">ZONE&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span>-vpc-roce &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --subnet-mode&lt;span class="o">=&lt;/span>custom
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Create subnets for the HPC VPC.&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> N in &lt;span class="k">$(&lt;/span>seq &lt;span class="m">0&lt;/span> 7&lt;span class="k">)&lt;/span>&lt;span class="p">;&lt;/span> &lt;span class="k">do&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> gcloud compute --project&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">PROJECT&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> networks subnets create &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> &lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span>-sub-&lt;span class="nv">$N&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --network&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span>-net &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --region&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">REGION&lt;/span>&lt;span class="p">?&lt;/span>&lt;span class="si">}&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --range&lt;span class="o">=&lt;/span>192.168.&lt;span class="k">$((&lt;/span>N+1&lt;span class="k">))&lt;/span>.0/24 &lt;span class="p">&amp;amp;&lt;/span> &lt;span class="c1"># offset to avoid overlap with gvnics&lt;/span>
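&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">done&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">wait&lt;/span> &lt;span class="c1"># let the backgrounded subnet creations finish before creating the node pool&lt;/span>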
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">gcloud container node-pools create dranet-a4 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --cluster &lt;span class="si">${&lt;/span>&lt;span class="nv">CLUSTER&lt;/span>&lt;span class="si">}&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --project &lt;span class="si">${&lt;/span>&lt;span class="nv">PROJECT&lt;/span>&lt;span class="si">}&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --zone &lt;span class="si">${&lt;/span>&lt;span class="nv">ZONE&lt;/span>&lt;span class="si">}&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --node-locations &lt;span class="si">${&lt;/span>&lt;span class="nv">ZONE&lt;/span>&lt;span class="si">}&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --machine-type a4-highgpu-8g&lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --accelerator &lt;span class="s2">&amp;#34;type=nvidia-b200,count=8,gpu-driver-version=default&amp;#34;&lt;/span> --num-nodes &lt;span class="s2">&amp;#34;2&amp;#34;&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --additional-node-network &lt;span class="nv">network&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">GVNIC_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-net,subnetwork&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">GVNIC_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-sub &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --additional-node-network &lt;span class="nv">network&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-net,subnetwork&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-sub-0 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --additional-node-network &lt;span class="nv">network&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-net,subnetwork&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-sub-1 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --additional-node-network &lt;span class="nv">network&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-net,subnetwork&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-sub-2 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --additional-node-network &lt;span class="nv">network&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-net,subnetwork&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-sub-3 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --additional-node-network &lt;span class="nv">network&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-net,subnetwork&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-sub-4 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --additional-node-network &lt;span class="nv">network&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-net,subnetwork&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-sub-5 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --additional-node-network &lt;span class="nv">network&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-net,subnetwork&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-sub-6 &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --additional-node-network &lt;span class="nv">network&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-net,subnetwork&lt;span class="o">=&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">RDMA_NETWORK_PREFIX&lt;/span>&lt;span class="si">}&lt;/span>-sub-7
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Apply the following DaemonSet to install the RDMA binaries and the NCCL library on the node. The RDMA binaries are stored in the &lt;code>/home/kubernetes/bin/gib&lt;/code> directory and the NCCL library is stored in the &lt;code>/home/kubernetes/bin/nvidia/lib64&lt;/code> directory on the VM:&lt;/p></description></item><item><title>MPI Operator on GKE and GPUDirect RDMA</title><link>https://dranet.sigs.k8s.io/docs/user/mpi-operator/</link><pubDate>Tue, 27 May 2025 11:30:40 +0000</pubDate><guid>https://dranet.sigs.k8s.io/docs/user/mpi-operator/</guid><description>&lt;p>Running distributed applications, such as those using the Message Passing Interface (MPI) or NVIDIA&amp;rsquo;s Collective Communications Library (NCCL) for GPU communication, often requires each participating process (or Pod, in Kubernetes terms) to have access to high-speed, low-latency interconnects. Simply sharing a generic network interface among many high-performance jobs can lead to contention, unpredictable performance, and underutilization of expensive hardware.&lt;/p>
&lt;p>The goal is resource compartmentalization: ensuring that each part of your distributed job gets dedicated access to the specific resources it needs – for instance, one GPU and one dedicated RDMA-capable NIC per worker.&lt;/p></description></item><item><title>Interface Configuration</title><link>https://dranet.sigs.k8s.io/docs/user/interface-configuration/</link><pubDate>Sun, 25 May 2025 11:30:40 +0000</pubDate><guid>https://dranet.sigs.k8s.io/docs/user/interface-configuration/</guid><description>&lt;p>To configure network interfaces in DRANET, users can provide custom configurations through the parameters field of a ResourceClaim or ResourceClaimTemplate. This configuration adheres to the NetworkConfig structure, which defines the desired state for network interfaces and their associated routes.&lt;/p>
&lt;h3 id="network-configuration-overview">Network Configuration Overview&lt;/h3>
&lt;p>The primary structure for custom network configuration is NetworkConfig. It encompasses settings for the network interface itself and any specific routes and rules to be applied within the Pod&amp;rsquo;s network namespace.&lt;/p></description></item></channel></rss>