Tools: How to Configure Bare-Metal Kubernetes for GPU Orchestration

To achieve maximum performance for AI inference, machine learning training, and high-performance computing (HPC), deploying workloads on bare-metal servers is the industry standard. Virtualized environments introduce overhead; bare-metal hardware allows direct access to the PCIe bus, letting your NVIDIA GPUs run at full efficiency. This tutorial explains how to configure a bare-metal Kubernetes (K8s) cluster for GPU orchestration. By integrating the NVIDIA Container Toolkit and the Kubernetes Device Plugin, you can automatically schedule, allocate, and manage GPU resources across your containerized workloads.

Prerequisites

Before beginning, ensure your environment meets the following requirements:

- Operating System: Ubuntu 22.04 LTS (Jammy Jellyfish).
- Hardware: A bare-metal server with at least one physical NVIDIA GPU attached.
- Access: Root or sudo privileges.
- Kubernetes: A running K8s cluster (v1.25+) initialized via kubeadm, k3s, or similar, with the kubectl CLI tool configured.
- Container Runtime: containerd installed and running.

Quick Summary / TL;DR

If you need a quick overview of the deployment pipeline:

- Update the Host: Install the proprietary NVIDIA GPU drivers directly on the bare-metal node.
- Install Toolkit: Deploy the NVIDIA Container Toolkit to bridge the GPU with container runtimes.
- Configure Runtime: Modify the containerd configuration to recognize the nvidia runtime class.
- Deploy Plugin: Apply the NVIDIA Device Plugin DaemonSet to your K8s cluster.
- Verify: Deploy a test Pod requesting nvidia.com/gpu resources to confirm successful orchestration.

Step-by-Step Guide

Step 1: Install NVIDIA Drivers on the Host Node

Kubernetes cannot interact with the GPU hardware until the host machine has the correct drivers installed.

Update your package lists and install the necessary build tools:

```bash
sudo apt-get update
sudo apt-get install -y build-essential linux-headers-$(uname -r)
```

Install the recommended NVIDIA driver for your hardware:

```bash
sudo apt-get install -y nvidia-driver-535
```

Reboot the server. Once it is back online, verify the installation by checking the GPU status:

```bash
nvidia-smi
```

Tip: You should see a table showing your GPU UUID, driver version, and CUDA version.

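If you provision nodes with scripts rather than by hand, the Step 1 driver check can be automated by querying nvidia-smi in CSV mode and failing fast when no GPU line comes back. A minimal sketch — the `sample` variable below stands in for real output of `nvidia-smi --query-gpu=name,driver_version --format=csv,noheader`, so this snippet runs anywhere; on a real node you would capture the command's output instead:

```shell
# Illustrative stand-in for `nvidia-smi --query-gpu=name,driver_version --format=csv,noheader`
# on a healthy node (values are examples, not real output):
sample='NVIDIA H100 PCIe, 535.154.05'

# One CSV line per visible GPU; count them and abort provisioning if none appear.
gpu_count=$(printf '%s\n' "$sample" | grep -c ',')
if [ "$gpu_count" -ge 1 ]; then
  echo "driver check passed: $gpu_count GPU(s) visible"
else
  echo "driver check failed" >&2
  exit 1
fi
```

On a live node, replace the `sample` assignment with `sample=$(nvidia-smi --query-gpu=name,driver_version --format=csv,noheader)` to make the check real.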
Step 2: Install the NVIDIA Container Toolkit

The NVIDIA Container Toolkit allows containerd to pass GPU access directly to containers.

Set up the package repository and GPG key:

```bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
```

Update the package index and install the toolkit:

```bash
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
```

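The sed filter in the Step 2 repository pipeline can look cryptic; all it does is inject a `signed-by` option into each `deb` line of the upstream list, so apt verifies the toolkit packages against the keyring installed in the previous command. A standalone demonstration (the `line` variable is an assumed example of one entry from nvidia-container-toolkit.list):

```shell
# One line from the upstream nvidia-container-toolkit.list (shape is illustrative):
line='deb https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /'

# The same sed program used in Step 2: rewrite `deb https://` to pin the
# repository to the dearmored keyring.
rewritten=$(printf '%s\n' "$line" | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g')
echo "$rewritten"
```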
Step 3: Configure containerd for GPU Support

You must explicitly tell containerd to use the NVIDIA runtime so Kubernetes can launch GPU-enabled Pods.

Pro Tip: Configuring container runtimes and compiling drivers on inconsistent hardware can lead to frustrating kernel panics. Starting with a standardized environment — like a pre-configured GPUYard Bare Metal Dedicated Server — ensures you have the unthrottled PCIe lanes and clean OS images necessary to skip hardware debugging and move straight to orchestrating your AI workloads.

Configure the NVIDIA runtime in containerd:

```bash
sudo nvidia-ctk runtime configure --runtime=containerd
```

Open the configuration file and ensure SystemdCgroup = true is set, which modern Kubernetes requires:

```bash
sudo nano /etc/containerd/config.toml
```

Restart containerd to apply the changes:

```bash
sudo systemctl restart containerd
```

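For orientation while editing config.toml in Step 3: after `nvidia-ctk runtime configure --runtime=containerd` runs, the relevant section of the file should look roughly like the excerpt below. Treat this as a reference shape rather than text to paste verbatim — plugin keys and paths vary between containerd versions, and nvidia-ctk writes the stanza for you:

```toml
# /etc/containerd/config.toml (excerpt, containerd config version 2)
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
      SystemdCgroup = true
```

The key things to confirm are the `nvidia` runtime entry pointing at nvidia-container-runtime and `SystemdCgroup = true` under its options.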
Step 4: Deploy the NVIDIA Device Plugin for Kubernetes

The NVIDIA Device Plugin runs as a DaemonSet across your cluster. It continuously reports the node's GPU capacity to the kubelet, allowing the Kubernetes scheduler to track available GPUs.

Apply the official NVIDIA Device Plugin manifest from your master node:

```bash
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.4/nvidia-device-plugin.yml
```

Verify that the DaemonSet pods are running:

```bash
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
```

Check that your node is correctly advertising GPU capacity:

```bash
kubectl describe node <your-node-name> | grep -i nvidia.com/gpu
```

You should see output indicating the exact number of GPUs available for allocation.

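The grep check in Step 4 returns the Capacity, Allocatable, and Allocated-resources lines together; for scripting it helps to reduce that to a bare number. A sketch of one way to do it — the `sample` variable below is a hypothetical stand-in for the grep output on a single-GPU node, so the snippet runs without a cluster:

```shell
# Hypothetical output of `kubectl describe node <node> | grep -i nvidia.com/gpu`
# on a node with one GPU (Capacity, Allocatable, then Allocated resources):
sample='  nvidia.com/gpu:  1
  nvidia.com/gpu:  1
  nvidia.com/gpu  0           0'

# The Capacity and Allocatable lines end the resource name with a colon;
# keep the value from the last such line (Allocatable).
allocatable=$(printf '%s\n' "$sample" | awk '/nvidia.com\/gpu:/ {n=$2} END {print n}')
echo "allocatable GPUs: $allocatable"
```

On a live cluster, `kubectl get node <node> -o jsonpath="{.status.allocatable['nvidia\.com/gpu']}"` returns the same number directly.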
Step 5: Test GPU Allocation with a Pod

Finally, deploy a test workload to confirm that the Kubernetes scheduler grants GPU access to a container.

Create a file named gpu-pod.yaml:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-pod
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-container
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

Apply the configuration:

```bash
kubectl apply -f gpu-pod.yaml
```

Check the Pod's logs to confirm it executed nvidia-smi successfully from inside the K8s cluster:

```bash
kubectl logs gpu-test-pod
```

You have successfully configured a bare-metal Kubernetes environment to recognize, manage, and allocate NVIDIA GPUs. By laying down the host drivers, linking containerd via the NVIDIA Container Toolkit, and orchestrating it all with the K8s Device Plugin, your cluster is now ready to handle intensive AI inference and ML training workloads with zero virtualization overhead.

For enterprise-grade reliability and uncompromised raw computing power, consider deploying your next Kubernetes cluster on GPUYard. Explore our high-performance Bare Metal Dedicated Servers to build a resilient, scalable, and highly available infrastructure tailored specifically for AI orchestration.