Tools: Splitting One GPU Across Multiple Kubernetes Pods — Without MIG, Without Enterprise Licenses - 2025 Update


The Problem I've Been Carrying for Years

Our Homegrown Solution — Reinventing the Wheel (But It Spun)

KubeCon Europe 2026 — Amsterdam

2AM in the Hotel Room — The PoC That Shouldn't Have Happened

What I Actually Built at 2AM — The PoC

Step 1 — Enable GPU Support and Install HAMi

Step 2 — Two Workers Sharing One GPU

The Output That Made Me Pump My Fist at 2AM

The Gotchas — Things That Would Have Wrecked Me Without Debugging

Gotcha #1: nvidia.com/gpu: "1" is mandatory

Gotcha #2: The device-plugin DaemonSet starts with DESIRED: 0

Gotcha #3: bind-phase stuck at "allocating"

HAMi vs Our Homebrew Docker Approach — An Honest Comparison

What Plain Docker Actually Does (and Doesn't Do)

What HAMi Actually Does

The Comparison Table

HAMi's Built-In Monitoring — This Part Surprised Me

Understanding the Limits — Soft vs Hard

What This Means for My Platform

Try It Yourself — Full PoC Files

bogdancstrike / HAMi-kubernetes-gpu-partitioning-demo

HAMi GPU Sharing on MicroK8s

How It Works

Prerequisites

Final Thoughts — From One Infrastructure Nerd to Another

A years-old GPU frustration, a conference discovery, and a 2AM PoC that actually worked.

If you work with AI or video at scale and you're not at one of the big hyperscalers, you've probably hit this wall before: you have GPUs, and you're wasting them. Not because your workloads don't need GPU — they do. But because individually, each workload is small. AI inference services rarely saturate a whole card. Processing jobs spin up, eat some compute, and die. Embedding models, classifiers, lightweight LLMs — they each need a slice of a GPU, not the whole thing. None of them come close to maxing out the hardware on their own. And yet, in a typical Kubernetes setup, each one claims an entire GPU card and sits there hoarding it while the rest goes to waste.

I've been building a platform that runs multiple AI and video processing workloads in parallel — inference services, enrichment pipelines, on-demand processing jobs. The kind of system where a lot of different things need GPU access at the same time, but no single one of them needs a whole card. The stack is K8s, Kafka, Redis, some databases, and a handful of Python and Java services. And GPUs — always the GPUs.

The GPU problem specifically: we have T4 and L40S nodes, and we could never properly share them between pods without playing with fire.

NVIDIA Time-Slicing — Easy to set up via the GPU Operator, and it looks good on paper. In practice, for streaming and transcoding workloads it was a non-starter. Time-slicing serialises GPU access, which introduces jitter and latency spikes — exactly what you cannot have when you're processing live video or audio. Frames drop, buffers stall, quality degrades. We turned it off fast.

Plain Docker with --gpus device=0 — Which I'll get into. We actually used this for a long time, and it worked — sort of.

So we did what any team does when the tooling isn't there: we built it ourselves.
A while back, my team built an internal orchestration layer around a simple reality: Kubernetes GPU support was too coarse for what we needed, so we worked around it.

The split was straightforward in concept: CPU-based tasks ran as K8s pods, GPU-based tasks ran as Docker containers. Everything that didn't need a GPU lived happily in the cluster as proper Kubernetes workloads — ETL pipelines, API services, data processing, the full stack. But the moment a task needed GPU, it stepped outside K8s entirely.

For GPU tasks, we had a purpose-built orchestrator service. This orchestrator had to run on the same node as the GPU — because it talked to the local Docker daemon directly to spin up containers there. We enforced this with node affinity rules, pinning the orchestrator to the GPU node so it could reach the Docker API and launch containers on that specific machine. When a GPU task came in, the orchestrator started a Docker container with --gpus device=N, the task ran, the container was torn down. All GPU-based AI work happened this way — plain Docker containers on the GPU node, completely outside Kubernetes.

It worked. We ran it in production. The team was proud of it — and honestly, it was solid engineering given the constraints. But the problems were always there. We knew it was technical debt. We just couldn't find anything better. Until Amsterdam.

I went to KubeCon this year primarily to answer one question: is there something in the cloud-native ecosystem that handles sub-GPU partitioning on lower-end hardware without requiring H100s?

The talks were good. The hallway track was better. I had conversations with people from platform teams at AI startups, SaaS companies, and a few cloud providers. The picture that emerged was clear — and a little frustrating. The majority weren't even wrestling with this problem. They were on cloud, spinning up GPU instances on demand, scaling out horizontally whenever they needed more compute. GPU sharing?
Why bother when you can just add another node?

But for teams running on-prem or on fixed GPU budgets — and there were more of us in that room than the cloud-native crowd might assume — the story was different. We either wasted GPU resources with whole-GPU-per-pod allocations, paid the H100 tax to get MIG, or built our own solutions. Same wall, different paint.

I attended a session on GPU resource management and heard mentions of several tools — GPU Operator, DRA (Dynamic Resource Allocation, which is still maturing in K8s 1.31/1.32), KAI Scheduler, and then something I hadn't heard of before: HAMi.

There was a session that stopped me mid-scroll: "Dynamic, Smart, Stable GPU-Sharing Middleware In Kubernetes". Five minutes in I had stopped taking notes on anything else. The talk walked through exactly the problem I'd been living with — sub-GPU partitioning on hardware that doesn't support MIG — and presented HAMi as the answer. Software-level vGPU, hard VRAM isolation, any CUDA GPU, K8s native. What made it land even harder was that HAMi had also been mentioned earlier in the keynotes. Not as a footnote — as a legitimate part of the GPU sharing story on Kubernetes.

The city kept us out until midnight. Amsterdam will do that. I said goodbye to everyone, walked back to the hotel, and should have gone straight to sleep — full day of sessions, a lot of walking, and an early morning talk the next day. Instead I opened the laptop.

I'd been turning HAMi over in my head since that session. Not casually — obsessively. I had my MicroK8s home lab accessible remotely. I had a GPU sitting idle. I had all the context from the past year of fighting this exact problem loaded in my head. I genuinely could not wait until I got home to try it. The idea of going to sleep without at least attempting the install felt physically uncomfortable in the way only a very specific kind of engineering nerd will understand.
So there I was, at 2AM Amsterdam time, laptop on the hotel desk, SSH tunnel back home, microk8s helm3 repo add running. Extremely classic. Three hours later I had a working HAMi installation, two pods running on the same physical GPU with separate VRAM slices, and nvidia-smi showing exactly what I'd spent two years trying to achieve. I didn't go to sleep until I saw that output. Totally worth it.

Let me tell you what HAMi actually is, because the name is a bit opaque. HAMi (Heterogeneous AI Computing Virtualization Middleware) — formerly known as k8s-vGPU-scheduler — is a CNCF Sandbox project that provides software-level GPU virtualization for Kubernetes. It works on any CUDA GPU, including your T4s, RTX cards, L40S, and others that don't support MIG.

The core mechanism is elegant: HAMi injects a shared library (libvgpu.so) into each container via LD_PRELOAD. This library intercepts every cudaMalloc call at the CUDA API level. If your pod's cumulative VRAM allocation would exceed its configured limit, HAMi returns CUDA_ERROR_OUT_OF_MEMORY — a hard wall. The pod dies. The other pods sharing that GPU are completely unaffected. This is fundamentally different from every other approach.

I tested this on my home lab machine (NVIDIA GeForce RTX 3080, 10GB VRAM) running MicroK8s. Here's the full stack I set up, with the actual files I used.

One gotcha right away: the gpu=on label is mandatory. The HAMi device-plugin DaemonSet has a nodeSelector that requires it. I lost 20 minutes on this before I understood why DESIRED: 0 wasn't moving.

Here's where it gets interesting. I deployed two PyTorch workloads simultaneously on the same physical GPU, each with hard VRAM limits.

gpu_worker_a.yaml — light workload, 20% VRAM (~2GB), 25% SM cores.
gpu_worker_b.yaml — heavier workload, 30% VRAM (~3GB), 40% SM cores.

And on the host, nvidia-smi showed what I'd been trying to achieve for a long time: two processes, same physical GPU, both running, with separately allocated VRAM slices.
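To make the percentage arithmetic concrete, here is a tiny helper of my own (illustrative only, not a HAMi API) showing how the 20%/30% slices map to hard-wall budgets on a 10GB card:

```python
# Hypothetical helper (not part of HAMi): how a gpumem-percentage limit
# translates into the per-pod hard VRAM wall that HAMi enforces.
def vram_budget_mib(total_vram_mib: int, gpumem_percentage: int) -> int:
    """Return the per-pod VRAM ceiling in MiB for a given percentage slice."""
    return total_vram_mib * gpumem_percentage // 100

# RTX 3080 with 10GB (10240 MiB) of VRAM, as in the PoC:
print(vram_budget_mib(10240, 20))  # worker A -> 2048
print(vram_budget_mib(10240, 30))  # worker B -> 3072
```

That is where the "~2048MB hard wall" and "~3072MB hard wall" comments in the deployment manifests come from.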
No code changes in the application.

This was a 2AM session, so I hit every wall possible. Here are the real ones.

Gotcha #1: HAMi's NVIDIA device counter needs to see nvidia.com/gpu as the entry point. Setting only gpucores and gpumem-percentage without it causes HAMi to skip the pod completely. You'll see "FilteringFailed: does not request any resource" in the scheduler logs, but the pod will still get scheduled (by the fallback default scheduler) — without any VRAM isolation. Check the scheduler logs during pod creation to confirm. You want to see:

"device allocate success" allocate device={"NVIDIA":[{"Usedmem":2048,"Usedcores":25}]}

Gotcha #2: The HAMi device-plugin DaemonSet has nodeSelector: gpu: "on". Without labelling your node, the DaemonSet sits idle and HAMi's CUDA shim never gets injected. You'll think everything is working (pods schedule, run, use GPU) but there's no isolation happening.

Gotcha #3: If pods show hami.io/bind-phase: allocating (not success), the device-plugin wasn't running when the pods were first scheduled. Delete the pods — Kubernetes recreates them, and this time the device-plugin will properly inject the shim.

Having lived with both, here's the real difference.

When you run docker run --gpus device=0, Docker mounts /dev/nvidia0, /dev/nvidiactl, and /dev/nvidia-uvm into the container. That's it. Every container pointed at the same GPU device sees the whole GPU. There is no VRAM wall. If container A runs that script while container B is also on GPU 0 — container B OOMs. Both processes die or degrade. There's no fence between them. The only mitigation available in pure Docker is application-level. That requires modifying application code, applies differently per framework, and can be bypassed accidentally or intentionally.

HAMi injects libvgpu.so via LD_PRELOAD into each container's process. This library wraps every CUDA memory function. When your process calls cudaMalloc(size), HAMi checks your pod's cumulative allocation against its configured limit.
If you'd exceed it, it returns CUDA_ERROR_OUT_OF_MEMORY immediately. No negotiation. The other pods on the same GPU are completely unaffected. Their VRAM slices are spatially isolated — different physical memory pages.

I expected to need to wire up dcgm-exporter, configure Prometheus scrape configs, and build Grafana dashboards from scratch. Instead, HAMi ships two Prometheus metric endpoints out of the box: port :31992 (device-plugin, real-time per-container) and port :31993 (scheduler, allocation view). GPUDeviceSharedNum: 2 — two containers sharing one GPU, confirmed from HAMi's perspective. Wire these to Prometheus with ServiceMonitors (hami_service_monitoring.yaml in the repo) and you have a full observability story.

This is important to get right before you put HAMi in production.

VRAM (gpumem-percentage) — Hard enforcement. HAMi intercepts cudaMalloc in userspace. When your pod exceeds its limit, it gets CUDA_ERROR_OUT_OF_MEMORY. This is deterministic, reliable, and completely isolates the impact to the offending pod.

SM cores (gpucores) — Soft enforcement. HAMi doesn't have a hardware mechanism to limit SM core usage on non-MIG GPUs. Instead, it monitors GPU utilization and injects cudaDeviceSynchronize() + sleep cycles to throttle kernel submissions when a pod exceeds its core budget. This is best-effort — expect ±5-10% deviation from your configured cap. The GPU doesn't enforce this at hardware level.

For most use cases involving multiple AI workloads sharing a GPU, the hard VRAM wall is what matters most. SM throttling is a nice-to-have for fairness but not a safety guarantee. If you need hard SM guarantees, you're in MIG territory — A100/H100 only.

Coming home from Amsterdam with a working HAMi PoC changes the architecture conversation significantly.

Before: two-tier GPU management. Docker API for short-lived containers. K8s for long-lived pods. Homebrew GPU pool tracker. No unified monitoring. No VRAM isolation. Multiple separate failure modes.
After (planned): single K8s cluster. HAMi handles all GPU slicing. Inference pods, processing jobs, and batch workloads all described as K8s Deployments or Jobs with nvidia.com/gpumem-percentage limits. Unified observability via HAMi's Prometheus endpoints. Automatic rescheduling on failure. Namespace quotas per team.

The short-lived GPU job use case specifically — I'm confident HAMi can handle it. On-demand workloads with predictable, bounded VRAM usage are exactly what sub-GPU partitioning is designed for. You can pack several of them onto a single GPU that used to be allocated whole to one process at a time.

Everything I built is in the files below. You need MicroK8s, an NVIDIA GPU, and about 30 minutes.

The repo splits a single physical NVIDIA GPU across multiple Kubernetes pods using HAMi (Heterogeneous AI Computing Virtualization Middleware). No MIG, no hardware partitioning — it works on consumer GPUs. HAMi injects a CUDA shim via LD_PRELOAD into each container; the shim intercepts cudaMalloc and kernel launch calls to enforce per-pod limits. Both pods run truly in parallel on different SMs. Time-slicing only occurs under SM contention. The quick start, once MicroK8s is running with the GPU addon enabled, is in the commands below.

I've been at this GPU sharing problem for some time. MIG was the dream but not the reality for most hardware budgets. Time-slicing was a band-aid. Our homebrew solution was genuinely good engineering but was always technical debt waiting to be paid.

HAMi is the first thing I've found that genuinely plugs the gap — software-level VRAM isolation on commodity GPUs, K8s native, zero application changes, and built-in observability. It's not magic: the SM throttling is soft, the setup requires K8s knowledge, and there's still a ceiling on what you can pack onto a 10GB card. But it's real, it works, and it's an open CNCF project with active development.
The fact that I found it at KubeCon, had a working PoC by 2AM, and was watching two pods cleanly share an RTX 3080 before I went to sleep — that's a pretty good endorsement.

If you're running AI workloads on Kubernetes and you're wasting GPU budget on whole-GPU-per-pod allocations, give HAMi a look. Your platform budget will thank you.
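As a small sanity check of the monitoring endpoints mentioned above, the Prometheus text format they expose is easy to read programmatically. This is an illustrative sketch — the sample payload below is hypothetical (check your own :31992/:31993 /metrics output for the exact labels):

```python
# Illustrative only: a minimal parser for the Prometheus text exposition
# format served by HAMi's metric endpoints (:31992 and :31993).
def parse_prom_text(payload: str) -> dict[str, float]:
    """Map 'name{labels} value' lines to {name: value}, ignoring comments."""
    metrics = {}
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]  # strip the label set, keep the metric name
        metrics[name] = float(value)
    return metrics

# Hypothetical sample of what the scheduler endpoint reported during the PoC:
sample = """
# HELP GPUDeviceSharedNum Number of containers sharing this GPU
GPUDeviceSharedNum{deviceuuid="GPU-0"} 2
"""
print(parse_prom_text(sample)["GPUDeviceSharedNum"])  # 2.0
```

In a real setup you would let Prometheus scrape these endpoints via the ServiceMonitors; this parser is just for poking at the output by hand.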


# GPU tasks — Docker containers on the GPU node, launched via local Docker API container = docker_client.containers.run( image="our-model-server:latest", detach=True, device_requests=[ -weight: 500;">docker.types.DeviceRequest(device_ids=["0"], capabilities=[["gpu"]]) ], environment={ "CUDA_VISIBLE_DEVICES": "0", "PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:512" # hint only, not enforced } ) # GPU tasks — Docker containers on the GPU node, launched via local Docker API container = docker_client.containers.run( image="our-model-server:latest", detach=True, device_requests=[ -weight: 500;">docker.types.DeviceRequest(device_ids=["0"], capabilities=[["gpu"]]) ], environment={ "CUDA_VISIBLE_DEVICES": "0", "PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:512" # hint only, not enforced } ) # GPU tasks — Docker containers on the GPU node, launched via local Docker API container = docker_client.containers.run( image="our-model-server:latest", detach=True, device_requests=[ -weight: 500;">docker.types.DeviceRequest(device_ids=["0"], capabilities=[["gpu"]]) ], environment={ "CUDA_VISIBLE_DEVICES": "0", "PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:512" # hint only, not enforced } ) # CPU tasks — normal K8s pods, no special handling needed pod_manifest = { "apiVersion": "v1", "kind": "Pod", "spec": { "containers": [{ "resources": { "limits": {"cpu": "2", "memory": "4Gi"} } }] } } k8s_client.create_namespaced_pod(namespace="workloads", body=pod_manifest) # CPU tasks — normal K8s pods, no special handling needed pod_manifest = { "apiVersion": "v1", "kind": "Pod", "spec": { "containers": [{ "resources": { "limits": {"cpu": "2", "memory": "4Gi"} } }] } } k8s_client.create_namespaced_pod(namespace="workloads", body=pod_manifest) # CPU tasks — normal K8s pods, no special handling needed pod_manifest = { "apiVersion": "v1", "kind": "Pod", "spec": { "containers": [{ "resources": { "limits": {"cpu": "2", "memory": "4Gi"} } }] } } k8s_client.create_namespaced_pod(namespace="workloads", 
body=pod_manifest) Physical GPU (RTX 3080 — 10GB) ↓ NVIDIA Driver ↓ libvgpu.so ←── HAMi injects this via LD_PRELOAD (intercepts cudaMalloc, enforces per-pod limits) ↓ ┌─────────────┐ ┌─────────────┐ │ Pod A │ │ Pod B │ │ 2GB limit │ │ 3GB limit │ │ 25% cores │ │ 40% cores │ └─────────────┘ └─────────────┘ Physical GPU (RTX 3080 — 10GB) ↓ NVIDIA Driver ↓ libvgpu.so ←── HAMi injects this via LD_PRELOAD (intercepts cudaMalloc, enforces per-pod limits) ↓ ┌─────────────┐ ┌─────────────┐ │ Pod A │ │ Pod B │ │ 2GB limit │ │ 3GB limit │ │ 25% cores │ │ 40% cores │ └─────────────┘ └─────────────┘ Physical GPU (RTX 3080 — 10GB) ↓ NVIDIA Driver ↓ libvgpu.so ←── HAMi injects this via LD_PRELOAD (intercepts cudaMalloc, enforces per-pod limits) ↓ ┌─────────────┐ ┌─────────────┐ │ Pod A │ │ Pod B │ │ 2GB limit │ │ 3GB limit │ │ 25% cores │ │ 40% cores │ └─────────────┘ └─────────────┘ # Enable microk8s GPU addon microk8s -weight: 500;">enable gpu # Install cert-manager (HAMi's webhook needs it) microk8s -weight: 500;">kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.4/cert-manager.yaml microk8s -weight: 500;">kubectl wait --for=condition=ready pod \ -l app.kubernetes.io/instance=cert-manager \ -n cert-manager --timeout=180s # Add HAMi helm repo microk8s helm3 repo add hami-charts https://project-hami.github.io/HAMi/ microk8s helm3 repo -weight: 500;">update # Get K8s version (--short is deprecated in newer -weight: 500;">kubectl) K8S_VERSION=$(microk8s -weight: 500;">kubectl version -o json | python3 -c " import sys, json, re v = json.load(sys.stdin)['serverVersion']['gitVersion'].lstrip('v') print(re.split(r'[+\-]', v)[0]) ") # Install HAMi microk8s helm3 -weight: 500;">install hami hami-charts/hami \ --namespace kube-system \ --set scheduler.kubeScheduler.imageTag=v${K8S_VERSION} \ --set devicePlugin.nvidiaDriverPath=/usr/local/nvidia \ --set scheduler.defaultSchedulerPolicy.gpuMemory=true \ --set 
scheduler.defaultSchedulerPolicy.gpuCores=true # CRITICAL: Label the GPU node — without this, the device-plugin DaemonSet stays at DESIRED: 0 microk8s -weight: 500;">kubectl label node <your-node-name> gpu=on # Enable microk8s GPU addon microk8s -weight: 500;">enable gpu # Install cert-manager (HAMi's webhook needs it) microk8s -weight: 500;">kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.4/cert-manager.yaml microk8s -weight: 500;">kubectl wait --for=condition=ready pod \ -l app.kubernetes.io/instance=cert-manager \ -n cert-manager --timeout=180s # Add HAMi helm repo microk8s helm3 repo add hami-charts https://project-hami.github.io/HAMi/ microk8s helm3 repo -weight: 500;">update # Get K8s version (--short is deprecated in newer -weight: 500;">kubectl) K8S_VERSION=$(microk8s -weight: 500;">kubectl version -o json | python3 -c " import sys, json, re v = json.load(sys.stdin)['serverVersion']['gitVersion'].lstrip('v') print(re.split(r'[+\-]', v)[0]) ") # Install HAMi microk8s helm3 -weight: 500;">install hami hami-charts/hami \ --namespace kube-system \ --set scheduler.kubeScheduler.imageTag=v${K8S_VERSION} \ --set devicePlugin.nvidiaDriverPath=/usr/local/nvidia \ --set scheduler.defaultSchedulerPolicy.gpuMemory=true \ --set scheduler.defaultSchedulerPolicy.gpuCores=true # CRITICAL: Label the GPU node — without this, the device-plugin DaemonSet stays at DESIRED: 0 microk8s -weight: 500;">kubectl label node <your-node-name> gpu=on # Enable microk8s GPU addon microk8s -weight: 500;">enable gpu # Install cert-manager (HAMi's webhook needs it) microk8s -weight: 500;">kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.4/cert-manager.yaml microk8s -weight: 500;">kubectl wait --for=condition=ready pod \ -l app.kubernetes.io/instance=cert-manager \ -n cert-manager --timeout=180s # Add HAMi helm repo microk8s helm3 repo add hami-charts https://project-hami.github.io/HAMi/ microk8s helm3 repo -weight: 
500;">update # Get K8s version (--short is deprecated in newer -weight: 500;">kubectl) K8S_VERSION=$(microk8s -weight: 500;">kubectl version -o json | python3 -c " import sys, json, re v = json.load(sys.stdin)['serverVersion']['gitVersion'].lstrip('v') print(re.split(r'[+\-]', v)[0]) ") # Install HAMi microk8s helm3 -weight: 500;">install hami hami-charts/hami \ --namespace kube-system \ --set scheduler.kubeScheduler.imageTag=v${K8S_VERSION} \ --set devicePlugin.nvidiaDriverPath=/usr/local/nvidia \ --set scheduler.defaultSchedulerPolicy.gpuMemory=true \ --set scheduler.defaultSchedulerPolicy.gpuCores=true # CRITICAL: Label the GPU node — without this, the device-plugin DaemonSet stays at DESIRED: 0 microk8s -weight: 500;">kubectl label node <your-node-name> gpu=on apiVersion: apps/v1 kind: Deployment metadata: name: gpu-worker-a spec: replicas: 1 selector: matchLabels: app: gpu-worker instance: worker-a template: metadata: labels: app: gpu-worker instance: worker-a spec: schedulerName: hami-scheduler # critical — tells K8s to use HAMi's scheduler containers: - name: gpu-worker image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime command: ["python3", "-u", "-c"] args: - | import torch, time, os pod = os.environ.get('POD_NAME', 'worker-a') device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') print(f'[{pod}] device={device} gpu={torch.cuda.get_device_name(0)}', flush=True) # Allocate 1.5GB resident tensor (well within 2GB limit) elements = (1500 * 1024 * 1024) // 4 blob = torch.zeros(elements, dtype=torch.float32, device=device) print(f'[{pod}] VRAM allocated: {torch.cuda.memory_allocated() // 1024**2}MB', flush=True) a = torch.randn(1024, 1024, device=device, dtype=torch.float16) b = torch.randn(1024, 1024, device=device, dtype=torch.float16) i = 0 while True: c = torch.matmul(a, b) torch.cuda.synchronize() i += 1 if i % 100 == 0: print(f'[{pod}] iter={i} vram={torch.cuda.memory_allocated() // 1024**2}MB', flush=True) time.sleep(0.1) env: - name: 
POD_NAME valueFrom: fieldRef: fieldPath: metadata.name - name: PYTHONUNBUFFERED value: "1" resources: limits: nvidia.com/gpu: "1" # REQUIRED — HAMi trigger, without this it ignores the pod nvidia.com/gpucores: "25" # 25% SM core cap (soft throttle) nvidia.com/gpumem-percentage: "20" # 20% of VRAM = ~2048MB hard wall apiVersion: apps/v1 kind: Deployment metadata: name: gpu-worker-a spec: replicas: 1 selector: matchLabels: app: gpu-worker instance: worker-a template: metadata: labels: app: gpu-worker instance: worker-a spec: schedulerName: hami-scheduler # critical — tells K8s to use HAMi's scheduler containers: - name: gpu-worker image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime command: ["python3", "-u", "-c"] args: - | import torch, time, os pod = os.environ.get('POD_NAME', 'worker-a') device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') print(f'[{pod}] device={device} gpu={torch.cuda.get_device_name(0)}', flush=True) # Allocate 1.5GB resident tensor (well within 2GB limit) elements = (1500 * 1024 * 1024) // 4 blob = torch.zeros(elements, dtype=torch.float32, device=device) print(f'[{pod}] VRAM allocated: {torch.cuda.memory_allocated() // 1024**2}MB', flush=True) a = torch.randn(1024, 1024, device=device, dtype=torch.float16) b = torch.randn(1024, 1024, device=device, dtype=torch.float16) i = 0 while True: c = torch.matmul(a, b) torch.cuda.synchronize() i += 1 if i % 100 == 0: print(f'[{pod}] iter={i} vram={torch.cuda.memory_allocated() // 1024**2}MB', flush=True) time.sleep(0.1) env: - name: POD_NAME valueFrom: fieldRef: fieldPath: metadata.name - name: PYTHONUNBUFFERED value: "1" resources: limits: nvidia.com/gpu: "1" # REQUIRED — HAMi trigger, without this it ignores the pod nvidia.com/gpucores: "25" # 25% SM core cap (soft throttle) nvidia.com/gpumem-percentage: "20" # 20% of VRAM = ~2048MB hard wall apiVersion: apps/v1 kind: Deployment metadata: name: gpu-worker-a spec: replicas: 1 selector: matchLabels: app: gpu-worker instance: worker-a 
template: metadata: labels: app: gpu-worker instance: worker-a spec: schedulerName: hami-scheduler # critical — tells K8s to use HAMi's scheduler containers: - name: gpu-worker image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime command: ["python3", "-u", "-c"] args: - | import torch, time, os pod = os.environ.get('POD_NAME', 'worker-a') device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') print(f'[{pod}] device={device} gpu={torch.cuda.get_device_name(0)}', flush=True) # Allocate 1.5GB resident tensor (well within 2GB limit) elements = (1500 * 1024 * 1024) // 4 blob = torch.zeros(elements, dtype=torch.float32, device=device) print(f'[{pod}] VRAM allocated: {torch.cuda.memory_allocated() // 1024**2}MB', flush=True) a = torch.randn(1024, 1024, device=device, dtype=torch.float16) b = torch.randn(1024, 1024, device=device, dtype=torch.float16) i = 0 while True: c = torch.matmul(a, b) torch.cuda.synchronize() i += 1 if i % 100 == 0: print(f'[{pod}] iter={i} vram={torch.cuda.memory_allocated() // 1024**2}MB', flush=True) time.sleep(0.1) env: - name: POD_NAME valueFrom: fieldRef: fieldPath: metadata.name - name: PYTHONUNBUFFERED value: "1" resources: limits: nvidia.com/gpu: "1" # REQUIRED — HAMi trigger, without this it ignores the pod nvidia.com/gpucores: "25" # 25% SM core cap (soft throttle) nvidia.com/gpumem-percentage: "20" # 20% of VRAM = ~2048MB hard wall apiVersion: apps/v1 kind: Deployment metadata: name: gpu-worker-b spec: replicas: 1 selector: matchLabels: app: gpu-worker instance: worker-b template: metadata: labels: app: gpu-worker instance: worker-b spec: schedulerName: hami-scheduler containers: - name: gpu-worker image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime command: ["python3", "-u", "-c"] args: - | import torch, time, os pod = os.environ.get('POD_NAME', 'worker-b') device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') print(f'[{pod}] device={device} gpu={torch.cuda.get_device_name(0)}', flush=True) elements = 
(2000 * 1024 * 1024) // 4 blob = torch.zeros(elements, dtype=torch.float32, device=device) print(f'[{pod}] VRAM allocated: {torch.cuda.memory_allocated() // 1024**2}MB', flush=True) a = torch.randn(2048, 2048, device=device, dtype=torch.float16) b = torch.randn(2048, 2048, device=device, dtype=torch.float16) i = 0 while True: c = torch.matmul(a, b) torch.cuda.synchronize() i += 1 if i % 100 == 0: print(f'[{pod}] iter={i} vram={torch.cuda.memory_allocated() // 1024**2}MB', flush=True) time.sleep(0.05) env: - name: POD_NAME valueFrom: fieldRef: fieldPath: metadata.name - name: PYTHONUNBUFFERED value: "1" resources: limits: nvidia.com/gpu: "1" nvidia.com/gpucores: "40" nvidia.com/gpumem-percentage: "30" # 30% = ~3072MB hard wall apiVersion: apps/v1 kind: Deployment metadata: name: gpu-worker-b spec: replicas: 1 selector: matchLabels: app: gpu-worker instance: worker-b template: metadata: labels: app: gpu-worker instance: worker-b spec: schedulerName: hami-scheduler containers: - name: gpu-worker image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime command: ["python3", "-u", "-c"] args: - | import torch, time, os pod = os.environ.get('POD_NAME', 'worker-b') device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') print(f'[{pod}] device={device} gpu={torch.cuda.get_device_name(0)}', flush=True) elements = (2000 * 1024 * 1024) // 4 blob = torch.zeros(elements, dtype=torch.float32, device=device) print(f'[{pod}] VRAM allocated: {torch.cuda.memory_allocated() // 1024**2}MB', flush=True) a = torch.randn(2048, 2048, device=device, dtype=torch.float16) b = torch.randn(2048, 2048, device=device, dtype=torch.float16) i = 0 while True: c = torch.matmul(a, b) torch.cuda.synchronize() i += 1 if i % 100 == 0: print(f'[{pod}] iter={i} vram={torch.cuda.memory_allocated() // 1024**2}MB', flush=True) time.sleep(0.05) env: - name: POD_NAME valueFrom: fieldRef: fieldPath: metadata.name - name: PYTHONUNBUFFERED value: "1" resources: limits: nvidia.com/gpu: "1" 
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-worker-b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-worker
      instance: worker-b
  template:
    metadata:
      labels:
        app: gpu-worker
        instance: worker-b
    spec:
      schedulerName: hami-scheduler
      containers:
        - name: gpu-worker
          image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
          command: ["python3", "-u", "-c"]
          args:
            - |
              import torch, time, os
              pod = os.environ.get('POD_NAME', 'worker-b')
              device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
              print(f'[{pod}] device={device} gpu={torch.cuda.get_device_name(0)}', flush=True)
              elements = (2000 * 1024 * 1024) // 4
              blob = torch.zeros(elements, dtype=torch.float32, device=device)
              print(f'[{pod}] VRAM allocated: {torch.cuda.memory_allocated() // 1024**2}MB', flush=True)
              a = torch.randn(2048, 2048, device=device, dtype=torch.float16)
              b = torch.randn(2048, 2048, device=device, dtype=torch.float16)
              i = 0
              while True:
                  c = torch.matmul(a, b)
                  torch.cuda.synchronize()
                  i += 1
                  if i % 100 == 0:
                      print(f'[{pod}] iter={i} vram={torch.cuda.memory_allocated() // 1024**2}MB', flush=True)
                  time.sleep(0.05)
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: PYTHONUNBUFFERED
              value: "1"
          resources:
            limits:
              nvidia.com/gpu: "1"
              nvidia.com/gpucores: "40"
              nvidia.com/gpumem-percentage: "30"   # 30% = ~3072MB hard wall
```

```bash
microk8s kubectl apply -f gpu_worker_a.yaml
microk8s kubectl apply -f gpu_worker_b.yaml

# Watch logs from both simultaneously
microk8s kubectl logs -l app=gpu-worker --prefix=true -f
```

```
[pod/gpu-worker-a-.../gpu-worker] [worker-a] device=cuda gpu=NVIDIA GeForce RTX 3080
[pod/gpu-worker-b-.../gpu-worker] [worker-b] device=cuda gpu=NVIDIA GeForce RTX 3080
[pod/gpu-worker-b-.../gpu-worker] [worker-b] VRAM allocated: 2000MB
[pod/gpu-worker-a-.../gpu-worker] [worker-a] VRAM allocated: 1500MB
[pod/gpu-worker-b-.../gpu-worker] [worker-b] iter=100 vram=2032MB
[pod/gpu-worker-a-.../gpu-worker] [worker-a] iter=100 vram=1514MB
[pod/gpu-worker-b-.../gpu-worker] [worker-b] iter=200 vram=2032MB
[pod/gpu-worker-a-.../gpu-worker] [worker-a] iter=200 vram=1514MB
```

```
+-------------------------------------------------------------------------------------+
| Processes:                                                                          |
|  GPU   GI   CI        PID   Type   Process name                        GPU Memory   |
|=====================================================================================|
|    0   N/A  N/A     116033      C   python3                               1828MiB   |
|    0   N/A  N/A     116034      C   python3                               2860MiB   |
+-------------------------------------------------------------------------------------+
```

```bash
microk8s kubectl logs -n kube-system \
  $(microk8s kubectl get pod -n kube-system \
      -l app.kubernetes.io/component=hami-scheduler \
      -o jsonpath='{.items[0].metadata.name}') \
  -c vgpu-scheduler-extender --since=2m | grep "allocate success"
```

```bash
# This is required — do it right after HAMi install
microk8s kubectl label node <your-node-name> gpu=on
```
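For a quick host-side view without any HAMi tooling, the `Processes` table that `nvidia-smi` prints can also be scraped programmatically. A sketch, assuming the single-GPU layout shown above — column spacing varies across driver versions, so treat the regex as a starting point, not a guarantee:

```python
# Pull (pid, process name, memory) rows out of nvidia-smi's
# Processes table. Assumes the table layout shown above.
import re
import subprocess

def parse_processes(smi_text):
    """Return [(pid, name, mem_mib)] parsed from nvidia-smi output."""
    rows = []
    for line in smi_text.splitlines():
        m = re.match(
            r'\|\s+\d+\s+\S+\s+\S+\s+(\d+)\s+\S+\s+(\S+)\s+(\d+)MiB',
            line)
        if m:
            rows.append((int(m.group(1)), m.group(2), int(m.group(3))))
    return rows

if __name__ == "__main__":
    out = subprocess.run(["nvidia-smi"], capture_output=True,
                         text=True).stdout
    for pid, name, mem in parse_processes(out):
        print(f"pid={pid} {name}: {mem}MiB")
```

Correlating those PIDs back to pods is still manual work — which is exactly the gap HAMi's per-pod metrics close later on.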
```bash
microk8s kubectl delete pod -l app=gpu-worker
# They reschedule automatically via the Deployment
```

```bash
# Must be non-empty
microk8s kubectl exec $POD_A -- env | grep CUDA_DEVICE_MEMORY_SHARED_CACHE

# Must say "success"
microk8s kubectl get pods -l app=gpu-worker -o yaml | grep "bind-phase"
```

```bash
# Two containers, same GPU, no isolation
docker run --gpus device=0 -e NVIDIA_VISIBLE_DEVICES=0 pytorch/pytorch:latest python3 -c "
import torch
# This will happily allocate ALL available VRAM
blob = torch.zeros(9_000_000_000 // 4, dtype=torch.float32, device='cuda')
print(f'Allocated: {torch.cuda.memory_allocated() // 1024**2}MB')
"
```

```python
# This is a suggestion, not enforcement
torch.cuda.set_per_process_memory_fraction(0.5, device=0)
```

```
Container calls cudaMalloc(1GB)
        ↓
libvgpu.so intercepts
        ↓
cumulative_alloc + 1GB > pod_limit?
  YES → return CUDA_ERROR_OUT_OF_MEMORY (your pod, your problem)
  NO  → pass through to real cudaMalloc
```
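That decision tree is, at its core, cumulative bookkeeping per pod. A toy model of the idea — illustrative names only, not HAMi's actual implementation:

```python
# Toy model of the hijacked cudaMalloc path: every allocation is
# counted against the pod's hard VRAM limit, and the allocation
# that would cross it is refused. Purely illustrative.
class VGPUAllocator:
    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.allocated = 0

    def cuda_malloc(self, size):
        if self.allocated + size > self.limit:
            # libvgpu.so returns CUDA_ERROR_OUT_OF_MEMORY here;
            # modeled as an exception
            raise MemoryError("CUDA_ERROR_OUT_OF_MEMORY")
        self.allocated += size

# worker-b's wall: 30% of a 10 GB card, roughly 3 GB
pod = VGPUAllocator(limit_bytes=3 * 1024**3)
pod.cuda_malloc(2 * 1024**3)        # fits
try:
    pod.cuda_malloc(2 * 1024**3)    # would total 4 GB -> refused
except MemoryError as e:
    print(e)                        # CUDA_ERROR_OUT_OF_MEMORY
```

The key property: the limit is enforced at allocation time, per pod, so a greedy neighbour fails inside its own container instead of starving everyone else.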
```bash
curl -s http://localhost:31992/metrics | grep -v "^#"
```

```
vGPU_device_memory_usage_in_bytes{podname="gpu-worker-a",...} 1.82884864e+09
vGPU_device_memory_usage_in_bytes{podname="gpu-worker-b",...} 2.39507968e+09
vGPU_device_memory_limit_in_bytes{podname="gpu-worker-a",...} 2.147483648e+09
vGPU_device_memory_limit_in_bytes{podname="gpu-worker-b",...} 3.221225472e+09
Device_utilization_desc_of_container{podname="gpu-worker-a",...} 12
Device_utilization_desc_of_container{podname="gpu-worker-b",...} 31
HostCoreUtilization{deviceuuid="GPU-53aae475-...",...} 14
HostGPUMemoryUsage{deviceuuid="GPU-53aae475-...",...} 5.82e+09
```

```bash
curl -s http://localhost:31993/metrics | grep -v "^#"
```

```
GPUDeviceSharedNum{...} 2
GPUDeviceCoreAllocated{...} 65
GPUDeviceMemoryAllocated{...} 5.36870912e+09
vGPUCoreAllocated{podname="gpu-worker-a",...} 25
vGPUCoreAllocated{podname="gpu-worker-b",...} 40
vGPUMemoryAllocated{podname="gpu-worker-a",...} 2.147483648e+09
vGPUMemoryAllocated{podname="gpu-worker-b",...} 3.221225472e+09
QuotaUsed{quotaName="nvidia.com/gpumem", quotanamespace="default",...} 5120
```

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hami-scheduler-metrics
  namespace: observability
  labels:
    release: kube-prom-stack
spec:
  namespaceSelector:
    matchNames:
      - kube-system
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-scheduler
      app.kubernetes.io/instance: hami
  endpoints:
    - port: monitor        # → pod :9395
      interval: 10s
      path: /metrics
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hami-device-plugin-metrics
  namespace: observability
  labels:
    release: kube-prom-stack
spec:
  namespaceSelector:
    matchNames:
      - kube-system
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-device-plugin
      app.kubernetes.io/instance: hami
  endpoints:
    - port: monitorport    # → pod :9394
      interval: 5s
      path: /metrics
```
- gpumem-percentage: "20" → Hard. If exceeded → CUDA_ERROR_OUT_OF_MEMORY. Deterministic.
- gpucores: "25" → Soft. Best-effort ±5–10%. Not a hardware guarantee.

```
Physical GPU (e.g. RTX 3080 — 10 GB VRAM)
├── gpu-worker-a → 20% VRAM (~2 GB) + 25% SM cores
└── gpu-worker-b → 30% VRAM (~3 GB) + 40% SM cores
```

```bash
# 1. Install cert-manager
microk8s kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.4/cert-manager.yaml
microk8s kubectl wait --for=condition=ready pod \
  -l app.kubernetes.io/instance=cert-manager -n cert-manager --timeout=180s

# 2. Install HAMi
microk8s helm3 repo add hami-charts https://project-hami.github.io/HAMi/
microk8s helm3 repo update
K8S_VERSION=$(microk8s kubectl version -o json | python3 -c \
  "import sys,json,re; v=json.load(sys.stdin)['serverVersion']['gitVersion'].lstrip('v'); print(re.split(r'[+\-]',v)[0])")
microk8s helm3 install hami hami-charts/hami --namespace kube-system \
  --set scheduler.kubeScheduler.imageTag=v${K8S_VERSION} \
  --set devicePlugin.nvidiaDriverPath=/usr/local/nvidia \
  --set scheduler.defaultSchedulerPolicy.gpuMemory=true \
  --set scheduler.defaultSchedulerPolicy.gpuCores=true

# 3. Label your GPU node
microk8s kubectl label node <your-node-name> gpu=on

# 4. Deploy workers
microk8s kubectl apply -f gpu_worker_a.yaml
microk8s kubectl apply -f gpu_worker_b.yaml

# 5. Watch the magic
microk8s kubectl logs -l app=gpu-worker --prefix=true -f &
watch -n 2 nvidia-smi

# 6. Verify HAMi's view of the split
curl -s http://localhost:31993/metrics | grep -v "^#"
```
- No VRAM isolation — Docker containers on the same GPU shared memory completely. One greedy process could OOM the rest, and when it happened everything fell over at once.
- GPU workloads living outside K8s — a whole class of tasks with no K8s lifecycle management, no health checks, no rolling restarts. A permanent special case that needed permanent special handling.
- Node affinity as a constraint, not a choice — the orchestrator had to be pinned to the GPU node to reach the Docker daemon. Scaling to multiple GPU nodes meant more orchestrators, more complexity, more things to coordinate.
- No per-container GPU metrics — visibility into who was using what meant scraping nvidia-smi and correlating PIDs manually. Fragile and tedious.

- Unlike MIG, it doesn't require MIG-capable hardware
- Unlike time-slicing, it enforces VRAM isolation (not just temporal sharing)
- Unlike MPS, a failing pod doesn't crash the shared context
- Unlike plain Docker, it's K8s-native and actually enforces limits

- VRAM — hard cap; allocations past the limit fail with CUDA_ERROR_OUT_OF_MEMORY, which typically kills the offending pod
- GPU cores — soft cap via kernel submission throttling (±5–10% deviation is normal)

- Ubuntu 22.04 / 24.04
- NVIDIA driver installed on the host (nvidia-smi works)
- MicroK8s installed (snap install microk8s --classic)

- sanity_check.yaml — Verify GPU access before installing HAMi
- gpu_worker_a.yaml — Worker A deployment (20% VRAM, 25% cores)
- gpu_worker_b.yaml — Worker B deployment (30% VRAM, 40% cores)
- hami_service_monitoring.yaml — Prometheus ServiceMonitors for both HAMi endpoints
- grafana_dashboard.yaml — Auto-importing Grafana dashboard via ConfigMap