# Model resource: tells the LLMKube operator which GGUF artifact to pull
# and what hardware to schedule it on.
---
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: llama-3-8b
spec:
  source: https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
  format: gguf
  quantization: Q4_K_M
  hardware:
    accelerator: cuda
    gpu:
      count: 1
---
# InferenceService resource: serves the Model referenced by modelRef.
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: llama-3-8b
spec:
  modelRef: llama-3-8b
  replicas: 1
  resources:
    # Quoted so the parser keeps these as strings, not numbers.
    cpu: "2"
    memory: "4Gi"
# Model resource (duplicate of the example above). A `---` separator is
# required here: the previous document ends on the preceding line.
---
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: llama-3-8b
spec:
  source: https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
  format: gguf
  quantization: Q4_K_M
  hardware:
    accelerator: cuda
    gpu:
      count: 1
---
# InferenceService resource: serves the Model referenced by modelRef.
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: llama-3-8b
spec:
  modelRef: llama-3-8b
  replicas: 1
  resources:
    # Quoted so the parser keeps these as strings, not numbers.
    cpu: "2"
    memory: "4Gi"
# Model resource (duplicate of the example above). A `---` separator is
# required here: the previous document ends on the preceding line.
---
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: llama-3-8b
spec:
  source: https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
  format: gguf
  quantization: Q4_K_M
  hardware:
    accelerator: cuda
    gpu:
      count: 1
---
# InferenceService resource: serves the Model referenced by modelRef.
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: llama-3-8b
spec:
  modelRef: llama-3-8b
  replicas: 1
  resources:
    # Quoted so the parser keeps these as strings, not numbers.
    cpu: "2"
    memory: "4Gi"
# On your Mac
brew install llama.cpp
llmkube-metal-agent --host-ip 192.168.1.x

# From anywhere in the cluster
llmkube deploy qwen-30b-a3b --accelerator metal
# On your Mac
brew install llama.cpp
llmkube-metal-agent --host-ip 192.168.1.x

# From anywhere in the cluster
llmkube deploy qwen-30b-a3b --accelerator metal
# On your Mac
brew install llama.cpp
llmkube-metal-agent --host-ip 192.168.1.x

# From anywhere in the cluster
llmkube deploy qwen-30b-a3b --accelerator metal
# Multi-GPU fragment: layer-based sharding across 2 CUDA GPUs.
# NOTE(review): nesting of `sharding` under `hardware` is inferred from the
# flattened original — confirm against the Model CRD schema.
spec:
  hardware:
    accelerator: cuda
    gpu:
      count: 2
    sharding:
      strategy: layer
# Multi-GPU fragment: layer-based sharding across 2 CUDA GPUs.
# NOTE(review): nesting of `sharding` under `hardware` is inferred from the
# flattened original — confirm against the Model CRD schema.
spec:
  hardware:
    accelerator: cuda
    gpu:
      count: 2
    sharding:
      strategy: layer
# Multi-GPU fragment: layer-based sharding across 2 CUDA GPUs.
# NOTE(review): nesting of `sharding` under `hardware` is inferred from the
# flattened original — confirm against the Model CRD schema.
spec:
  hardware:
    accelerator: cuda
    gpu:
      count: 2
    sharding:
      strategy: layer
# Install the CLI
brew install defilantech/tap/llmkube

# Add the Helm repo and install the operator
helm repo add llmkube https://defilantech.github.io/LLMKube
helm install llmkube llmkube/llmkube \
  --namespace llmkube-system \
  --create-namespace
# Install the CLI
brew install defilantech/tap/llmkube

# Add the Helm repo and install the operator
helm repo add llmkube https://defilantech.github.io/LLMKube
helm install llmkube llmkube/llmkube \
  --namespace llmkube-system \
  --create-namespace
# Install the CLI
brew install defilantech/tap/llmkube

# Add the Helm repo and install the operator
helm repo add llmkube https://defilantech.github.io/LLMKube
helm install llmkube llmkube/llmkube \
  --namespace llmkube-system \
  --create-namespace
# One-command deployment via the CLI — presumably this creates the Model and
# InferenceService resources shown above; confirm with `kubectl get models`.
# Deploy Phi-4 Mini (3.8B params, from the built-in catalog)
llmkube deploy phi-4-mini
# Deploy Phi-4 Mini (3.8B params, from the built-in catalog)
llmkube deploy phi-4-mini
# Deploy Phi-4 Mini (3.8B params, from the built-in catalog)
llmkube deploy phi-4-mini
# Port-forward and test
kubectl port-forward svc/phi-4-mini 8080:8080 &
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is Kubernetes in one sentence?"}
    ],
    "max_tokens": 100
  }'
# Port-forward and test
kubectl port-forward svc/phi-4-mini 8080:8080 &
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is Kubernetes in one sentence?"}
    ],
    "max_tokens": 100
  }'
# Port-forward and test
kubectl port-forward svc/phi-4-mini 8080:8080 &
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is Kubernetes in one sentence?"}
    ],
    "max_tokens": 100
  }'
# Use the OpenAI Python SDK against the port-forwarded service. The server
# does not check credentials, so any placeholder api_key works.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
)
response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
# Use the OpenAI Python SDK against the port-forwarded service. The server
# does not check credentials, so any placeholder api_key works.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
)
response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
# Use the OpenAI Python SDK against the port-forwarded service. The server
# does not check credentials, so any placeholder api_key works.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
)
response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
# Deploy Llama 3.1 8B with GPU acceleration on a single GPU
llmkube deploy llama-3.1-8b --gpu --gpu-count 1
llmkube deploy llama-3.1-8b --gpu --gpu-count 1
llmkube deploy llama-3.1-8b --gpu --gpu-count 1
# Fragment: load the model from a PersistentVolumeClaim instead of a URL,
# with a checksum for integrity verification.
spec:
  source: pvc://model-storage/models/llama-3-8b-q4.gguf
  sha256: a1b2c3d4e5f6...  # placeholder — use the real digest
# Fragment: load the model from a PersistentVolumeClaim instead of a URL,
# with a checksum for integrity verification.
spec:
  source: pvc://model-storage/models/llama-3-8b-q4.gguf
  sha256: a1b2c3d4e5f6...  # placeholder — use the real digest
# Fragment: load the model from a PersistentVolumeClaim instead of a URL,
# with a checksum for integrity verification.
spec:
  source: pvc://model-storage/models/llama-3-8b-q4.gguf
  sha256: a1b2c3d4e5f6...  # placeholder — use the real digest

- Run llama.cpp on Kubernetes with proper lifecycle management
- Deploy models with a single command or a two-resource YAML
- Use NVIDIA GPUs with CUDA acceleration
- Use Apple Silicon Macs as GPU inference nodes in your cluster
- Split models across multiple GPUs for larger models
- Monitor everything with Prometheus and Grafana

- AMD Ryzen 9 7900X (12 cores / 24 threads)
- 64GB DDR5-6000
- 2x NVIDIA RTX 5060 Ti (16GB VRAM each, 32GB total)
- Samsung 990 Pro 1TB NVMe
- Running MicroK8s as a single-node Kubernetes cluster

- Watches the Kubernetes API for InferenceService resources with accelerator: metal
- Spawns llama-server natively on macOS with full Metal GPU access
- Registers the endpoint back into Kubernetes so other services can route to it

- A Kubernetes cluster (Minikube, kind, K3s, or any managed cluster)
- kubectl configured

- Edge deployment support for lightweight Kubernetes distributions like K3s and MicroK8s
- AMD GPU support (ROCm) with a community contributor already testing on Framework hardware with a Ryzen AI Max+ 395
- llmkube chat for testing models directly from the CLI without needing curl