Tools: How to Deploy an Open Source LLM Reliably on Kubernetes (Step-by-Step) (2026)


Introduction

What We Are Building

Prerequisites

Step 1: Create the Kubernetes Cluster

Step 2: Deploy Ollama + TinyLlama

Step 3: Monitoring with Prometheus and Grafana

Import the Kubernetes Dashboard

Create a Custom Ollama Panel

Step 4: Build the Chatbot UI

Results

The Real Story Behind These Numbers

When to Use Each

Lessons Learned

What to Try Next

Full Code

Conclusion

## Introduction

Running AI models in production requires more than just downloading a model and running it locally. Anyone can run `ollama run mistral` in a terminal, but what happens when that process crashes at 2am? What happens when you need to monitor memory usage, restart failed services automatically, or scale to handle more requests? That is exactly what Kubernetes solves.

In this guide I will walk you through the complete process of deploying TinyLlama (a real open source LLM) inside a production-grade Kubernetes cluster on your local machine, with live monitoring via Grafana, a working chatbot UI, and a head-to-head performance comparison against Claude Haiku from Anthropic. By the end you will have a fully working AI stack that auto-restarts on failure, shows you real-time health metrics, and costs you nothing to run.

Full code: https://github.com/daksh777f/llm-on-kubernetes

## What We Are Building

[Next.js Chatbot UI :3001]

        |
[Ollama API :11434]
        |
[Kubernetes Cluster (k3d)]
 ├── llm namespace
 │    └── Ollama Pod (serves TinyLlama)
 └── monitoring namespace
      ├── Prometheus (scrapes metrics)
      └── Grafana (visualizes dashboards)

The full stack uses k3d, Ollama, TinyLlama, Prometheus + Grafana, and Next.js; each tool is described in the list at the end of this post.

## Prerequisites

Before starting, you need Docker Desktop (running), kubectl, k3d, Helm, and Node.js installed. On Windows, install the last four with one Chocolatey command (see the code listings at the end of this post).

## Step 1: Create the Kubernetes Cluster

We use k3d because it creates a real multi-node Kubernetes cluster that runs entirely inside Docker containers. No cloud account, no VM setup, no cost. The `k3d cluster create` command in the code listings creates a cluster with one server node and one agent node; its `--port` flag exposes port 80 through a load balancer for later use.

Verify both nodes are ready with `kubectl get nodes`. You should see two nodes, one control-plane and one agent, both with status Ready. That is your Kubernetes cluster running locally inside Docker.

## Step 2: Deploy Ollama + TinyLlama

Ollama is an open source tool that serves LLMs as a REST API. We deploy it as a Kubernetes Deployment so that if the pod ever crashes, Kubernetes automatically restarts it.

Create a file called ollama.yaml (the full manifest is in the code listings). Notice the livenessProbe and readinessProbe: these are what make the deployment reliable. Kubernetes continuously checks whether Ollama is responding; if it stops responding, Kubernetes kills the pod and starts a fresh one automatically.

Apply the deployment with `kubectl apply -f ollama.yaml` and wait for the pod to be running with `kubectl get pods -n llm -w`. Once the pod shows 1/1 Running, pull TinyLlama into it with `kubectl exec -n llm deployment/ollama -- ollama pull tinyllama`. This downloads the 637MB TinyLlama model inside the running pod. Test it with a request to the API (see the listings); you should see a `response` field with text from TinyLlama. Your LLM is running inside Kubernetes.

## Step 3: Monitoring with Prometheus and Grafana

Deploying without monitoring is flying blind. We use the kube-prometheus-stack Helm chart, which installs Prometheus, Grafana, and all the necessary exporters in a single command.

Install the monitoring stack with the `helm install` command in the code listings. The last two `--set` flags reduce memory usage, which matters on 8GB machines, and the `alertmanager.enabled=false` flag skips the alert manager to save another ~200MB.
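If you would rather keep those overrides in version control than on the command line, the same settings can live in a Helm values file. A sketch (the keys below simply mirror the `--set` paths from the install command, so they map one-to-one; the filename is mine):

```yaml
# values-monitoring.yaml: identical to the --set flags from the install command
grafana:
  adminPassword: admin123
alertmanager:
  enabled: false
prometheus:
  prometheusSpec:
    retention: 6h
    resources:
      requests:
        memory: 256Mi
      limits:
        memory: 512Mi
```

Then install with `helm install monitoring prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace -f values-monitoring.yaml`.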
Wait for all pods to reach Running status with `kubectl get pods -n monitoring`, then access Grafana by port-forwarding (the command is in the code listings). Open http://localhost:3000 and log in with admin / admin123. You now have a live dashboard showing cluster CPU, memory, pod counts, and network traffic, all updating in real time.

### Import the Kubernetes Dashboard

Follow the import steps listed at the end of this post: load dashboard ID 15661 and select Prometheus as the data source.

### Create a Custom Ollama Panel

Build a Stat panel on the query `kube_pod_info{namespace="llm"}` (exact click-through steps are at the end of this post). This panel shows you at a glance whether your LLM pod is up. In a production setup, you would wire this to an alert that pages you when it goes down.

## Step 4: Build the Chatbot UI

The chatbot is built with Next.js and Tailwind CSS. It talks to TinyLlama through a server-side API route that calls Ollama directly. The API route (app/api/chat/route.ts) handles all communication with Ollama; the full file is in the code listings. Run the dev server on port 3001 with `npm run dev -- --port 3001`, since Grafana already uses port 3000. Open http://localhost:3001 and you have a working chatbot talking to your Kubernetes-hosted LLM.

## Results

This is where it gets interesting. I wrote a Python script that ran 10 identical prompts through both TinyLlama running in my Kubernetes cluster and Claude Haiku from Anthropic's API, measuring response time and cost for every single query. The 10 prompts covered a range of task types, from identity questions to coding and creative tasks (the full list is at the end of this post).

## The Real Story Behind These Numbers

Claude is 49 times faster. That gap comes entirely from hardware. TinyLlama running on a CPU on an 8GB laptop takes 20 seconds; the same model running on a GPU node in Kubernetes would take under 1 second. Claude runs on Anthropic's optimized GPU infrastructure and responds in under half a second consistently.

TinyLlama is completely free. For 10 queries, 100 queries, or 1 million queries, the cost is exactly zero: you pay for electricity and hardware you already own. Claude charges per token, which at Haiku pricing is extremely cheap (~$0.0003 per query) but adds up at scale. At 1 million queries per day, that is $300/day vs $0/day.

Privacy is where local wins completely. Every query you send to Claude goes over the internet to Anthropic's servers. Every query you send to TinyLlama in your Kubernetes cluster never leaves your machine.
For healthcare, legal, or financial applications where data privacy is non-negotiable, local is the only option.

## When to Use Each

Use TinyLlama (local Kubernetes) when privacy, cost at scale, and control matter most; use Claude or another commercial API when response speed, reasoning quality, or reliability SLAs matter most. The full criteria for each are listed at the end of this post. The production answer is usually both: route privacy-sensitive queries to your local Kubernetes LLM and quality-critical queries to a commercial API. This hybrid architecture gives you the best of both worlds.

## Lessons Learned

1. Kubernetes adds real reliability that you cannot get with raw Docker. The livenessProbe in our Ollama deployment means that if the model crashes or hangs, Kubernetes detects it within 30 seconds and automatically restarts the pod. With a plain Docker container, you would need to notice the crash yourself and restart it manually.

2. RAM matters more than you think with LLMs. Mistral 7B requires 4.5GB of RAM and completely failed on my 8GB machine because the OS and Kubernetes overhead left only 3.1GB free. TinyLlama at 637MB ran perfectly. Always check model requirements before deploying; a model that cannot load is worse than no model.

3. One Helm command beats hours of configuration. The entire Prometheus + Grafana monitoring stack, which would take hours to configure manually, installed with a single `helm install` command. This is the power of the Kubernetes ecosystem.

4. The speed gap closes dramatically with a GPU. TinyLlama takes 20 seconds on a CPU. On an NVIDIA T4 GPU (available on GKE for ~$0.35/hour), the same model runs in under 1 second. If you need local and fast, the answer is a GPU node, not a bigger CPU model.

5. Port-forwarding is for development only.

In production you would use a Kubernetes Ingress with a real domain name and TLS certificate, not port-forwarding. Everything in this guide is the right foundation for production, but swap port-forwards for a proper Ingress before going live.

## What to Try Next

Ideas for extending this stack, from GPU nodes to streaming and real alerting, are listed at the end of this post.

## Full Code

Everything in this guide is available on GitHub: https://github.com/daksh777f/llm-on-kubernetes

## Conclusion

Deploying an open source LLM reliably is not just about running a model. It is about building infrastructure that handles failures gracefully, gives you visibility into what is happening, and scales when you need it to. The stack we built today (k3d + Ollama + Prometheus + Grafana + Next.js) is a genuine foundation for a production AI system. TinyLlama is free, private, and good enough for a wide range of tasks. When you need more power, swap it for Mistral or Llama 3 on a GPU node.

The future of AI is not just API calls to commercial providers. It is open source models running on infrastructure you own and control. Now you know exactly how to build it.

If this guide helped you, drop a star on the GitHub repo and share it with someone building with LLMs.

## Code Listings

Install the CLI tools on Windows (Chocolatey):

```powershell
choco install kubernetes-cli k3d kubernetes-helm nodejs -y
```

Step 1: create the cluster and verify the nodes:

```shell
k3d cluster create llm-cluster --agents 1 --port "8080:80@loadbalancer"
kubectl get nodes
```

Step 2: ollama.yaml:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: llm
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            requests:
              memory: "2Gi"
              cpu: "1"
            limits:
              memory: "4Gi"
              cpu: "2"
          livenessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 10
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: llm
  labels:
    app: ollama
spec:
  selector:
    app: ollama
  ports:
    - name: http
      port: 11434
      targetPort: 11434
  type: ClusterIP
```

Deploy, watch the pod, and pull the model:

```shell
kubectl apply -f ollama.yaml
kubectl get pods -n llm -w
kubectl exec -n llm deployment/ollama -- ollama pull tinyllama
```

Test the model through a port-forward (PowerShell):

```powershell
kubectl port-forward -n llm svc/ollama-service 11434:11434

$body = '{"model":"tinyllama","prompt":"say hello","stream":false}'
Invoke-RestMethod -Uri "http://127.0.0.1:11434/api/generate" `
  -Method Post -ContentType "application/json" -Body $body
```
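If you prefer a cross-platform check over the PowerShell snippet above, here is a minimal Python smoke test. It assumes the same port-forward is running; the function names are mine, not from the repo.

```python
import json
import urllib.request

# Assumes: kubectl port-forward -n llm svc/ollama-service 11434:11434
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"


def build_payload(prompt: str, model: str = "tinyllama") -> bytes:
    """Build the JSON body that Ollama's /api/generate endpoint expects."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()


def extract_text(raw: bytes) -> str:
    """Pull the generated text out of a non-streaming Ollama reply."""
    return json.loads(raw).get("response", "No response from model.")


def smoke_test(prompt: str = "say hello") -> str:
    """Send one prompt through the port-forward and return the model's reply."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return extract_text(resp.read())
```

With the port-forward active, calling `smoke_test()` should print a short TinyLlama reply; on a CPU-only 8GB machine expect it to take on the order of the 20 seconds measured above.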
Step 3: install the monitoring stack (PowerShell):

```powershell
helm repo add prometheus-community `
  https://prometheus-community.github.io/helm-charts
helm repo update

helm install monitoring prometheus-community/kube-prometheus-stack `
  --namespace monitoring `
  --create-namespace `
  --set grafana.adminPassword=admin123 `
  --set alertmanager.enabled=false `
  --set prometheus.prometheusSpec.retention=6h `
  --set prometheus.prometheusSpec.resources.requests.memory=256Mi `
  --set prometheus.prometheusSpec.resources.limits.memory=512Mi

kubectl get pods -n monitoring
kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80
```

Step 4: scaffold the chatbot:

```shell
npx create-next-app@latest chatbot --typescript --tailwind --app --yes
cd chatbot
```

app/api/chat/route.ts:

```typescript
import { NextRequest, NextResponse } from "next/server";

export async function POST(req: NextRequest) {
  const { messages } = await req.json();

  // Flatten the chat history into a single prompt for TinyLlama
  const prompt =
    messages
      .map((m: { role: string; content: string }) =>
        m.role === "user" ? `User: ${m.content}` : `Assistant: ${m.content}`
      )
      .join("\n") + "\nAssistant:";

  const response = await fetch("http://127.0.0.1:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "tinyllama", prompt, stream: false }),
  });

  const data = await response.json();
  return NextResponse.json({
    response: data.response || "No response from model.",
  });
}
```

Run on port 3001, since Grafana already uses 3000:

```shell
npm run dev -- --port 3001
```

The full stack uses these tools:

- k3d: Lightweight Kubernetes that runs entirely inside Docker. No cloud account needed.
- Ollama: A tool that serves open source LLMs as a REST API inside a container.
- TinyLlama: A 1.1B parameter open source model that runs in 637MB of RAM. Perfect for 8GB machines.
- Prometheus + Grafana: Industry standard monitoring stack. Prometheus scrapes metrics, Grafana visualizes them.
- Next.js: The chatbot frontend. Calls Ollama through an API route.

Prerequisites:

- Docker Desktop (running)
- kubectl, k3d, Helm, Node.js (installed by the Chocolatey command above)

Import the Kubernetes dashboard:

1. Left sidebar → Dashboards → New → Import
2. Enter ID 15661 → Load
3. Select Prometheus as data source → Import

Create the custom Ollama panel:

1. Dashboards → New → New Dashboard → Add visualization
2. Select Prometheus as data source
3. Switch to Code mode and enter: kube_pod_info{namespace="llm"}
4. Change visualization to Stat
5. Title: Ollama LLM Pod Status
6. Save dashboard as LLM Monitoring

The 10 comparison prompts covered:

- Identity questions ("Which model are you?")
- Technical explanations ("What is Kubernetes?", "What is Docker?")
- Coding tasks ("Write a Python function to reverse a string")
- Creative tasks ("Write a haiku about programming")
- General knowledge ("What is the capital of Australia?")

Use TinyLlama (local Kubernetes) when:

- Data privacy is non-negotiable
- You are cost-sensitive at scale
- Tasks are simple: Q&A, summarization, basic code generation
- You want full control over your AI infrastructure
- You are building a product and do not want API dependency

Use Claude / commercial API when:

- Response speed matters (customer-facing, real-time)
- Tasks need strong reasoning or latest knowledge
- You are prototyping and do not want infra overhead
- Reliability SLAs matter more than cost

What to try next:

- Add a GPU node: deploy to GKE with a T4 GPU and watch TinyLlama go from 20s to under 1s
- Try larger models: with a GPU, Mistral 7B or Llama 3 8B will fit comfortably
- Add streaming responses: Ollama supports streaming; the chatbot can show tokens as they generate instead of waiting
- Set up real alerting: configure Grafana alerts to send a Slack message when the Ollama pod goes down
- Deploy to cloud: replace k3d with GKE or EKS for a production cluster with real uptime guarantees

Repository contents:

- ollama.yaml: Kubernetes deployment for Ollama
- chatbot/: Complete Next.js chatbot
- compare.py: The comparison script
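The repo's compare.py is not reproduced in this post, but the core of such a harness is small. A rough sketch of the timing side (the helper names, `PROMPTS`, and the backend callables here are illustrative, not the actual script):

```python
import statistics
import time


def time_call(fn, *args):
    """Run one model call and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = fn(*args)
    return time.perf_counter() - start, result


def summarize(latencies):
    """Mean and worst-case latency over a prompt set, in seconds."""
    return {"mean_s": statistics.mean(latencies), "max_s": max(latencies)}


# Illustrative wiring: run the same prompts through both backends.
# ask_tinyllama / ask_claude are hypothetical callables, one per backend.
# local_times = [time_call(ask_tinyllama, p)[0] for p in PROMPTS]
# api_times = [time_call(ask_claude, p)[0] for p in PROMPTS]
# print(summarize(local_times), summarize(api_times))
```

Comparing the two `summarize` outputs over the same prompt list is what produces a headline number like "49 times faster".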
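On the streaming bullet above: with `"stream": true`, Ollama's /api/generate returns newline-delimited JSON, one object per line, each carrying a partial `response` string until a final object with `"done": true`. A minimal parser sketch (the function name is mine):

```python
import json


def collect_stream(ndjson_lines):
    """Join token fragments from Ollama's streaming NDJSON output.

    Each line is a JSON object with a partial "response" string;
    the final object has "done": true.
    """
    parts = []
    for line in ndjson_lines:
        if not line.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)
```

In the chatbot, instead of joining the fragments you would forward each one to the browser as it arrives, so tokens appear as they are generated.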