Ollama + Open WebUI Self-Hosting Guide 2026 — Run Your Own AI for $0

Contents

- Why Self-Host Your Own AI in 2026 (the cost, privacy, and learning arguments)
- What Is Ollama + Open WebUI
- Path A: Local Setup on Mac/Linux (5-Minute Quickstart)
- Path B: VPS Deployment on Hetzner (~$5/month)
- Model Recommendations by Use Case
- Open WebUI Features Worth Configuring
- Performance Expectations: An Honest Assessment
- When to Use API Services Instead
- Our Experience at Effloow
- Quick Reference: Getting Started in Under 10 Minutes
- Conclusion

ChatGPT Pro costs $200 a month. Claude Pro costs $20. Even the budget API tiers add up once you start building real workflows. There is another option: run your own AI locally or on a cheap VPS, with a ChatGPT-style interface, for $0 to $5 a month. No API keys. No usage limits. No data leaving your machine.

The stack is Ollama for running models and Open WebUI for the browser interface. Ollama has hit 52 million monthly downloads as of Q1 2026; this is not experimental software anymore. Open WebUI gives you a polished chat interface with conversation history, model switching, document upload, and multi-user support.

This guide covers two paths: a 5-minute local setup for your Mac or Linux machine, and a VPS deployment on Hetzner for when you want 24/7 availability without running your laptop all day. We will be honest about what works well, what does not, and when you should just use an API instead.

Why Self-Host Your Own AI in 2026

Before diving into setup, it helps to understand why self-hosting has become practical this year, and when it actually makes sense.

The Cost Argument

API subscriptions and metered tiers in early 2026 add up quickly once usage grows. Self-hosting on a consumer GPU like an RTX 4090 runs roughly $30-80/month in electricity after the initial hardware purchase. On a local Mac with Apple Silicon, power consumption is under $15/month. On a budget VPS, you are looking at $5-8/month for CPU-only inference.

The catch: self-hosted models are generally smaller and less capable than frontier API models. You are not replacing GPT-5 with a 7B model running on a $4 VPS. You are replacing it for specific tasks where a smaller model is good enough: drafting, summarization, code completion, local RAG, and casual conversation.

The Privacy Argument

Every prompt you send to an API leaves your network. For some workflows (medical notes, client data, proprietary code, legal documents) that is a non-starter regardless of the provider's privacy policy. Self-hosted inference keeps everything local. Your prompts never leave your machine or your VPS.
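The electricity estimates in the cost argument above are easy to sanity-check. This is a minimal sketch; the wattage and tariff figures are illustrative assumptions, not measurements from this guide.

```python
# Rough monthly electricity cost for self-hosted inference.
# Wattage and $/kWh values below are assumed for illustration.

def monthly_power_cost(watts: float, hours_per_day: float, usd_per_kwh: float) -> float:
    """Approximate monthly electricity cost in USD (30-day month)."""
    kwh_per_month = (watts / 1000) * hours_per_day * 30
    return kwh_per_month * usd_per_kwh

# An RTX 4090 box drawing ~350 W for 8 h/day at $0.30/kWh:
print(round(monthly_power_cost(350, 8, 0.30), 2))  # 25.2
# An Apple Silicon Mac drawing ~40 W for the same schedule:
print(round(monthly_power_cost(40, 8, 0.30), 2))   # 2.88
```

Run the GPU box 24/7 and the number climbs toward the top of the $30-80 range; the Mac stays well under $15 either way.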
This is not a theoretical benefit: it is a compliance requirement for many teams.

The Learning Argument

Understanding how LLM inference actually works (model loading, quantization, context windows, memory management) makes you a better AI engineer. Self-hosting forces that understanding in a way that API calls never will. We covered a similar philosophy in our guide to self-hosting your entire dev stack for under $20/month. Ollama fits perfectly into that same infrastructure-as-education mindset.

What Is Ollama + Open WebUI

Ollama: The Model Runtime

Ollama is an open-source tool that makes running LLMs locally as simple as ollama run llama3. It handles model downloading, quantization, GPU/CPU allocation, and exposes an OpenAI-compatible API at localhost:11434. Key facts about Ollama in 2026 are collected in the reference lists at the end of this guide.

Open WebUI: The Chat Interface

Open WebUI is a self-hosted web interface that connects to Ollama (or any OpenAI-compatible API) and gives you a ChatGPT-like experience in your browser. Together, Ollama + Open WebUI give you a private, self-hosted ChatGPT alternative that you fully control.

Path A: Local Setup on Mac/Linux (5-Minute Quickstart)

This is the fastest way to get running. No Docker required, no server configuration, no monthly bill.

Step 1: Install Ollama

Download from ollama.com, or use Homebrew on macOS or the one-line install script on Linux (exact commands are in the command reference at the end of this guide).

Step 2: Pull and Run a Model

Pull a model once, then run it interactively. That is it. You now have a local LLM running in your terminal. Type a prompt, get a response.

Step 3: Start the Ollama Server

For Open WebUI to connect, Ollama needs to run as a background server: ollama serve starts the API on http://localhost:11434. You can verify it works by querying http://localhost:11434/api/tags.

Step 4: Install Open WebUI

The simplest method is Docker (the full docker run command is in the command reference). Open http://localhost:3000 in your browser. Create an account (the first account becomes admin), and you will see your Ollama models ready to chat with.

Hardware Requirements for Local

How much can your machine handle? Rule of thumb: roughly 0.5 GB of VRAM per billion parameters with 4-bit quantization. Full precision (FP16) doubles that requirement. Apple Silicon Macs are particularly good for local LLM work because they share unified memory between CPU and GPU. An M2 Pro with 32 GB can comfortably run 32B models. An M4 Max with 128 GB can handle 70B models at 12 tokens/second.
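The memory rule of thumb above translates directly into a one-line estimator. A minimal sketch (the function name is ours):

```python
# Back-of-envelope memory estimate from the rule of thumb above:
# ~0.5 GB per billion parameters at 4-bit quantization; FP16 doubles it.

def estimate_vram_gb(params_billion: float, fp16: bool = False) -> float:
    gb = 0.5 * params_billion
    return gb * 2 if fp16 else gb

print(estimate_vram_gb(8))              # 4.0  (8B model, 4-bit)
print(estimate_vram_gb(8, fp16=True))   # 8.0  (same model at FP16)
print(estimate_vram_gb(70))             # 35.0 (why 70B wants 40+ GB machines)
```

Real usage adds a couple of gigabytes for the KV cache and runtime overhead, so treat these as lower bounds.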
Path B: VPS Deployment on Hetzner (~$5/month)

Not everyone wants to keep a laptop running 24/7. A VPS gives you always-on access from any device: your phone, a tablet, any browser. The trade-off: VPS servers at this price point have no GPU, so inference is CPU-only. This means smaller models and slower generation. But for many use cases (quick questions, writing assistance, code review, document summarization) a 3B or 7B model on CPU is perfectly usable.

Why Hetzner

We use Hetzner for most of our self-hosted infrastructure at Effloow, as we described in our self-hosting dev stack guide. The reasons are the same here: flat monthly pricing and EU data centers (the CX23 and CX33 specs are listed at the end of this guide). For the full Hetzner server lineup including GPU options, see our Hetzner Cloud for AI Projects guide.

The CX23 can run a 3B model with CPU inference. The CX33 handles 7-8B models. For larger models, you would need a dedicated server with more RAM, which pushes the cost above $20/month. If you want to run these containers alongside other services (Gitea, Coolify, monitoring), check our comparison of Coolify vs Dokploy for managing deployments on a single server.

VPS Setup with Docker Compose

SSH into your Hetzner server, create a project directory, and add the docker-compose.yml from the command reference at the end of this guide. Then pull a model appropriate for your VPS: a 3B model for the CX23, a 7B model for the CX33.

Setting Up HTTPS with a Reverse Proxy

For remote access, you need HTTPS. The simplest approach is Caddy: install it, point chat.yourdomain.com at localhost:3000 in /etc/caddy/Caddyfile, and reload. Caddy handles SSL certificates automatically. Your Open WebUI is now accessible at https://chat.yourdomain.com.

VPS Performance Expectations

Be realistic about what CPU inference delivers: single-digit tokens per second, not the 30-50 tok/s you get with a local GPU. This is fine for asynchronous workflows: ask a question, do something else, come back to the answer. It is not great for rapid-fire interactive chat.

Model Recommendations by Use Case

Choosing the right model matters more than choosing the right hardware. Our picks for coding, writing, and multilingual work as of early 2026 are listed at the end of this guide. All models listed use permissive licenses (Apache 2.0, MIT, or the Llama Community License) that allow commercial use.
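Once the stack from Path B is up, you can script a quick health check instead of curling by hand. A minimal sketch: it assumes the response shape {"models": [{"name": ...}, ...]} that recent Ollama versions return from /api/tags; the helper name is ours.

```python
# List installed models from Ollama's /api/tags response.
# Assumes the {"models": [{"name": ...}]} shape of recent Ollama versions.

def model_names(tags: dict) -> list[str]:
    return [m["name"] for m in tags.get("models", [])]

# Live check (requires a running server; uncomment to use):
# import json, urllib.request
# tags = json.load(urllib.request.urlopen("http://localhost:11434/api/tags"))
# print(model_names(tags))

sample = {"models": [{"name": "llama3.2:3b"}, {"name": "qwen2.5-coder:7b"}]}
print(model_names(sample))  # ['llama3.2:3b', 'qwen2.5-coder:7b']
```

One caution: the compose file publishes port 11434; on a public VPS, firewall it so only Open WebUI (and you) can reach the Ollama API.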
Open WebUI Features Worth Configuring

Once your stack is running, these settings improve the experience significantly.

System Prompts

Set default system prompts per model via Settings > Models. This lets you configure a coding assistant persona for your coding model and a writing assistant persona for your writing model.

Document Upload (RAG)

Open WebUI supports uploading PDFs, text files, and other documents for retrieval-augmented generation. Upload a document, and the model can answer questions about its contents. This works well with models 7B and above.

Multi-User Access

If you are deploying on a VPS for a small team, Open WebUI supports multiple user accounts with role-based access. The first registered user becomes admin and can invite others.

API Access

Open WebUI also exposes its own API, letting you programmatically interact with your models from scripts, CI pipelines, or other tools. Pair it with a self-hosted automation platform like n8n and you can build AI-powered workflows entirely on your own infrastructure; we compare the options in our Zapier vs Make vs n8n guide.

Performance Expectations: An Honest Assessment

Self-hosted LLMs have real limitations. What works well, what does not, and representative throughput numbers are listed at the end of this guide. The short version: Ollama is not a production inference server. It is a personal/dev tool. If you need multi-user production inference, look at vLLM, TGI, or managed services.

When to Use API Services Instead

Self-hosting is not always the right choice. Use API services when you need frontier intelligence, high concurrency, or guaranteed uptime, or when your volume is too low to justify the infrastructure (the full checklist is at the end of this guide). The smart approach for many teams is hybrid: self-host for privacy-sensitive tasks and high-volume simple queries, use APIs for complex reasoning and peak demand. We explored this infrastructure thinking in our article about how we built Effloow with 14 AI agents; we use a mix of API and self-hosted tools depending on the task.

Our Experience at Effloow

We run Ollama internally for specific workflows: draft generation, code review assistance, and local RAG (details at the end of this guide). For production content that needs to be high quality, like the articles on this site, we still use frontier API models. The quality gap between a self-hosted 7B and Claude Opus 4 is real and significant for long-form writing. The setup runs alongside our other self-hosted tools.
Quick Reference: Getting Started in Under 10 Minutes

If you are already running a VPS with Coolify or Dokploy (as described in our Coolify vs Dokploy comparison), adding Ollama is just another Docker Compose service. The step-by-step commands for both the local and VPS paths are collected in the command reference below. Total time: 5-10 minutes. Total ongoing cost: $0 (local) or ~$5/month (VPS).

Conclusion

Self-hosting your own AI with Ollama and Open WebUI is no longer a weekend hack project. It is a legitimate, stable option for developers and small teams who want privacy, cost control, and the educational value of understanding how LLM inference works. The stack is simple: Ollama runs models, Open WebUI provides the interface. Five minutes of setup on a local machine, or a Docker Compose file on a cheap VPS.

Will it replace ChatGPT Pro or Claude for complex work? No. But for the 80% of AI queries that do not need frontier intelligence (quick questions, drafting, code review, document analysis) it is free, private, and entirely under your control. Start with a 3B model on whatever hardware you have. If you find yourself using it daily, upgrade to a bigger model or a dedicated VPS. The infrastructure scales with your needs, not your credit card.

Command Reference

Step 1, install Ollama:

```shell
# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh
```

Step 2, pull and run a model:

```shell
# Pull a model (one-time download)
ollama pull llama3.2:8b

# Run it interactively
ollama run llama3.2:8b
```

Step 3, start the server and verify it responds:

```shell
ollama serve
curl http://localhost:11434/api/tags
```

Step 4, run Open WebUI:

```shell
docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
```

VPS setup, project directory:

```shell
mkdir -p ~/ollama-stack && cd ~/ollama-stack
```

docker-compose.yml:

```yaml
version: "3.8"
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_NUM_PARALLEL=2
      - OLLAMA_MAX_LOADED_MODELS=1
      - OLLAMA_KEEP_ALIVE=10m
    # Uncomment the deploy section if your VPS has a GPU
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    volumes:
      - open_webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_AUTH=true
      - WEBUI_SECRET_KEY=change-this-to-a-random-string
    depends_on:
      - ollama
volumes:
  ollama_data:
  open_webui_data:
```

Start the stack and pull a model sized for your plan:

```shell
docker compose up -d

# For CX23 (4 GB RAM), use a 3B model
docker exec -it ollama ollama pull llama3.2:3b

# For CX33 (8 GB RAM), you can try a 7B model
docker exec -it ollama ollama pull llama3.2:8b
```

HTTPS with Caddy:

```shell
sudo apt install -y caddy
```

/etc/caddy/Caddyfile:

```
chat.yourdomain.com {
    reverse_proxy localhost:3000
}
```

```shell
sudo systemctl reload caddy
```

Quick start, local:

```shell
# 1. Install Ollama
brew install ollama                              # macOS
# curl -fsSL https://ollama.com/install.sh | sh  # Linux

# 2. Pull a model
ollama pull llama3.2:8b

# 3. Start the server
ollama serve

# 4. Run Open WebUI
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

# 5. Open http://localhost:3000
```

Quick start, VPS:

```shell
# 1. SSH into your server
ssh root@your-server-ip

# 2. Install Docker
curl -fsSL https://get.docker.com | sh

# 3. Create docker-compose.yml (see VPS section above)

# 4. Start the stack
docker compose up -d

# 5. Pull a model
docker exec -it ollama ollama pull llama3.2:3b
```

Reference Lists

Key facts about Ollama in 2026:

- 52 million monthly downloads (Q1 2026), 520x growth from Q1 2023
- 135,000+ GGUF models available on HuggingFace
- Runs on macOS (Apple Silicon), Linux, and Windows
- Exposes an OpenAI-compatible API, so existing code that talks to OpenAI can point at Ollama instead
- Default limit of ~4 parallel requests; designed for personal/small-team use, not production multi-user deployments

Open WebUI features:

- Chat interface with conversation history and search
- Model switching (swap between models mid-conversation)
- Document upload and RAG (retrieval-augmented generation)
- Multi-user support with role-based access
- Prompt templates and system message customization
- Mobile-friendly responsive design
- Local file and image analysis

Hetzner plans and why we use them:

- CX23: 2 vCPU, 4 GB RAM, 40 GB SSD, €3.99/month (~$4.99/month)
- CX33: 4 vCPU, 8 GB RAM, 80 GB SSD, €6.49/month (~$8.09/month)
- EU data centers (Falkenstein, Helsinki) for GDPR compliance
- Flat monthly pricing with no bandwidth surprises

Recommended models for coding:

- Qwen 2.5 Coder 7B: best balance of code quality and resource usage. Handles Python, JavaScript, TypeScript, Go, and Rust well. If you are comparing self-hosted coding models against paid alternatives, see our AI coding tools pricing breakdown for the full cost picture.
- DeepSeek Coder V2 (distilled): strong at multi-file reasoning and debugging. Needs more RAM.
- Llama 3.2 8B: decent general coding, but specialized coding models outperform it.

Recommended models for writing and chat:

- Llama 3.3 70B: best open-source general model if you have the hardware (40+ GB RAM).
- Qwen 2.5 32B: excellent writing quality, 83.2% MMLU score. Needs 16-20 GB.
- Gemma 2 9B: surprisingly good writing quality for its size. Runs on 6 GB.
- Llama 3.2 3B: solid general capability in a tiny package. Best first model to try.
- Qwen 3.5 7B: 76.8% MMLU, 3x faster than the 32B variant. Great quality-per-watt.
- Phi-4 14B: Microsoft's efficient model. Good for development workflows if you have 10 GB.
Recommended models for multilingual use:

- Qwen 2.5 series: supports 29+ languages natively. Best option for non-English work.

What works well:

- Personal assistant for writing, brainstorming, and summarization: a 7-8B model handles these tasks surprisingly well.
- Code completion and review: specialized coding models match or beat older GPT-3.5-level performance.
- Private document analysis: upload sensitive documents and query them without data leaving your network.
- Learning and experimentation: try different models, fine-tune for specific tasks, understand how LLMs work.

What does not work well:

- Complex multi-step reasoning: smaller models struggle with tasks that GPT-5 or Claude Opus handle easily.
- Multi-user production deployments: Ollama caps at ~4 parallel requests by default. It is designed for personal or small-team use.
- Speed-critical applications: CPU inference on a VPS delivers single-digit tokens per second. GPU inference on consumer hardware delivers 30-50 tok/s. API services deliver hundreds of tokens per second.
- Long context windows: most open models cap at 8K-32K tokens. Frontier API models offer 128K-200K+ tokens.

Throughput numbers, from recent benchmarks:

- Ollama: ~41 tokens/second (single user, GPU)
- llama.cpp (CPU): ~80 tokens/second (optimized build, good CPU)
- vLLM: ~793 tokens/second (production deployment, A100 GPU)

When to use API services instead:

- You need frontier intelligence. GPT-5, Claude Opus 4, and Gemini 2.5 Pro are significantly more capable than any model you can self-host. For complex reasoning, creative work, or tasks where quality is non-negotiable, APIs win.
- Your volume is low. Below ~2 million tokens per day, API services are almost always cheaper than self-hosting once you factor in infrastructure, maintenance, and your time.
- You need high concurrency. If multiple people need simultaneous access with low latency, API services handle this natively. Self-hosted Ollama does not.
- Uptime matters. API providers offer 99.9%+ uptime with automatic failover. Your Hetzner VPS does not.
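The throughput numbers above are easier to feel in practical terms: how long does a 1,000-token answer take at each published rate? A minimal sketch using the benchmark figures quoted in this guide:

```python
# Seconds to generate a 1,000-token answer at each published rate
# (rates taken from the benchmark list above).

RATES = {"ollama_gpu": 41, "llama_cpp_cpu": 80, "vllm_a100": 793}

def seconds_for(tokens: int, tok_per_s: float) -> float:
    return tokens / tok_per_s

for name, rate in RATES.items():
    print(name, round(seconds_for(1000, rate), 1))
# ollama_gpu 24.4, llama_cpp_cpu 12.5, vllm_a100 1.3
```

The roughly 19x gap between Ollama and vLLM is why Ollama is a personal tool and vLLM is a production server.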
Our Ollama workflows at Effloow:

- Draft generation: first drafts of articles and documentation where privacy is not critical but cost adds up at volume.
- Code review assistance: quick code reviews and refactoring suggestions where a 7B coding model is sufficient.
- Local RAG: querying internal documents without sending proprietary content to external APIs.