How I Built a Completely Free Local AI Stack — Inspired by a 60-Second YouTube Short

First Question: Wait, Is This Actually Free?

So What Exactly Is Ollama?

Downloading Your First Model — Which One?

The Architecture in Plain English

Context Windows — What Are They and Why Do They Matter?

Can Local Models Search the Internet?

Connecting Claude Code to Gemma4

The Python API Test

Setting Up Open WebUI — The ChatGPT-Like Interface

Installing Docker Desktop

Running Open WebUI

First Real Test — Image to Text Extraction

Document Upload and RAG — How It Actually Works

The Full Stack — What I Now Have Running

When to Use What

Honest Reflections

By Pranaychandra Ravi

It started with a YouTube Short. Someone on my feed casually demonstrated connecting a local AI model to Claude Code, and I stopped mid-scroll. No API key. No subscription. No code leaving their machine. I had to know how it worked.

What followed was a deep dive into local AI — Ollama, Gemma4, Docker, Open WebUI, vector databases, context windows, and a Python script that made my local model generate an ASCII diagram of the Earth and Moon. This post documents everything I learned, every question I asked, and every mistake I made along the way. If you're curious about running AI entirely on your own hardware, this one is for you.

First Question: Wait, Is This Actually Free?

My first instinct was skepticism. Claude Code is Anthropic's product. Surely using it requires a Claude subscription? The short answer is no — not when you pair it with Ollama and a local model. Here's what I learned:

Claude Code is the agent — the tool that reads your files, runs commands, edits code, and manages multi-step tasks in your terminal. By default it calls Anthropic's API, which costs money. But Claude Code exposes environment variables that let you redirect those API calls anywhere you want — including a local Ollama server running on your own machine.

Ollama added official support for Anthropic's Messages API format, meaning Claude Code can talk to it natively. No hacks, no middleware, no subscription. The only cost is your own electricity and hardware.

So What Exactly Is Ollama?

Before I could set anything up I needed to understand what Ollama actually is, because "install Ollama" doesn't tell you much. Think of Ollama as two things in one:

1. A model manager — it downloads, stores, and organizes AI models on your machine. Like a package manager, but for AI brains.
2. A local API server — once running, it exposes an endpoint at http://localhost:11434 that any application can call. Your code, Claude Code, Open WebUI, VS Code extensions — anything that speaks the Anthropic or OpenAI API format can connect to it.
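
Anything that can make an HTTP request can also ask Ollama what's installed: it exposes a GET /api/tags endpoint that lists local models. Here's a small sketch of parsing that response; the sample JSON below is illustrative of the response shape, not a capture from a real machine.

```python
import json

# Illustrative /api/tags-style response: a "models" array whose entries
# carry a "name" field (real responses include more metadata).
sample_response = json.loads("""
{"models": [
  {"name": "gemma4:latest", "size": 12000000000},
  {"name": "llama3:8b", "size": 4700000000}
]}
""")

def list_models(tags_response):
    """Extract model names from an /api/tags-style response dict."""
    return [m["name"] for m in tags_response.get("models", [])]

print(list_models(sample_response))  # → ['gemma4:latest', 'llama3:8b']
```

In a live setup you would fetch the JSON from http://localhost:11434/api/tags instead of hard-coding it.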

This is the key insight I kept coming back to: Ollama itself has no intelligence. It's an empty engine. You have to download a model — a large file containing all the AI's weights and knowledge — before anything useful happens.

Downloading Your First Model — Which One?

This is where hardware matters. My machine has an NVIDIA card, so Ollama automatically uses CUDA — no setup needed. The GPU handles inference, and it's dramatically faster than CPU-only. The key concept here is VRAM vs RAM: if the model fits in VRAM, the GPU handles everything and responses are fast; if it's too big, it spills into system RAM and everything slows down. With 11GB of VRAM I can fit most 7B–13B parameter models entirely in GPU memory, which means fast, snappy responses.

After thinking through my use cases — coding help, image analysis, document review — I landed on Gemma4 (Google's multimodal model, ~12GB). Here's why it beat out alternatives like Qwen3.6 (28GB): my use cases included image-to-text extraction and converting images to coloring pages, and Qwen3.6 can't do either because it's text-only. Gemma4 won.

Downloading is one command: ollama pull gemma4. It downloads, verifies, and stores the model, and you can watch the progress in the terminal.

The Architecture in Plain English

Before going further, I want to share the mental model that made everything click for me: Claude Code in the terminal, Open WebUI in the browser, and Python scripts all talk to the same Ollama API, which runs the model. Three different interfaces. One local model. Everything private.

Context Windows — What Are They and Why Do They Matter?

One of the most important concepts I clarified was the context window — the model's working memory. It's the maximum amount of text a model can "see" at once in a conversation. Exceed it and it starts forgetting the beginning. Your VRAM directly affects how large a context window your local model can hold: more VRAM means more of the model loaded and a bigger context available. You can manually increase it with ollama run gemma4 --ctx-size 32768.

For single documents, images, or focused coding tasks, that's perfectly fine. For analyzing six years of tax filings all at once? That's where Claude's 200k context is a genuine advantage local models can't match yet.

Can Local Models Search the Internet?

Short answer: no, not by default. Local models are frozen at their training date. They have no internet connection during your conversation.
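
Both constraints above (does the model fit in VRAM, and does the document fit in the context window) come down to simple arithmetic. This is a rough sketch, not Ollama's actual logic; the 1.2x overhead factor and the 4-characters-per-token ratio are ballpark assumptions.

```python
def weight_size_gb(params_billions, bits_per_weight):
    """Approximate size of the model weights alone, in GB."""
    return params_billions * bits_per_weight / 8

def fits_in_vram(params_billions, bits_per_weight, vram_gb, overhead=1.2):
    # overhead is a rough allowance for KV cache and activations;
    # real headroom depends on context size and the runtime.
    return weight_size_gb(params_billions, bits_per_weight) * overhead <= vram_gb

def fits_in_context(text, ctx_size, reserve=1024):
    # ~4 characters per token is a common English-text heuristic;
    # reserve leaves room for the model's reply.
    return len(text) // 4 + reserve <= ctx_size

print(fits_in_vram(12, 4, 11))                 # True: ~6GB of 4-bit weights fit in 11GB
print(fits_in_vram(28, 8, 11))                 # False: ~28GB spills into system RAM
print(fits_in_context("x" * 100_000, 32_768))  # True: ~25k tokens fit in a 32k window
```

The same arithmetic explains the six-years-of-tax-filings case: a document stack in the hundreds of thousands of tokens simply needs a 200k-class context.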

That offline nature was an important distinction to understand, and it raised an interesting follow-up question. When I used Gemini to analyze my tax filing and it spotted mistakes — was it searching the internet to find them? No. And this was a real misconception I had. Gemini found tax errors because tax law, IRS rules, and common filing mistakes were baked into the model during training. It learned from millions of tax documents, accounting textbooks, and IRS publications. During your session it's not googling anything — it's applying trained knowledge to your specific document.

Think of it like a tax accountant. They studied tax law for years. When reviewing your return they're not searching Google — they're applying what they already know to what you show them. Local models work the same way. The difference: Gemini and Claude have more recent training data and a larger knowledge base, while local Gemma4 has good foundational knowledge and may be slightly behind on very recent rule changes, but your documents never leave your machine. For sensitive financial documents, that privacy trade-off is significant.

Connecting Claude Code to Gemma4

This was surprisingly simple. Claude Code reads three environment variables (ANTHROPIC_AUTH_TOKEN, ANTHROPIC_API_KEY, and ANTHROPIC_BASE_URL), and pointing the base URL at http://localhost:11434 redirects everything to Ollama. Alternatively, Ollama's built-in launcher (ollama launch claude) does the setup for you. When Claude Code started up, the bottom of the welcome screen listed gemma4 as the model in use. That confirms it's using Gemma4 through Ollama. No Anthropic billing. No subscription.

What you get with this setup: file reading and editing across your project, terminal command execution, multi-step agentic coding tasks, Git operations, MCP connectors and plugins, and project context awareness, with intelligence capped at Gemma4's capability (weaker than Claude Sonnet or Opus).

The Python API Test

Before setting up a GUI I wanted to confirm the raw API worked, so I wrote a short script that POSTs a prompt to http://localhost:11434/api/generate. Gemma4, running entirely on my machine, responded to a Python script and even drew an ASCII diagram of the Moon and Earth. No API key. No internet. Completely local. This was the moment it really clicked.

Setting Up Open WebUI — The ChatGPT-Like Interface

For a proper GUI I went with Open WebUI — a beautiful, feature-rich interface that runs locally and connects to Ollama. My first attempt, using pip, failed because I had Python 3.13 and Open WebUI requires Python 3.11 or 3.12. So I went the Docker route instead.

Installing Docker Desktop

Docker Desktop is free for personal use; download it from docker.com/products/docker-desktop. During install, the WSL 2 backend gets configured automatically on Windows.

Running Open WebUI

I initially tried -p 3000:80, which caused a port conflict (another process was using port 3000 on my machine).
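
One way to spot this kind of conflict before starting the container is a quick socket probe. This is a minimal sketch, not something Open WebUI or Docker provides.

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds,
        # i.e. when a listener already occupies the port.
        return s.connect_ex((host, port)) == 0

# Check the host port before picking it for docker's -p flag:
if port_in_use(3000):
    print("Port 3000 is taken; choose another host port")
```

On Windows, netstat -ano | findstr :3000 gives the same answer plus the PID of the process holding the port.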
Switching to -p 127.0.0.1:3000:8080 fixed it. I confirmed it was running with netstat (port 3000 listening) and curl (a 200 OK), then opened http://localhost:3000 in Chrome and saw the Open WebUI interface with Gemma4 auto-detected.

First Real Test — Image to Text Extraction

One of the reasons I picked Gemma4 over Qwen3.6 was its multimodal capability — it can actually see images. I put this to the test immediately. I had a photo of handwritten chess notes and uploaded it directly into the Open WebUI chat. The prompt was simple: "convert this image to text". Gemma4 thought for 11 seconds and returned a perfect transcription of the handwritten text — extracted entirely locally, no cloud OCR service, no API key, nothing leaving my machine. It even generated a relevant follow-up suggestion: "Are there other kinds of tactical attacks besides forks, like pins or skewers?"

This is the multimodal capability in action: handwritten text extracted accurately, the context understood, an intelligent follow-up suggested, and 100% local, since the image never left my PC. For anyone with scanned documents, handwritten notes, receipts, or any image containing text, this works out of the box with Gemma4 in Open WebUI.

Document Upload and RAG — How It Actually Works

One of the most powerful features of Open WebUI is document upload with RAG (Retrieval Augmented Generation). This is how you can upload your AWS docs, tax returns, or any PDFs and chat with them. Under the hood, Open WebUI splits each upload into chunks, converts the chunks to embeddings (mathematical vectors), and stores them in ChromaDB, a local vector database. When you ask a question, ChromaDB finds the most relevant chunks and sends them to Gemma4 as context, and Gemma4 answers based on your document.

Everything is stored locally: the embeddings, the original files, and the chat history. Your documents never leave your machine, and ChromaDB is completely free and open source.

One important limitation: RAG finds relevant chunks, not the entire document. If an answer spans many sections of a large document, it might miss some context. The workaround is to upload smaller, focused documents rather than one giant PDF.

The Full Stack — What I Now Have Running

Ollama as the model manager and local API server, Gemma4 as the model, Claude Code for agentic coding, Open WebUI for browser-based chat with document upload, and Python scripts calling the model directly. Total monthly cost: $0.

When to Use What

After going through all of this, here's the practical split I settled on: local models for private, repetitive, or experimental work; cloud models for research, heavier reasoning, and anything that needs a huge context window.

Honest Reflections

What surprised me: how straightforward the setup actually was once I understood the mental model. Ollama is the server, the model is the brain, and everything else just connects to it.

What I underestimated: the quality gap between local models and Claude Sonnet/Opus is real. For simple tasks Gemma4 is impressive.
For complex multi-step reasoning, Claude's frontier models are noticeably stronger.

What I'd tell myself at the start: local AI is not a replacement for cloud AI — it's a complement. Use local models for private, repetitive, or experimental tasks. Use cloud AI for research, complex reasoning, and anything that benefits from a larger context window.

The privacy win is real: for sensitive documents — financial records, personal data, proprietary code — local AI is genuinely better from a privacy standpoint. Your data does not leave your machine. Full stop.

All of this runs on a Windows machine with 32GB RAM, an NVIDIA GPU with ~11GB VRAM, and a Core i9 processor. If you have similar hardware, you can replicate this entire stack in an afternoon.
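
To make the RAG pipeline described earlier concrete, here is a toy version of the retrieve step. This is not Open WebUI's actual code: word overlap stands in for real embedding similarity, and the chunker is deliberately naive.

```python
def chunk(text, size=40):
    """Split text into chunks of roughly `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def top_chunks(question, chunks, k=1):
    """Rank chunks by word overlap with the question (a crude stand-in
    for embedding similarity) and return the best k."""
    q = set(question.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(q & set(c.lower().split())),
                    reverse=True)
    return scored[:k]

# A toy "document": only its second half mentions embeddings.
doc = ("Ollama serves models over a local API. " * 10 +
       "ChromaDB stores document embeddings for retrieval. " * 10)
best = top_chunks("where are embeddings stored", chunk(doc))[0]
print("embeddings" in best)  # True
```

The real pipeline replaces the overlap score with vector distance in ChromaDB, which is why it can match on meaning rather than exact words, and why it still only ever sees the top chunks rather than the whole document.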


Appendix: Commands, Scripts, and Outputs

How the pieces connect:

```
Claude Code → talks to → Ollama (local server) → runs → Your model
(no Anthropic servers involved)
```

Why the model matters:

```
Without a model: Ollama = empty server, useless
With a model:    Ollama = fully local AI, free forever
```

VRAM vs RAM:

```
Model fits in VRAM     → GPU handles everything  → Very fast ✅
Model too big for VRAM → spills into system RAM  → Slower ⚠️
```

Pulling the model:

```
ollama pull gemma4
```

The architecture:

```
┌─────────────────────────────────────────────────┐
│                  YOUR COMPUTER                  │
│                                                 │
│  ┌─────────────┐      ┌──────────────┐          │
│  │ Claude Code │─────▶│    Ollama    │          │
│  │ (terminal)  │      │ :11434 (API) │          │
│  └─────────────┘      └──────┬───────┘          │
│                              │                  │
│  ┌─────────────┐      ┌──────▼───────┐          │
│  │ Open WebUI  │─────▶│    Gemma4    │          │
│  │ (browser)   │      │ (the brain)  │          │
│  └─────────────┘      └──────────────┘          │
│                                                 │
│  ┌─────────────┐                                │
│  │ Python API  │─────▶ http://localhost:11434   │
│  │   scripts   │                                │
│  └─────────────┘                                │
└─────────────────────────────────────────────────┘
          Zero data leaves your machine
```

Increasing the context window:

```
ollama run gemma4 --ctx-size 32768
```

Web access comparison:

```
Claude (this chat) → Has web search tool → Knows current events ✅
Gemma4 (local)     → No internet → Knowledge frozen at training ❌
```

Environment variables for Claude Code:

```
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY=""
export ANTHROPIC_BASE_URL=http://localhost:11434
```

Or using Ollama's built-in launcher:

```
ollama launch claude
```

Claude Code's welcome screen confirming the local model:

```
gemma4 · API Usage Billing · [email protected]'s Organization
```

The Python API test script:

```python
import requests

def chat(prompt):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "gemma4", "prompt": prompt, "stream": False}
    )
    return response.json()["response"]

print(chat("Write a hello world in ascii diagram of moon and earth"))
```

(Gemma4's reply: an ASCII diagram of the Moon and Earth with an orbit path.)

The pip install failure on Python 3.13:

```
ERROR: Could not find a version that satisfies the requirement open-webui
```

Running Open WebUI in Docker (PowerShell):

```
docker run -d `
  -p 127.0.0.1:3000:8080 `
  --name open-webui `
  -v open-webui:/app/backend/data `
  --add-host=host.docker.internal:host-gateway `
  ghcr.io/open-webui/open-webui:main
```

Confirming it was running:

```
netstat -ano | findstr :3000
# TCP 0.0.0.0:3000 LISTENING ← Docker up and running

curl http://localhost:3000
# StatusCode: 200 OK ← Server responding
```

Gemma4's transcription of the handwritten chess notes:

```
FORK/DOUBLE ATTACK
When we attack two or more pieces at the same time
then it is known as fork or double attack
Note- Knights are good at making fork.
```

The RAG pipeline:

```
You upload PDF
  ↓
Open WebUI splits it into chunks
  ↓
Converts chunks to embeddings (mathematical vectors)
  ↓
Stores in ChromaDB (local vector database)
  ↓
You ask a question
  ↓
ChromaDB finds the most relevant chunks
  ↓
Sends chunks to Gemma4 as context
  ↓
Gemma4 answers based on YOUR document
```

Where everything is stored locally:

```
C:\Users\lavan\AppData\Roaming\open-webui\data\
  📁 vector_db ← document embeddings (ChromaDB)
  📁 uploads   ← original files
  📄 webui.db  ← chat history (SQLite)
```

The full stack:

- ✅ Ollama — model manager and local API server
- ✅ Gemma4 — the AI model (multimodal, ~12GB)
- ✅ Claude Code — agentic coding with local model
- ✅ Open WebUI — browser-based chat interface with document upload
- ✅ Python API — scripts calling the model directly

Hardware:

- NVIDIA GPU with ~11GB VRAM
- Core i9 processor

Cloud vs local knowledge:

- Gemini/Claude: more recent training data, larger knowledge base, up-to-date tax law changes
- Gemma4 local: good foundational knowledge, may be slightly behind on very recent rule changes, but your documents never leave your machine

What Claude Code can do with a local model:

- ✅ File reading and editing across your project
- ✅ Terminal command execution
- ✅ Multi-step agentic coding tasks
- ✅ Git operations
- ✅ MCP connectors and plugins
- ✅ Project context awareness
- ⚠️ Intelligence capped at Gemma4's capability (weaker than Claude Sonnet/Opus)

The image-to-text result:

- ✅ Handwritten text extracted accurately
- ✅ Context understood (chess notes)
- ✅ Intelligent follow-up suggested
- ✅ 100% local — image never left my PC

Resources:

- Ollama: ollama.com
- Open WebUI: openwebui.com
- Claude Code: claude.ai/code
- Ollama + Claude Code docs: docs.ollama.com/integrations/claude-code
- Docker Desktop (free): docker.com/products/docker-desktop
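
As a final API note: besides the /api/generate endpoint used in the test script above, Ollama also exposes /api/chat, which takes a list of role/content messages for multi-turn conversations. This sketch only builds the request body; sending it requires a running Ollama server at localhost:11434.

```python
def make_chat_payload(model, history, user_prompt):
    """Assemble an /api/chat request body: prior turns plus the new
    user message, in the role/content format Ollama accepts."""
    messages = list(history) + [{"role": "user", "content": user_prompt}]
    return {"model": model, "messages": messages, "stream": False}

history = [
    {"role": "user", "content": "What is a context window?"},
    {"role": "assistant", "content": "The model's working memory."},
]
payload = make_chat_payload("gemma4", history, "How do I enlarge it?")
print(payload["messages"][-1]["content"])  # → How do I enlarge it?
```

Keeping the history list yourself is what gives a local model conversational memory; each request must carry every turn you want it to remember, which is exactly where the context window limit bites.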