Weekend Project: Run a Local LLM for Coding (Zero Cloud, Zero API Keys)

I spent last weekend ditching cloud AI for coding. No more API rate limits, no more sending proprietary code to external servers, no more surprise bills. Just a local LLM running on my machine, integrated with my editor. Here's exactly how to set it up in an afternoon.

Why Local LLMs for Coding?

Three reasons I made the switch: privacy, cost, and speed (the full breakdown is in the list below). The trade-off? You need decent hardware, and the models aren't quite GPT-4 level. But for code completion, refactoring, and explaining code? They're surprisingly good.

What You'll Need

The hardware checklist (RAM, GPU, storage, OS) is collected below. No GPU? CPU inference works fine, just slower. I ran this on a 2-year-old laptop with no dedicated GPU and it was usable.

Step 1: Install Ollama

Ollama is the easiest way to run local LLMs. One binary, no Python environment hell. Install it with the commands below and start the service with ollama serve. That's it. Ollama runs as a local API server on port 11434.

Step 2: Pull a Coding Model

Not all models are equal for code. The pull commands below cover what actually works. I use deepseek-coder:6.7b daily; it handles Python, TypeScript, Go, and Rust well. For quick completions, starcoder2:3b is snappier. Smoke-test the install by running a one-off prompt with ollama run.

Step 3: Editor Integration

VS Code with Continue

Continue is my pick. Open source, actively maintained, works offline. Install the extension, point it at your Ollama models (setup steps and config below), and you get chat with code context, tab completions, and inline edits.

Neovim with Ollama.nvim

The lazy.nvim plugin spec and a visual-mode keymap are below; the plugin talks to the same local server at http://127.0.0.1:11434.

Step 4: Terminal Integration

Sometimes I just want to ask a quick question without leaving the terminal. A small ask() shell function (below) wraps ollama run, and you can pipe files straight into a prompt.

Performance Tuning

GPU Acceleration (NVIDIA)

Ollama auto-detects CUDA. Verify it's using your GPU by running a model with --verbose and looking for "using CUDA" in the output. If it isn't detected, ensure you have the NVIDIA drivers and nvidia-container-toolkit installed.

Reduce Memory Usage

Loading multiple models eats RAM, and Ollama keeps models in memory by default. It auto-unloads after about five minutes of idle time; you can list what's loaded through the API (curl command below) or restart the service to clear everything.

Speed vs Quality

For faster responses with a slight quality drop, use quantized models (pull command below). I use full precision for complex refactoring, quantized for quick completions.

Real-World Usage

After a month with this setup: local models handle the everyday work (completion, explaining code, tests, regex, commit messages), while I still reach for cloud AI on the hard cases; the full lists are below. The local setup handles 80% of my daily AI coding needs. That's a win.

Troubleshooting

- "Model not found" — Run ollama list to see installed models. Pull again if missing.
- Slow responses — Try a smaller model or quantized version. Check if it's using the GPU with --verbose.
- Out of memory — Close other apps, use a smaller model, or add swap space.
- Connection refused — Ensure ollama serve is running. Check nothing else is on port 11434.

What's Next
Once you're comfortable, there's plenty more to explore (ideas in the list below). The local LLM ecosystem is moving fast. Models that needed 64GB of RAM two years ago now run on laptops. It's only getting better.

More at dev.to/cumulus
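Speaking of building on this setup: the API server on port 11434 can be scripted with nothing beyond Python's standard library. A minimal sketch (the function names are mine; the /api/generate endpoint and its model/prompt/stream fields are Ollama's):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local server

def build_payload(model: str, prompt: str) -> dict:
    """Body for POST /api/generate; stream=False returns one JSON reply."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send a prompt to a locally running model and return its text response."""
    req = urllib.request.Request(
        OLLAMA_URL + "/api/generate",
        data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With ollama serve running, generate("deepseek-coder:6.7b", "Explain Python's GIL") returns the completion as a plain string.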


Install and start Ollama (Step 1):

```bash
# Linux/WSL
curl -fsSL https://ollama.ai/install.sh | sh

# macOS
brew install ollama

# Start the service
ollama serve
```

Pull coding models (Step 2):

```bash
# Best balance of speed and quality (7B params, ~4GB)
ollama pull deepseek-coder:6.7b

# Faster, smaller, good for completions (3B params, ~2GB)
ollama pull starcoder2:3b

# Heavy hitter if you have the RAM (33B params, ~20GB)
ollama pull codellama:34b
```

Smoke test:

```bash
ollama run deepseek-coder:6.7b "Write a Python function to merge two sorted lists"
```

Continue config.json (Step 3):

```json
{
  "models": [
    {
      "title": "DeepSeek Coder",
      "provider": "ollama",
      "model": "deepseek-coder:6.7b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "StarCoder",
    "provider": "ollama",
    "model": "starcoder2:3b"
  }
}
```

Neovim plugin spec:

```lua
-- lazy.nvim
{
  "nomnivore/ollama.nvim",
  dependencies = { "nvim-lua/plenary.nvim" },
  cmd = { "Ollama", "OllamaModel" },
  opts = {
    model = "deepseek-coder:6.7b",
    url = "http://127.0.0.1:11434",
  },
}
```

Prompt on a visual selection:

```lua
vim.keymap.set("v", "<leader>oo", ":<c-u>lua require('ollama').prompt()<cr>")
```

Terminal helper (Step 4):

```bash
# Add to .bashrc/.zshrc
ask() {
  ollama run deepseek-coder:6.7b "$*"
}

# Usage
ask "What's the time complexity of Python's sorted()?"
```
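The ask() pattern extends to the commit-message use case from the usage notes: feed the staged diff to the model. A sketch in Python (the helper names are hypothetical), assuming git and ollama are on PATH:

```python
import subprocess

def staged_diff() -> str:
    """The diff of what's currently staged (git diff --staged)."""
    return subprocess.run(
        ["git", "diff", "--staged"],
        capture_output=True, text=True, check=True,
    ).stdout

def commit_prompt(diff: str) -> str:
    """Short instruction plus the diff; terse prompts suit small local models."""
    return "Write a one-line git commit message for this diff:\n\n" + diff

def suggest_commit_message(model: str = "deepseek-coder:6.7b") -> str:
    """Ask the local model to draft a commit message for the staged changes."""
    result = subprocess.run(
        ["ollama", "run", model, commit_prompt(staged_diff())],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

Run suggest_commit_message() from inside a repo with staged changes; review the suggestion before committing.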
Pipe a file into a prompt:

```bash
cat broken_script.py | ollama run deepseek-coder:6.7b "Fix the bugs in this code"
```

Verify GPU acceleration:

```bash
ollama run deepseek-coder:6.7b --verbose
# Look for "using CUDA" in output
```

Check loaded models:

```bash
# List loaded models
curl http://localhost:11434/api/tags

# Ollama auto-unloads after 5 min idle
# Or restart the service to clear everything
```

Pull a quantized model:

```bash
# q4 = 4-bit quantization, faster, less accurate
ollama pull deepseek-coder:6.7b-instruct-q4_0
```

Why I switched:

- Privacy — My client code never leaves my machine
- Cost — $0/month after initial setup
- Speed — No network latency, works offline

Hardware checklist:

- RAM: 16GB minimum, 32GB recommended
- GPU: Optional but helps (NVIDIA with 8GB+ VRAM ideal)
- Storage: 10-50GB depending on models
- OS: Linux, macOS, or Windows with WSL2

Continue setup:

- Install the Continue extension from the VS Code marketplace
- Open Continue settings (Cmd/Ctrl + Shift + P → "Continue: Open config.json")
- Add your Ollama models (config.json above)

What Continue gives you:

- Chat with code context (highlight code → ask questions)
- Tab completions as you type
- Inline edits (Cmd+I to refactor selected code)

What works well:

- Code completion and boilerplate
- Explaining unfamiliar code
- Writing tests for existing functions
- Regex and SQL generation
- Git commit messages

Still use cloud AI for:

- Complex architectural decisions
- Multi-file refactoring
- Debugging truly weird issues

What's next:

- Try different models — Mistral, Phi-3, Llama 3 all have coding variants
- Fine-tune on your codebase — Ollama supports custom Modelfiles
- Build custom tools — The Ollama API is dead simple to script against
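On the Modelfile point: strictly speaking, a Modelfile layers a system prompt and parameters onto a base model (configuration rather than weight-level fine-tuning). A minimal sketch; the model name mycoder and the prompt text are made up:

```
# Modelfile (hypothetical example)
FROM deepseek-coder:6.7b
PARAMETER temperature 0.2
SYSTEM """You are a concise senior engineer. Prefer code over prose."""
```

Build and run it with ollama create mycoder -f Modelfile, then ollama run mycoder.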