Quick start on macOS:

```shell
# install via Homebrew
brew install ollama

# start the Ollama server
ollama serve

# in another terminal: pull and run a model
ollama run llama3.3

# verify the API server is answering
curl http://localhost:11434/api/generate -d '{ "model": "llama3.3", "prompt": "Hello, world!", "stream": false }'
```
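The same `generate` endpoint is just as easy to call from code. Here is a minimal standard-library sketch (the function names are mine, not part of Ollama's tooling) that posts a prompt and returns the response text:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_payload(model: str, prompt: str, stream: bool = False) -> bytes:
    """Encode the JSON body that /api/generate expects."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode("utf-8")

def generate(model: str, prompt: str) -> str:
    """Send one non-streaming generation request and return the response text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # a non-streaming response carries the full text in the "response" field
        return json.loads(resp.read())["response"]
```

With the server running, `generate("llama3.3", "Hello, world!")` mirrors the curl call above.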
On Windows, after installing from ollama.com/download:

```shell
ollama run llama3.3
```
Quick start on Linux:

```shell
# install with the official script
curl -fsSL https://ollama.com/install.sh | sh

# start the service
sudo systemctl start ollama

# run a model
ollama run llama3.3
```
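Before pointing clients at the server, it helps to confirm something is actually listening on port 11434 (a running Ollama answers a plain GET on `/`). A small standard-library sketch, with a helper name of my own choosing:

```python
import urllib.request
import urllib.error

def is_ollama_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at base_url within the timeout."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # connection refused, DNS failure, or timeout: nothing is listening
        return False
```

This avoids confusing "model not found" errors with "the service isn't running at all".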
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="llama3.3",
    messages=[
        {"role": "user", "content": "Explain recursion in simple terms"}
    ],
)
print(response.choices[0].message.content)
```
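One thing the snippet above glosses over: the chat endpoint is stateless, so multi-turn conversations work by resending the whole `messages` list on every call. A sketch of that bookkeeping (the helper names are mine), independent of any client library:

```python
import json

def add_turn(history, role, content):
    """Return a new message list with one more turn appended."""
    return history + [{"role": role, "content": content}]

def chat_request_body(model, history):
    """Build the JSON body a /v1/chat/completions request carries."""
    return json.dumps({"model": model, "messages": history})

history = add_turn([], "user", "Explain recursion in simple terms")
# append the model's reply before asking a follow-up, so it has context:
history = add_turn(history, "assistant", "Recursion is a function calling itself...")
history = add_turn(history, "user", "Show a one-line Python example")
body = chat_request_body("llama3.3", history)
```

The client owns the conversation state; the server only ever sees what you send it.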
Entry-level hardware:

- What runs well: Models up to 3B parameters (Llama 3.2 3B, Phi-4 Mini, Gemma 3 1B)
- Typical performance: 10-25 tokens per second on Apple Silicon; slower on older Intel/AMD CPUs
- Good for: Summarization, simple Q&A, code completion, text classification
- Limitations: Larger models will either not load or run painfully slowly

Mid-range hardware:

- What runs well: Models up to 8B parameters at full quality, 14B models with quantization (Llama 3.1 8B, Mistral 7B, Phi-4 14B quantized, Gemma 3 12B)
- Typical performance: 15-40 tokens per second depending on model size and hardware
- Good for: Coding assistance, writing, research, document analysis, creative tasks
- Reality check: This tier handles 90% of what most people need from a local LLM

High-end hardware:

- What runs well: Models of 30B+ parameters (Llama 3.3 70B quantized, DeepSeek-R1 32B, Qwen 2.5 32B, Mixtral 8x7B)
- Typical performance: Varies widely; 70B models at Q4 quantization run at 5-15 tokens per second on a Mac Studio with 64GB unified memory
- Good for: Complex reasoning, long-form writing, code generation for entire features, multi-step analysis
- Note: A dedicated GPU (NVIDIA RTX 3090/4090 with 24GB VRAM) dramatically improves performance for these larger models on Windows and Linux

Ollama:

- Dead-simple CLI: `ollama run llama3.3` downloads and starts the model
- OpenAI-compatible API server built in — drop-in replacement for cloud APIs
- Huge model library with one-command downloads
- Lightweight, runs as a background service
- Works on Mac, Windows, and Linux

LM Studio:

- Beautiful GUI — no terminal required
- Built-in model discovery and download from Hugging Face
- Chat interface with conversation history
- Local API server for integrations
- Advanced configuration (quantization, context length, GPU layers) through the UI

llama.cpp:

- Maximum performance — hand-optimized for Apple Silicon, AVX2, CUDA, and more
- Full control over quantization, context size, batch size, and inference parameters
- Supports GGUF model format — the standard for local models
- Active development with new optimizations landing weekly

GPT4All:

- LocalDocs feature indexes your files for retrieval-augmented generation (RAG)
- Works completely offline after initial setup
- Enterprise deployment options
- Simple, focused interface

Installing on macOS:

- Install Ollama. Open Terminal and run `brew install ollama`
- Start the Ollama service: `ollama serve`
- Pull and run a model: `ollama run llama3.3`
- Verify the API server is running at http://localhost:11434

Installing on Windows:

- Download and install Ollama from ollama.com/download and run the installer.
- Open PowerShell or Command Prompt and run `ollama run llama3.3`.
- For GPU acceleration, ensure you have the latest NVIDIA drivers installed. Ollama automatically detects and uses CUDA-capable GPUs.

Installing on Linux:

- Install with the official script: `curl -fsSL https://ollama.com/install.sh | sh`
- Start the service: `sudo systemctl start ollama`
- Run a model: `ollama run llama3.3`
- For NVIDIA GPU support, install the NVIDIA Container Toolkit and CUDA drivers. Ollama detects them automatically.

Local LLMs handle these tasks well:

- Single-turn Q&A, summarization, and classification
- Code completion and generation for well-defined tasks
- Writing assistance (drafts, editing, brainstorming)
- Document analysis and extraction
- Private data processing
- Development and testing of AI-powered applications

Where cloud models still have the edge:

- Models above 70B parameters (GPT-4, Claude Opus, Gemini Ultra) offer reasoning depth that local models cannot match yet
- Multimodal tasks (image generation, video analysis) require significant GPU resources
- Very long context windows (100K+ tokens) demand more RAM than most consumer machines have
- Real-time voice and streaming applications benefit from cloud infrastructure

Getting started:

- Check your hardware at canirun.ai to see what models your machine can handle
- Install Ollama — one command, all platforms
- Run `ollama run llama3.3` — start chatting in under a minute
- Experiment — try different models for different tasks
- Integrate — point your apps at localhost:11434 and start building
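To make the hardware tiers above concrete: a quantized model's weights need roughly parameters × bits ÷ 8 bytes, plus headroom for the KV cache and runtime buffers. A back-of-envelope sketch (the 1.2 overhead factor is my own rough assumption, not a published figure):

```python
def approx_model_ram_gb(params_billion: float, bits: int = 4, overhead: float = 1.2) -> float:
    """Rough RAM estimate for running a quantized model.

    params_billion: parameter count in billions (70 for a 70B model)
    bits: quantization width (Q4 -> 4, Q8 -> 8, FP16 -> 16)
    overhead: fudge factor for KV cache and buffers (assumption)
    """
    weight_gb = params_billion * bits / 8  # billions of params x bytes per param = GB
    return weight_gb * overhead

# A 70B model at Q4 wants on the order of 42 GB, which is why it needs a
# 64GB machine; an 8B model at Q4 fits comfortably under 5 GB.
print(approx_model_ram_gb(70, bits=4))  # prints 42.0
```

Numbers like these are only ballpark figures — context length and runtime add memory on top — but they explain the tier boundaries in seconds of arithmetic.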