Tools: The Developer's Guide to Running LLMs Locally: Ollama, Gemma 4, and Why Your Side Projects Don't Need an API Key

90+ Projects and Counting

Every tutorial about building with LLMs starts the same way: "First, get your OpenAI API key." But what if I told you that you can build production-quality AI applications without ever making a cloud API call? I've built over 90 applications using local LLMs — no API keys, no cloud costs, no rate limits. Here's a practical guide to getting started with Ollama and Gemma 4 for your own projects.

Why Local LLMs?

Before diving into the how, let's talk about why:

1. Zero Cost Per Request

Cloud APIs charge per token. A moderate application making 1,000 requests/day costs $30-100/month. Scale to production and you're looking at thousands per month. Local inference costs electricity — pennies per hour.

2. No Rate Limits

I've hit OpenAI rate limits at 3 AM on a Sunday during a hackathon. With local models, you can generate as fast as your hardware allows, 24/7.

3. Privacy by Default

No data leaves your machine. This isn't just nice-to-have — it's essential for healthcare (HIPAA), legal (attorney-client privilege), finance (PCI), and education (FERPA) applications.

4. Offline Capability

Once the model is downloaded, you need zero internet. Build on a plane. Demo without WiFi. Deploy in air-gapped environments.

5. Reproducibility

Cloud models change without notice. GPT-4 in January behaves differently than GPT-4 in June. Local models are frozen — same model, same behavior, always.

Getting Started: 5 Minutes to Your First Local LLM

Step 1: Install Ollama

```bash
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download
```

Step 2: Pull Gemma 4

```bash
ollama pull gemma4
```

This downloads the model (~5GB). One-time cost, then it's on your machine forever.

Step 3: Test It

```bash
ollama run gemma4 "Explain quantum computing in one paragraph"
```

That's it. You now have a local LLM running on your machine.

Building Applications with Python + Ollama

Here's a minimal Python application:

```python
import ollama

def ask(question: str) -> str:
    response = ollama.generate(
        model="gemma4",
        prompt=question,
        options={"temperature": 0.3}
    )
    return response["response"]

# That's literally it
print(ask("What are the SOLID principles in software engineering?"))
```

Adding Structure: The Pattern I Use in 90+ Projects

```python
class LocalLLMApp:
    def __init__(self, model: str = "gemma4"):
        self.client = ollama.Client()
        self.model = model

    def generate(self, prompt: str, temperature: float = 0.3,
                 system: str | None = None) -> str:
        messages = []
        if system:
            messages.append({"role": "system", "content": system})
        messages.append({"role": "user", "content": prompt})
        response = self.client.chat(
            model=self.model,
            messages=messages,
            options={"temperature": temperature}
        )
        return response["message"]["content"]
```

This base class pattern is the foundation of every application I've built. Domain-specific logic goes in subclasses — the LLM integration stays clean and swappable.

Adding a Web Interface: Streamlit

```python
import streamlit as st

app = LocalLLMApp()

st.title("My Local AI Tool")
user_input = st.text_area("Enter your text:")

if st.button("Analyze"):
    with st.spinner("Thinking..."):
        result = app.generate(user_input)
    st.write(result)
```

Three imports. Ten lines. A full web interface for your local AI tool.

Adding an API: FastAPI

```python
from fastapi import FastAPI
from pydantic import BaseModel

api = FastAPI()
app = LocalLLMApp()

class Query(BaseModel):
    text: str
    temperature: float = 0.3

@api.post("/analyze")
async def analyze(query: Query):
    result = app.generate(query.text, temperature=query.temperature)
    return {"result": result}
```

Now you have a REST API that any frontend, mobile app, or service can call — all running locally.

Docker: One-Command Deployment

Every project I build ships with this docker-compose.yml:

```yaml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

  app:
    build: .
    ports:
      - "8501:8501"
      - "8000:8000"
    depends_on:
      - ollama
    environment:
      - OLLAMA_HOST=http://ollama:11434

volumes:
  ollama-data:
```

`docker compose up` — that's the entire deployment story. Works on any machine with Docker and a GPU.

Performance: What to Expect

On consumer hardware (RTX 3080, 16GB RAM):

- Simple Q&A: 0.5-1 second
- Paragraph generation: 2-5 seconds
- Document analysis (2-3 pages): 5-15 seconds
- Long-form generation (1000+ words): 15-30 seconds

These are practical, usable response times for interactive applications.

When to Use Cloud vs. Local

My rule: start local, move to cloud only when you've proven the concept and need scale that local hardware can't handle.

I've applied this pattern across:

- Healthcare: Patient intake, lab results, EHR de-identification
- Legal: Contract analysis, brief generation, compliance checking
- Education: Study bots, exam generators, flashcard creators
- Creative: Story generators, poetry engines, mood journals
- Developer Tools: Code review, API docs, performance profiling
- Finance: Budget analyzers, financial report summarizers
- Security: Vulnerability scanners, alert summarizers

Every single one follows the same pattern: Ollama + Gemma 4 + Python + FastAPI + Streamlit + Docker.

Start building locally. Your AI projects don't need an API key.

*Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team. He maintains 116+ original open-source repositories built with local LLMs. Read more on dev.to.*