Hetzner Cloud for AI Projects — Complete GPU Server Setup & Cost Breakdown 2026

Contents

- Why Hetzner for AI Workloads
  - The Price Gap Is Real
  - What Hetzner Does Well
  - What Hetzner Does Not Do
- The Hetzner AI Server Lineup
  - Tier 1: Cost-Optimized Cloud (CX Series) — €3.99-€14.99/mo
  - Tier 2: ARM Cloud Instances (CAX Series) — Better Performance per Euro
  - Tier 3: GEX44 — Dedicated GPU Server (€184/mo)
  - Tier 4: GEX131 — High-End GPU Server
- Budget Path: Running Small LLMs on CX/CAX Instances
  - What You Can Run
  - Performance Reality Check
- GPU Path: Setting Up the GEX44
  - Step 1: Order and Initial Access
  - Step 2: Install NVIDIA Drivers
  - Step 3: Install Docker with GPU Support
  - Step 4: Deploy Ollama with GPU Acceleration
  - Step 5: Pull Models That Fit 20 GB VRAM
  - Step 6: Set Up HTTPS
- Cost Comparison: Hetzner vs AWS vs GCP
  - Entry-Level GPU Tier
  - Budget CPU Tier (No GPU)
  - What the Cloud Providers Offer That Hetzner Does Not
- Deployment with Docker and Coolify
- Our Infrastructure at Effloow
- Choosing the Right Tier
  - CX23 (€3.99/mo) — Start Here If...
  - CX33/CAX31 (€6.49-€10/mo) — Upgrade When...
  - GEX44 (€184/mo) — The AI Sweet Spot If...
  - GEX131 — Production AI If...
- Getting Started: Your First Hour
- Conclusion

Running AI workloads on AWS or GCP is expensive. A single A100 instance on AWS costs $3-4 per hour — over $2,000 a month if you leave it running. For startups, indie developers, and small teams experimenting with AI, that math kills projects before they start.

Hetzner offers an alternative that most of the AI community outside Europe has not discovered yet: budget cloud instances from €3.99/month for lightweight inference, dedicated GPU servers with the NVIDIA RTX 4000 Ada from €184/month, and European data centers with flat monthly pricing and no bandwidth surprises.

This guide covers the full Hetzner AI server lineup, from €4/month CPU instances running tiny models to dedicated GPU servers handling production workloads. We will walk through actual setup, realistic performance expectations, and an honest cost comparison against AWS and GCP.

Why Hetzner for AI Workloads

Hetzner is a German hosting company that has been around since 1997. They are not a startup. They run their own data centers in Falkenstein, Nuremberg, and Helsinki. Their pricing has always been aggressive compared to US-based cloud providers, and that gap has only widened as AWS and GCP have raised prices.

The Price Gap Is Real

Hetzner's cost advantage is not 10-20% — it is 60-80% for equivalent compute. A Hetzner cloud server with 2 vCPUs and 4 GB RAM costs €3.99/month. A comparable instance on AWS (t3.medium) costs roughly $30/month. DigitalOcean and Vultr sit in between at $15-20/month for similar specs.

For AI workloads specifically, the gap gets even wider at the GPU tier. Hetzner's dedicated GPU servers start at €184/month. AWS GPU instances (g5.xlarge with an A10G) start at roughly $1.00/hour — over $700/month for always-on use.

The Hetzner AI Server Lineup

Hetzner offers multiple tiers for AI workloads. Here is the full spectrum from budget to production.

Tier 1: Cost-Optimized Cloud (CX Series) — €3.99-€14.99/mo

These are shared vCPU instances. No GPU. CPU-only inference for small models.
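The always-on math behind this price gap is easy to sanity-check. A quick sketch — the $1.00/h rate and 730 hours/month are the illustrative figures from this section, not live quotes:

```shell
# Rough always-on cost comparison (illustrative numbers, not current quotes).
hours_per_month=730                 # roughly 24 * 365 / 12
aws_g5_hourly="1.00"                # assumed on-demand $/h for g5.xlarge
hetzner_gex44_monthly=184           # EUR/month, flat

aws_monthly=$(awk -v h="$hours_per_month" -v r="$aws_g5_hourly" \
  'BEGIN { printf "%.0f", h * r }')

echo "AWS g5.xlarge always-on: ~\$${aws_monthly}/month"
echo "Hetzner GEX44 flat:      ~EUR ${hetzner_gex44_monthly}/month"
```

Flat pricing also means the Hetzner figure does not change if you forget to shut the server down.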
AI use case: Running Ollama with small models (3B-7B parameters) for personal chatbots, lightweight RAG, or API-based inference for low-traffic applications. We covered this exact setup in our Ollama + Open WebUI self-hosting guide.

Realistic expectations: A CX23 can run a 3B model at roughly 3-6 tokens/second (CPU inference). A CX33 can handle a 7-8B model at 1-3 tokens/second. This is usable for async workflows but not for interactive chat.

Tier 2: ARM Cloud Instances (CAX Series) — Better Performance per Euro

Hetzner's Ampere-based ARM servers offer better compute efficiency than the x86 CX series at similar or lower price points.

AI use case: ARM chips handle inference workloads efficiently. Ollama has native ARM support, so these servers run small models with lower power draw and often better single-thread performance than the CX series at the same price. Good for always-on inference APIs.

Tier 3: GEX44 — Dedicated GPU Server (€184/mo)

This is where things get serious for AI workloads.

AI use case: The RTX 4000 SFF Ada with 20 GB VRAM can run models up to ~32B parameters (4-bit quantized). It handles 7B-14B models comfortably with fast inference. This is the sweet spot for small teams running production AI inference, fine-tuning smaller models, or serving multiple users simultaneously.

The 20 GB of VRAM is the key spec. It puts this server above consumer RTX 4060/4070 cards (8-12 GB) and into territory where you can run meaningful models without aggressive quantization.

Tier 4: GEX131 — High-End GPU Server

For production AI workloads that need serious GPU compute.

AI use case: With 96 GB of VRAM, this server can run 70B+ parameter models at 8-bit precision (full FP16 weights for a 70B model would need roughly 140 GB), handle multiple concurrent inference requests, or fine-tune large models. The 5th-generation Tensor Cores and Blackwell architecture make this competitive with cloud A100 instances at a fraction of the cost. 256 GB of system RAM, expandable to 768 GB, also makes this viable for large-scale RAG deployments where you need to keep embedding databases in memory.

Budget Path: Running Small LLMs on CX/CAX Instances

You do not need a GPU to run AI inference.
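A rough way to predict whether a model fits a given card: weights take about params × bits/8 gigabytes, plus headroom for KV cache and runtime overhead. The 20% overhead factor below is a rule-of-thumb assumption, not a measured number:

```shell
# Rule-of-thumb VRAM estimate: params (billions) * bits / 8, plus ~20% headroom.
# The 1.2 factor is an assumption; real usage depends on context length and runtime.
fits_in_vram() {
  params_b=$1; bits=$2; vram_gb=$3
  awk -v p="$params_b" -v bits="$bits" -v v="$vram_gb" 'BEGIN {
    need = p * bits / 8 * 1.2
    printf "%.1f GB needed, %s\n", need, (need <= v ? "fits" : "does not fit")
  }'
}

fits_in_vram 14 4 20   # 14B at Q4 on the 20 GB GEX44
fits_in_vram 32 4 20   # 32B at Q4 on the GEX44: tight but possible
fits_in_vram 70 8 96   # 70B at 8-bit on the 96 GB GEX131
```

This matches the claims above: 14B is comfortable on 20 GB, 32B at Q4 just squeezes in, and 70B-class models need the GEX131.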
CPU-only inference with quantized models is slow but functional — and incredibly cheap.

What You Can Run

A CX23 (€3.99/month, 4 GB RAM) handles 1B-4B models; a CX33 (€6.49/month, 8 GB RAM) handles 7-8B models. Specific model recommendations are in the reference lists at the end of this article. Install Docker and run Ollama (exact commands are in the command appendix below). For a full web interface, add Open WebUI as described in our Ollama self-hosting guide. If you are running multiple services on the same server, a deployment platform like Coolify or Dokploy simplifies container management significantly.

Performance Reality Check

CPU inference is measured in single-digit tokens per second: roughly 3-6 tokens/second for a 3B model on a CX23, and 1-3 tokens/second for a 7-8B model on a CX33.

These numbers are usable for: API backends with tolerant timeouts, batch processing, personal assistants where you can wait a few seconds, and development/testing before deploying to GPU servers. They are not usable for: real-time chat with multiple users, latency-sensitive applications, or anything requiring more than a few concurrent requests.

GPU Path: Setting Up the GEX44

The GEX44 at €184/month is the entry point for serious AI work on Hetzner. Here is how to set it up from scratch.

Step 1: Order and Initial Access

Order from the Hetzner Robot panel. Expect the €79 setup fee on your first invoice. Provisioning typically takes 1-3 business days for dedicated servers (unlike cloud instances, which spin up in seconds). Once provisioned, you will receive root SSH access (ssh root@your-server-ip).

Step 2: Install NVIDIA Drivers

The GEX44 comes with bare-metal access, so you install GPU drivers yourself (commands in the appendix below). After the reboot, run nvidia-smi to verify the GPU is recognized. You should see the RTX 4000 SFF Ada with 20 GB VRAM listed.

Step 3: Install Docker with GPU Support

Install Docker and the NVIDIA Container Toolkit (commands in the appendix below), then verify Docker can see the GPU by running nvidia-smi inside a CUDA base image.

Step 4: Deploy Ollama with GPU Acceleration

Create the docker-compose.yml from the appendix below and start the stack with docker compose up -d.

Step 5: Pull Models That Fit 20 GB VRAM

With 20 GB of VRAM, you can run substantial models (pull commands in the appendix below). The sweet spot is 14B models. They fit comfortably in 20 GB with room for context, run at speeds that feel interactive, and deliver quality that is genuinely useful for production work.

Step 6: Set Up HTTPS

For remote access, add Caddy as a reverse proxy: install the caddy package, edit /etc/caddy/Caddyfile to proxy your domain to localhost:3000, and reload the service (see the appendix below). Caddy handles SSL automatically. Access your AI at https://ai.yourdomain.com.

Cost Comparison: Hetzner vs AWS vs GCP

Here is an honest comparison for equivalent GPU compute, based on always-on monthly pricing as of early 2026.
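To translate those throughput numbers into felt latency, divide the response length by tokens/second. The sketch below uses 4 and 2 tok/s as midpoints of the CPU figures above, and 30 tok/s as the interactive GPU baseline quoted later in this guide:

```shell
# Estimated wall-clock time to generate n tokens at a given tokens/second rate.
eta() { awk -v n="$1" -v tps="$2" 'BEGIN { printf "%d tokens at %g tok/s: ~%.0f s\n", n, tps, n/tps }'; }

eta 200 4    # short answer, 3B model on a CX23
eta 200 2    # short answer, 7-8B model on a CX33
eta 200 30   # same answer at interactive GPU speed (GEX44)
```

A 50-100 second wait is fine for batch jobs and async pipelines; it is why the CPU tier works for background work but not chat.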
Entry-Level GPU Tier

Hetzner is 2.5-3.3× cheaper than hyperscalers for comparable GPU compute. The trade-offs: no managed ML services, manual setup, and EU-only data centers.

Budget CPU Tier (No GPU)

At the budget tier, Hetzner is 4-6× cheaper. This is where it shines for development, testing, and low-traffic inference.

Bottom line: Hetzner wins on predictable, always-on workloads where you know your compute needs. Hyperscalers win on variable demand, managed services, and global distribution.

What the Cloud Providers Offer That Hetzner Does Not

Managed MLOps (SageMaker, Vertex AI), spot pricing, 30+ global regions, and auto-scaling. The full breakdown is in the reference lists at the end of this article.

Deployment with Docker and Coolify

If you are running multiple AI services (Ollama, vector databases, monitoring) alongside other applications on the same Hetzner server, manual Docker Compose management gets tedious. This is where a self-hosted PaaS like Coolify or Dokploy adds value. We compared both platforms in detail in our Coolify vs Dokploy comparison; the short version is in the reference lists at the end of this article. Either one gives you a web dashboard for managing containers, automatic SSL, Git-based deployments, and basic monitoring — without touching the command line every time you need to update a container. For a full walkthrough of running Coolify on Hetzner alongside other developer tools, see our self-hosting dev stack guide.

Our Infrastructure at Effloow

At Effloow, we run 14 AI agents that handle everything from content research to code generation. Our infrastructure choices reflect the same cost-conscious thinking behind this guide. We use Hetzner cloud instances for non-GPU workloads: deployment platforms, Git hosting, monitoring, and lightweight services. The flat monthly pricing means our infrastructure bill is predictable regardless of how many articles the agents produce.

For AI inference specifically, we use a mix of API services (Claude, GPT) for tasks requiring frontier intelligence and self-hosted models for high-volume, lower-complexity work. The GEX44 tier is compelling for teams at our stage — it is enough GPU to run production inference at a cost that does not require venture capital to sustain.
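One way to decide between a flat-rate server and per-token APIs is a break-even estimate. The blended API price below is a made-up illustration (real pricing varies widely by model and provider), and the calculation ignores exchange rates and output-quality differences:

```shell
# Break-even between a flat-rate GPU server and per-token API pricing.
# api_usd_per_mtok is an assumed blended price, purely for illustration.
server_eur=184
api_usd_per_mtok="0.50"

break_even=$(awk -v s="$server_eur" -v p="$api_usd_per_mtok" \
  'BEGIN { printf "%.0f", s / p }')

echo "Break-even at roughly ${break_even}M tokens/month (FX ignored)"
```

Below the break-even volume, APIs are cheaper; above it, the flat-rate server wins — which is the logic behind routing high-volume, lower-complexity work to self-hosted models.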
Choosing the Right Tier

The decision framework we use internally: frontier intelligence from API services, high-volume predictable inference self-hosted on Hetzner GPU, lightweight always-on AI on CX/CAX instances, and managed MLOps (when genuinely needed) on AWS/GCP. A quick per-tier decision guide — which workloads fit the CX23, CX33/CAX31, GEX44, and GEX131 — is in the reference lists at the end of this article.

Getting Started: Your First Hour

If you are new to Hetzner, here is the fastest path to running AI (step-by-step commands in the appendix below): sign up, create a CX23 with Ubuntu 24.04 and your SSH key, install Docker, run Ollama, pull a 3B model, and test the API. Total time: under 10 minutes. Total cost: €3.99 for the first month. When you outgrow the CX23, migrate your Ollama data volume to a bigger instance. When you need GPU speed, order a GEX44 and follow the GPU setup section above.

Conclusion

Hetzner is not the right choice for every AI workload. If you need managed ML services, global data centers, or spot pricing for burst GPU compute, the hyperscalers are still the answer. But for predictable, always-on AI infrastructure at a fraction of the cost — personal AI assistants, team inference servers, self-hosted chatbots, development and testing environments — Hetzner is hard to beat.

The lineup covers the full spectrum: €3.99/month for experimentation, €184/month for production GPU inference, and higher tiers for serious AI workloads. All with flat pricing, unlimited bandwidth, and EU data residency.

Start with a CX23 and a 3B model. See if self-hosted inference fits your workflow. If it does, the upgrade path is straightforward — bigger instances, better models, and eventually dedicated GPU hardware, all from the same provider.
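The quick decision guide above can be condensed into a small helper, handy in provisioning scripts. The profile names are our own labels, not Hetzner terminology:

```shell
# Map a workload profile (our labels, not Hetzner's) to the tier this guide suggests.
recommend() {
  case "$1" in
    experiment) echo "CX23 (€3.99/mo): small models, CPU-only inference" ;;
    team)       echo "CX33/CAX31 (€6.49-€10/mo): 7-8B models plus shared services" ;;
    production) echo "GEX44 (€184/mo): 14B-32B models, interactive GPU speeds" ;;
    large)      echo "GEX131: 70B-class models, 96 GB VRAM" ;;
    *)          echo "unknown profile: $1"; return 1 ;;
  esac
}

recommend experiment
recommend production
```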

Appendix: Commands and Reference Lists

Budget path — install Docker and run Ollama on a CX/CAX instance:

```shell
# Install Docker
curl -fsSL https://get.docker.com | sh

# Run Ollama
docker run -d --name ollama -p 11434:11434 \
  -v ollama_data:/root/.ollama \
  ollama/ollama:latest

# Pull a model
docker exec -it ollama ollama pull llama3.2:3b
```

GEX44 Step 1 — initial access:

```shell
ssh root@your-server-ip
```

GEX44 Step 2 — install NVIDIA drivers:

```shell
# Update system
apt update && apt upgrade -y

# Install NVIDIA driver dependencies
apt install -y build-essential linux-headers-$(uname -r)

# Install NVIDIA drivers (Ubuntu 22.04/24.04)
apt install -y nvidia-driver-550

# Reboot
reboot
```

GEX44 Step 3 — Docker with GPU support:

```shell
# Install Docker
curl -fsSL https://get.docker.com | sh

# Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt update && apt install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker

# Verify Docker can see the GPU
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
```

GEX44 Step 4 — docker-compose.yml:

```yaml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_MAX_LOADED_MODELS=2
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    volumes:
      - open_webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_AUTH=true
      - WEBUI_SECRET_KEY=change-this-to-a-random-string
    depends_on:
      - ollama

volumes:
  ollama_data:
  open_webui_data:
```

Start the stack:

```shell
docker compose up -d
```

GEX44 Step 5 — pull models that fit 20 GB VRAM:

```shell
# 14B model — fits easily, fast inference
docker exec -it ollama ollama pull phi4:14b

# 32B model (Q4) — fits in ~18 GB, good quality
docker exec -it ollama ollama pull qwen2.5:32b-instruct-q4_K_M

# Coding-specific model
docker exec -it ollama ollama pull qwen2.5-coder:14b
```

GEX44 Step 6 — HTTPS with Caddy:

```shell
apt install -y caddy
```

Edit /etc/caddy/Caddyfile:

```
ai.yourdomain.com {
    reverse_proxy localhost:3000
}
```

```shell
systemctl reload caddy
```

Getting started — your first hour:

```shell
# 1. Sign up at hetzner.com and create a cloud project
# 2. Create a CX23 instance (€3.99/mo) via the console
#    - Choose Ubuntu 24.04
#    - Add your SSH key
#    - Pick Falkenstein or Helsinki

# 3. SSH into your server
ssh root@your-server-ip

# 4. Install Docker
curl -fsSL https://get.docker.com | sh

# 5. Run Ollama
docker run -d --name ollama -p 11434:11434 \
  -v ollama_data:/root/.ollama ollama/ollama:latest

# 6. Pull a small model
docker exec -it ollama ollama pull llama3.2:3b

# 7. Test it
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.2:3b", "prompt": "Hello, how are you?"}'
```

What Hetzner Does Well

- Flat monthly pricing. No surprise bandwidth bills, no hidden egress charges. Traffic is unlimited on most plans.
- EU data centers. Falkenstein and Helsinki give you GDPR compliance by default.
- Straightforward networking. Private networks, floating IPs, and load balancers at prices that make sense.
- ARM instances. Ampere-based CAX servers offer strong performance-per-euro for inference workloads.

What Hetzner Does Not Do

- No managed AI/ML services. No SageMaker equivalent, no managed Jupyter, no model registries. You manage everything yourself.
- No spot/preemptible instances. You cannot get cheap burst GPU time. It is flat monthly pricing or nothing.
- Limited GPU availability. Dedicated GPU servers can have waitlists. AWS and GCP have broader GPU SKU availability.
- No US data centers. If you need sub-50ms latency for US users, Hetzner is not the right choice.

What You Can Run on a CX23 (4 GB RAM)

- Llama 3.2 3B (Q4) — Fits in ~2-3 GB. General chat and simple tasks.
- Phi-3.5 Mini 3.8B (Q4) — Microsoft's efficient model. Good for code and reasoning.
- TinyLlama 1.1B — Fast even on CPU. Useful for classification and simple generation.

What You Can Run on a CX33 (8 GB RAM)

- Llama 3.1 8B (Q4) — Solid general model. ~5 GB loaded.
- Gemma 2 2B — Google's efficient model. Punches above its weight.
- Qwen 2.5 7B (Q4) — Excellent for multilingual use cases.

What the Cloud Providers Offer That Hetzner Does Not

- AWS SageMaker / GCP Vertex AI — Managed model training, deployment, and monitoring. If you need MLOps at scale, Hetzner's bare metal cannot compete.
- Spot/preemptible instances — AWS spot pricing can bring GPU costs down 60-70% for interruptible workloads. Hetzner has no equivalent.
- Global regions — AWS has 30+ regions worldwide. Hetzner has 3 European locations.
- Auto-scaling — Cloud providers scale GPU instances based on demand. Hetzner dedicated servers are fixed capacity.

Coolify vs Dokploy

- Coolify — More mature, better for multi-service deployments, built-in database management.
- Dokploy — Simpler, lighter footprint, good if Ollama is your primary workload.

The Decision Framework We Use at Effloow

- Need frontier intelligence (complex reasoning, creative work)? → Use API services.
- Need high-volume, predictable inference? → Self-host on Hetzner GPU.
- Need lightweight, always-on AI? → CX/CAX instance with small models.
- Need managed MLOps at scale? → Use AWS/GCP (we do not, but many teams should).

Choosing the Right Tier

CX23 (€3.99/mo) — Start Here If...

- You are experimenting with self-hosted AI for the first time
- You need a personal chatbot or simple RAG pipeline
- Your queries are infrequent and latency is not critical
- Budget is the primary constraint

CX33/CAX31 (€6.49-€10/mo) — Upgrade When...

- You need 7-8B models with slightly better response times
- You are running the AI alongside other services (Git, CI, monitoring)
- Multiple people on your team need occasional access

GEX44 (€184/mo) — The AI Sweet Spot If...

- You need interactive-speed inference (30+ tokens/second)
- You want to run 14B-32B models with real quality
- Multiple users need concurrent access
- You are building products or services that rely on AI inference
- Fine-tuning smaller models is part of your workflow

GEX131 — Production AI If...

- You need 70B+ models at 8-bit precision
- Multi-user production inference is a requirement
- You are fine-tuning large models regularly
- You need 96 GB VRAM for large embedding databases or multi-model serving
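The API test in the first-hour steps can be parameterized without touching the network. This sketch only builds the JSON payload (note the limitation in the comment: the prompt is not JSON-escaped, so keep it free of quotes); pipe it to curl once your server is up:

```shell
# Build an Ollama /api/generate request body. "stream": false asks for a single
# JSON response instead of a token stream. Caveat: arguments are not JSON-escaped,
# so avoid quotes/backslashes in the prompt with this simple version.
build_request() {
  printf '{"model": "%s", "prompt": "%s", "stream": false}\n' "$1" "$2"
}

req=$(build_request "llama3.2:3b" "Hello, how are you?")
echo "$req"

# Usage once Ollama is running:
#   curl http://localhost:11434/api/generate -d "$req"
```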