Tools: Making an Ollama Mesh with Nginx
2026-02-06 · admin
# Load Balancing Ollama with NGINX: Handling Long GPU Jobs and Dead Nodes Gracefully

Running Ollama on a single machine is easy. Running multiple Ollama instances across your LAN—and surviving GPU stalls, reboots, or long outages—is where things get interesting. This post walks through a production-grade NGINX upstream configuration for Ollama, explains how it behaves under load, and shows how to tune it when one machine might be down for minutes or hours.

## Why Ollama Needs Special Load Balancing

Ollama workloads are not typical web traffic:

- Requests are long-lived
- Execution time varies wildly (model + prompt dependent)
- GPUs saturate before CPUs
- A “slow” request is not a failure
- Nodes can vanish mid-generation

Classic round-robin load balancing performs poorly here. What you want instead:

- Adaptive request distribution
- Fast eviction of dead nodes
- Minimal retry thrashing
- Connection reuse
## Baseline NGINX Upstream Configuration

```nginx
upstream ollama_pool {
    least_conn;
    server 192.168.0.169:11434 max_fails=2 fail_timeout=60s;
    server 192.168.0.156:11434 max_fails=2 fail_timeout=60s;
    server 192.168.0.141:11434 max_fails=2 fail_timeout=60s;
    keepalive 64;
}
```

## least_conn: The Right Algorithm for GPUs

least_conn routes new requests to the backend with the fewest active connections. Why this works so well for Ollama:

- LLM requests are long-running
- Faster GPUs finish sooner
- Finished nodes naturally get more work

This gives you implicit weighting without hardcoding values.
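If one GPU is known to be much faster, you can still bias the pool explicitly: nginx's `weight` parameter combines with `least_conn` into weighted least-connections. A sketch using the same example addresses (the 2:1 ratio is an assumption for illustration, not from this post):

```nginx
upstream ollama_pool {
    least_conn;
    # Assumed: .169 has the strongest GPU, so let it carry twice the share
    server 192.168.0.169:11434 weight=2 max_fails=2 fail_timeout=60s;
    server 192.168.0.156:11434 weight=1 max_fails=2 fail_timeout=60s;
    server 192.168.0.141:11434 weight=1 max_fails=2 fail_timeout=60s;
    keepalive 64;
}
```

In a roughly homogeneous pool the implicit weighting described above is enough; explicit weights are only worth maintaining when hardware differs a lot.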
## Failure Handling for Long Downtime Nodes

```nginx
server 192.168.0.169:11434 max_fails=2 fail_timeout=60s;
```

With these parameters:

- After 2 failures within 60 seconds, the node is marked down for 60 seconds
- No traffic is sent to it during that time
- Afterward, NGINX retries it automatically

For known long outages, consider fail_timeout=300s.
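The fail_timeout=300s suggestion is easy to sketch; note that fail_timeout sets both the window in which max_fails is counted and how long the node is held out of rotation:

```nginx
# Node with a history of long outages: after 2 failures,
# keep it out of rotation for 5 minutes instead of 60 seconds.
server 192.168.0.141:11434 max_fails=2 fail_timeout=300s;
```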
## Connection Reuse Matters (keepalive)

```nginx
keepalive 64;
```
Together with the keepalive directive in the upstream block, this enables TCP connection reuse between NGINX and Ollama:

```nginx
proxy_http_version 1.1;
proxy_set_header Connection "";
```

The result:

- Fewer handshakes
- Lower latency
- Better streaming stability
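Both directives only help if they sit in the location that actually proxies to the pool. A minimal sketch (the `/` path is an assumption, not from this post):

```nginx
location / {
    proxy_pass http://ollama_pool;

    # Upstream keepalive needs HTTP/1.1 and an empty Connection header,
    # otherwise NGINX closes the backend connection after each request.
    proxy_http_version 1.1;
    proxy_set_header Connection "";
}
```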
## Detecting Failures Quickly (Timeouts)

```nginx
proxy_connect_timeout 2s;
proxy_send_timeout 30s;
proxy_read_timeout 30s;
```

- proxy_connect_timeout handles dead hosts fast
- proxy_read_timeout must be tuned: for streaming, lower is OK; for blocking generations, higher is needed
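For blocking (non-streaming) generations, a 30s read timeout is often too tight. A sketch of that tuning; `proxy_buffering off` is my addition for token streaming, not part of the original config:

```nginx
# Allow long model runs: up to 10 minutes between bytes from the backend.
proxy_read_timeout 600s;

# For streamed responses, pass tokens through as they arrive
# rather than buffering the whole body first.
proxy_buffering off;
```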
## Retrying Failed Requests

```nginx
proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
proxy_next_upstream_tries 3;
```

NGINX will retry failed requests on other nodes. ⚠️ LLM responses are not guaranteed idempotent.
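One mitigating detail: NGINX will not re-send a non-idempotent request (such as a POST to /api/generate) to another server once it has already been transmitted, unless `non_idempotent` is added to `proxy_next_upstream`. If you still want a more conservative policy, a sketch that skips 5xx retries and caps total retry time:

```nginx
# Retry only connection-level errors and timeouts, not HTTP 5xx,
# and stop trying new peers once 10 seconds have been spent on retries.
proxy_next_upstream error timeout;
proxy_next_upstream_tries 2;
proxy_next_upstream_timeout 10s;
```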
## Planned Maintenance: Disable a Node

```nginx
server 192.168.0.141:11434 down;
```

Reload NGINX to immediately remove the node from rotation.
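In context, the down flag goes inside the upstream block; the pool with one node drained would look like:

```nginx
upstream ollama_pool {
    least_conn;
    server 192.168.0.169:11434 max_fails=2 fail_timeout=60s;
    server 192.168.0.156:11434 max_fails=2 fail_timeout=60s;
    server 192.168.0.141:11434 down;   # planned maintenance
    keepalive 64;
}
```

`nginx -s reload` applies the change gracefully: new requests skip the drained node while in-flight connections on the others are unaffected.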
## Recommended Production Baseline

### Upstream

```nginx
upstream ollama_pool {
    least_conn;
    server 192.168.0.169:11434 max_fails=2 fail_timeout=60s;
    server 192.168.0.156:11434 max_fails=2 fail_timeout=60s;
    server 192.168.0.141:11434 max_fails=2 fail_timeout=60s;
    keepalive 64;
}
```
### Proxy Location

```nginx
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_connect_timeout 2s;
proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
proxy_next_upstream_tries 3;
```

Tune proxy_read_timeout based on streaming vs blocking usage.

## What This Setup Does Not Do

This configuration does not:

- Share model state
- Provide session stickiness
- Add authentication
- Expose Ollama safely to the internet

Use it on trusted networks or behind TLS + auth.

## Final Thoughts

This setup gives you:

- Smart GPU-aware load balancing
- Automatic failover
- Graceful handling of dead machines
- Minimal operational overhead

It’s a solid foundation for any serious Ollama deployment.
Tags: how-to, tutorial, guide, ai, llm, server, network, nginx, node