One Redis Instance, Three Jobs: DevOps for AI Agents Without the Overkill


The entire infrastructure: one docker compose yaml

What Redis actually does (and why it is elegant)

1. Conversation memory (checkpointer)

2. Long term memory (vector index)

3. Semantic cache

No state in the application

Ollama as a local inference server

Monitoring without extra tooling

The cost model is different

Backup and persistence

What I would do differently in production

Conclusion

I've built my share of microservices over the years. It usually ends with an architecture where every service has its own database, its own cache, its own queue, and a 200-line YAML file just to hold everything together in Docker Compose. When I started experimenting with AI agents, I expected the same story: a vector database here, a message queue there, a cache service, a state store. But it turns out that Redis Stack handles all of it, and it simplifies operations more than I expected.

The entire infrastructure: one docker compose yaml

That is it. Two containers. No Postgres, no Pinecone, no RabbitMQ, no separate Memcached. Redis handles all three jobs and Ollama runs models locally.

What Redis actually does (and why it is elegant)

1. Conversation memory (checkpointer)

LangGraph needs somewhere to persist conversation state: which messages were sent, which tools were called, and what the results were. RedisSaver handles that. Each thread (thread_id) gets its own history. Restart the agent and reconnect to the same thread, and it picks up right where it left off. No manual serialization, no migrations.

From an ops perspective, these are just regular Redis keys. You can inspect them in RedisInsight, set a TTL to automatically clean up old conversations, and monitor memory usage with INFO memory.

2. Long term memory (vector index)

RedisStore with a vector index gives you semantic search. Data is stored with embeddings and can be queried by meaning, not exact string matching. The point from a DevOps perspective: you do not need to run a separate vector database. No Milvus, no Qdrant, no Weaviate. Redis Stack includes RediSearch out of the box, and it is more than sufficient for this kind of workload.

3. Semantic cache

The same question phrased slightly differently, like "what is WCAG?" versus "explain WCAG to me", produces the same answer. Instead of sending both through the LLM, we cache responses based on vector proximity.
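To make the mechanism concrete, here is a toy in-memory version of distance-threshold caching — not the redisvl library used in this stack, just a minimal sketch of the idea: store (embedding, response) pairs and return a hit when a query embedding lands within the cosine-distance threshold. The class name and the three-dimensional example embeddings are made up for illustration.

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

class ToySemanticCache:
    """In-memory stand-in for a semantic cache."""

    def __init__(self, distance_threshold=0.1):
        self.distance_threshold = distance_threshold
        self.entries = []  # list of (embedding, response)

    def store(self, embedding, response):
        self.entries.append((embedding, response))

    def check(self, embedding):
        for stored, response in self.entries:
            if cosine_distance(embedding, stored) <= self.distance_threshold:
                return response  # cache hit: skip the LLM call
        return None  # cache miss: caller must run the LLM

# Two phrasings of the same question end up with nearby embeddings:
cache = ToySemanticCache(distance_threshold=0.1)
cache.store([0.9, 0.1, 0.0], "WCAG is the Web Content Accessibility Guidelines.")

hit = cache.check([0.88, 0.12, 0.01])   # near-duplicate embedding -> hit
miss = cache.check([0.0, 0.1, 0.9])     # unrelated embedding -> miss
```

The real SemanticCache (configured below) does the same thing, except the entries live in Redis and the embeddings come from the embedding model.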
A distance_threshold of 0.1 means queries with cosine distance ≤ 0.1 get cached responses. In practice, that means very similar questions. Where exactly to set this threshold depends on your embedding model (different models spread their vectors differently), so experiment. A TTL of 3600 seconds automatically cleans up stale data.

Three completely different use cases. Same redis://localhost:6379.

No state in the application

This is the biggest win for operations. The agent itself is stateless. All state lives in Redis. In practice: docker compose restart agent and the user notices nothing.

Ollama as a local inference server

Ollama abstracts away model management. You pull a model once, and it is then exposed as an HTTP API. From the agent, it looks like any other API call. No API key management, no rate limiting, no cost per token. Models run locally on your GPU. Want to swap models? Change an environment variable. No code changes; Ollama handles the rest.

Monitoring without extra tooling

Redis has built-in monitoring that goes a long way. RedisInsight (port 8001 in the docker compose) gives you a web UI to inspect keys, run queries, and view memory graphs. It is included in the redis-stack image. Need more? A Prometheus exporter exists for Redis (redis_exporter) and is straightforward to set up. But for most use cases, the built-in tools are enough.

The cost model is different

With cloud AI (OpenAI, Claude, etc.) you pay per token. That makes costs unpredictable: an agent that makes many tool calls can get expensive. With Ollama locally, the cost is fixed to your hardware. A machine with an RTX 4070 (12 GB VRAM) costs around $1,500 and runs qwen3.5:4b fast enough for production. The difference becomes dramatic at volume. And the local model works well enough for tasks with clear context. You can also mix them: run locally for 90% of traffic and fall back to a cloud model for complex queries.

Backup and persistence

Redis Stack with appendonly yes (the default in redis-stack) gives you AOF persistence. Every write is logged to disk. On restart, everything is restored. Ollama models are cached in /root/.ollama. Mount it as a Docker volume and models survive container restarts without needing to be downloaded again.
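The local/cloud mix mentioned in the cost section can be sketched as a small router. This is purely illustrative: the two model functions are stubs (in the real stack they would be, say, a local ChatOllama instance and a cloud client), and the escalation heuristic is an assumption you would tune for your own traffic.

```python
# Stub "models" standing in for a local and a cloud LLM client.
def local_model(prompt: str) -> str:
    return f"[local qwen3.5:4b] answer to: {prompt}"

def cloud_model(prompt: str) -> str:
    return f"[cloud qwen3-72b] answer to: {prompt}"

def route(prompt: str) -> str:
    """Send most traffic to the local model; escalate long or
    explicitly complex queries to the cloud model.

    The heuristic below (length cutoff, marker words) is a made-up
    example, not a recommendation."""
    complex_markers = ("step by step", "compare", "analyze")
    if len(prompt) > 500 or any(m in prompt.lower() for m in complex_markers):
        return cloud_model(prompt)
    return local_model(prompt)

short_answer = route("What is WCAG?")              # handled locally
long_answer = route("Analyze these audit results") # escalated to cloud
```

In practice you might route on conversation depth, tool-call count, or a cheap classifier instead of string matching, but the operational shape is the same: one call site, two backends.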
What makes this stack attractive for DevOps is its simplicity. Two containers. No external state. Standard monitoring tools. Predictable cost. Redis is no longer just a cache: with the Stack distribution, it replaces three or four services that would otherwise require separate operations. LangGraph abstracts away agent orchestration. Ollama turns LLM inference into a local service. Bottom line: less to operate, less that can break, easier to debug.

Stack: Redis Stack, LangGraph, Ollama. Everything runs in Docker. You need a GPU with ≥8 GB VRAM for local models, or point Ollama at a cloud endpoint.

Published: March 2026 | Daniel Gustafsson

The compose file:

```yaml
services:
  redis:
    image: redis/redis-stack:latest
    ports:
      - "6379:6379"
      - "8001:8001"  # RedisInsight
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5

  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_models:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:11434/api/tags || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 3

volumes:
  redis_data:
  ollama_models:
```

The checkpointer:

```python
with RedisSaver.from_conn_string("redis://localhost:6379") as checkpointer:
    checkpointer.setup()
    agent = create_react_agent(..., checkpointer=checkpointer)
```

The vector store:

```python
with RedisStore.from_conn_string(
    "redis://localhost:6379",
    index={
        "embed": embeddings,
        "dims": 768,
        "distance_type": "cosine",
        "fields": ["text"],
    },
) as store:
    store.setup()
```

The semantic cache:

```python
from redisvl.extensions.llmcache import SemanticCache

cache = SemanticCache(
    name="llm_cache",
    redis_url="redis://localhost:6379",
    distance_threshold=0.1,
    ttl=3600,
)
```

Pulling models:

```shell
ollama pull qwen3.5:4b        # 2.5 GB, requires ~4 GB VRAM
ollama pull nomic-embed-text  # 274 MB, for embeddings
```

Calling Ollama from the agent:

```python
model = ChatOllama(
    model="qwen3.5:4b",
    base_url="http://ollama:11434",  # Docker service name
)
```

Swapping models via environment variables:

```shell
CHAT_MODEL=qwen3.5:4b
EMBEDDING_MODEL=nomic-embed-text
```

Redis monitoring:

```shell
# Memory usage
redis-cli INFO memory | grep used_memory_human

# Key count
redis-cli DBSIZE

# Live command stream
redis-cli MONITOR

# Slow queries
redis-cli SLOWLOG GET 10
```

Ollama monitoring:

```shell
# Which models are loaded?
curl http://localhost:11434/api/tags

# How much VRAM is being used?
nvidia-smi
```

Backup:

```shell
# Snapshot
redis-cli BGSAVE
cp /data/dump.rdb /backup/redis.rdb

# Or copy the AOF
cp /data/appendonly.aof /backup/
```

All state lives in Redis:

- Conversation history: Redis (checkpointer)
- Saved memories: Redis (vector index)
- Cached responses: Redis (semantic cache)
- Scan history: Redis (vector index)

A stateless agent means you can:

- Restart the agent without losing anything
- Run multiple instances behind a load balancer
- Scale horizontally without shared state in memory
- Deploy new versions with zero downtime (rolling update)

Cost comparison:

- Cloud (e.g. Qwen 3 72b via OpenRouter): ~$0.005 per scan. 200 scans per day = $30 per month.
- Local (Qwen 3.5 4b): ~$0 per scan. Unlimited.

What I would do differently in production:

- Set maxmemory and an eviction policy: Redis without a memory limit on a shared machine is a ticking time bomb. maxmemory-policy allkeys-lru automatically evicts the oldest entries.
- TTL on everything that does not need to live forever: cached LLM responses, 1 hour; conversation history, 7 days; scan history, keep.
- Separate Redis instances per environment: dev, staging, and prod should not share data. Use key prefixes (dev:, staging:, prod:) or, ideally, separate Redis instances entirely. Avoid logical databases (/0, /1, /2): RediSearch and other modules only work on database 0, and clustering does not support them either.
- Health checks in docker compose: already included in the example above. If you add an agent service, use depends_on with condition: service_healthy so it does not start before Redis and Ollama are ready.
- Log token usage: even with local models, you want to know how much inference you are running. It helps with capacity planning.
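For the last point, the simplest thing that works is an in-process counter around your LLM calls. A minimal sketch, with one loud assumption: real token counts should come from your client library's response metadata; here they are approximated by whitespace word count, which is only good enough for trend lines.

```python
class UsageLog:
    """Rough token accounting for capacity planning."""

    def __init__(self):
        self.calls = 0
        self.prompt_tokens = 0
        self.completion_tokens = 0

    def record(self, prompt: str, completion: str) -> None:
        # Assumption: word count as a stand-in for real token counts.
        self.calls += 1
        self.prompt_tokens += len(prompt.split())
        self.completion_tokens += len(completion.split())

    def summary(self) -> dict:
        return {
            "calls": self.calls,
            "prompt_tokens": self.prompt_tokens,
            "completion_tokens": self.completion_tokens,
        }

log = UsageLog()
log.record("what is WCAG?", "WCAG is a set of accessibility guidelines.")
log.record("explain WCAG to me", "It defines levels A, AA and AAA.")
totals = log.summary()
```

Dump the summary to your logs periodically (or push it to the same Redis instance) and you have enough data to answer "do we need a bigger GPU?" before it becomes urgent.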