Tools: How to Cut LLM Token Spend with Semantic Caching: A Production Setup Guide (2026)

What We Are Building

maximhq / bifrost

Fastest enterprise AI gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.

Bifrost AI Gateway

The fastest way to build AI applications that never go down

Quick Start

Prerequisites

Step 1: Deploy Weaviate for Vector Storage

Step 2: Configure Bifrost with Semantic Caching Enabled

Step 3: Point Your LLM Calls Through Bifrost

Step 4: Monitor Cache Hits and Token Savings

How It Works: Exact Hash vs Semantic Similarity

Results: What I Measured After Running This for a Week

Further Reading

TL;DR: Semantic caching intercepts LLM API calls and returns cached responses for similar queries, skipping the provider entirely. Zero tokens are consumed on cache hits. I set this up with Bifrost and Weaviate in under 30 minutes, and it started saving tokens on the first day.

What We Are Building

A semantic cache layer that sits between your application and LLM providers. Every API call passes through the cache first. If the query matches a previous one (exact match or semantically similar), the cached response is returned instantly. No LLM call, no tokens billed.

Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features. You can go from zero to a production-ready gateway in under a minute: start the Bifrost gateway, configure it via the web UI, and make your first API call. That's it! Your gateway is running with a web interface for visual configuration.

The end result: repeated and similar queries cost nothing. For workloads with common patterns (customer support, code generation, FAQ bots), the savings add up fast.

Prerequisites

You need three things: Weaviate as the vector store, Bifrost as the LLM gateway, and at least one LLM provider API key, with Docker and Docker Compose installed to run the first two. Everything runs locally. No cloud accounts are needed beyond your LLM provider key.

Step 1: Deploy Weaviate for Vector Storage

Weaviate stores the vector embeddings that power semantic matching. When a new query comes in, Bifrost converts it to a vector and checks Weaviate for similar past queries.

Create a docker-compose.yml (the full listing is in the command reference at the end of this post), bring it up, and verify Weaviate is running. You should see a JSON response with version info. If you get "connection refused", give it 30 seconds for the transformer model to load. For more on Weaviate's architecture and vectoriser modules, check their docs.

Step 2: Configure Bifrost with Semantic Caching Enabled

Bifrost is an open-source LLM gateway written in Go: 11 microsecond latency overhead, 5,000 RPS throughput. The part that matters here is that it has dual-layer caching built in.
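Before moving on, the Step 1 readiness check is worth scripting rather than eyeballing. Here is a minimal sketch that polls Weaviate's `/v1/meta` endpoint (port 8081 as in the compose file) until it answers; the function name and timeout values are my own, not part of either tool.

```python
import json
import time
import urllib.request


def wait_for_weaviate(base_url="http://localhost:8081", timeout_s=60):
    """Poll /v1/meta until Weaviate answers or the timeout expires.

    The transformer module can take ~30 s to load on first start, so
    "connection refused" early on is expected.
    """
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/meta", timeout=2) as resp:
                meta = json.load(resp)
                return meta.get("version", "unknown")  # e.g. "1.25.x"
        except OSError:
            time.sleep(2)  # model still loading; retry
    raise TimeoutError(f"Weaviate not reachable at {base_url}")
```

Call it once after `docker compose up -d` and proceed only when it returns a version string.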
Dual-layer means two cache checks run on every request: an exact hash match for identical queries, and a semantic similarity search for queries that mean the same thing but are worded differently.

Run the gateway with Docker, or with npx if you prefer. Then configure it: create a config.yaml (full listing in the command reference at the end of this post). Full configuration options are in the Bifrost docs.

Step 3: Point Your LLM Calls Through Bifrost

Bifrost exposes a drop-in replacement for the OpenAI SDK. Change your base URL and everything else stays the same.

The first call goes to OpenAI: tokens are consumed and the response is cached. The second call is identical, so the exact hash matches and the response comes from cache. The third call is worded differently but semantically similar; Weaviate's vector search finds the match, and the response comes from cache again. Both cache hits skip the LLM provider entirely. Zero tokens. Zero cost.

Node.js (OpenAI SDK) follows the same pattern: point the base URL at Bifrost, and caching is transparent to your application code. If you are using the Anthropic SDK, Bifrost supports that too; the Anthropic SDK integration page has the details.

Step 4: Monitor Cache Hits and Token Savings

Once traffic is flowing, you want to see what is hitting the cache versus what is going through to providers. Bifrost exposes metrics that let you track the cache hit rate (exact vs semantic), total versus routed requests, and token usage per provider.

Check your Bifrost logs to see cache behaviour in real time. Each request will indicate whether it was served from cache or forwarded to a provider. Track the ratio over time. On workloads with repeated query patterns, the cache hit rate climbs quickly within the first few hours.

How It Works: Exact Hash vs Semantic Similarity

A quick breakdown of the two cache layers.

Exact hash matching is straightforward. The entire request (messages, model, parameters) is hashed. If an identical request has been seen before, the cached response is returned. This is fast and deterministic: same input, same output.

Semantic similarity is where it gets interesting. When no exact match exists, Bifrost converts the query into a vector embedding using the transformer model running in Weaviate. It then searches for existing cached queries that are semantically close. If the similarity score is above the threshold, the cached response is returned. This is what catches queries that use different words for the same intent.
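The two layers can be sketched in a few lines. This is a toy in-memory model of the idea, not Bifrost's internals: the `embed` function stands in for Weaviate's transformer module, and the 0.85 cosine threshold is illustrative.

```python
import hashlib
import json
import math


def exact_key(model, messages):
    # Layer 1: deterministic hash over the full request
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TwoLayerCache:
    def __init__(self, embed, threshold=0.85):
        self.embed = embed        # text -> vector; Weaviate's transformer in the real setup
        self.threshold = threshold
        self.exact = {}           # request hash -> cached response
        self.vectors = []         # (embedding, cached response)

    def get(self, model, messages):
        key = exact_key(model, messages)
        if key in self.exact:                        # layer 1: exact hash match
            return self.exact[key]
        query_vec = self.embed(messages[-1]["content"])
        for vec, response in self.vectors:           # layer 2: semantic similarity
            if cosine(query_vec, vec) >= self.threshold:
                return response
        return None                                  # miss: forward to the provider

    def put(self, model, messages, response):
        self.exact[exact_key(model, messages)] = response
        self.vectors.append((self.embed(messages[-1]["content"]), response))
```

On a `get`, a miss means the request goes to the provider and the response is stored with `put`, which is exactly the flow in the diagram at the end of this post.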
One LLM call instead of two.

The conversation_history_threshold setting controls how many previous messages in a conversation are included when generating the cache key. At the default of 3, Bifrost uses the last 3 messages for context. This prevents a cached response from a different conversation context from being returned incorrectly. For more on how sentence embeddings power this kind of similarity search, HuggingFace has a solid primer.

Results: What I Measured After Running This for a Week

I ran this setup against three different workloads for seven days. Here is what I observed.

Customer support bot (repetitive queries): Highest cache hit rate. Users ask variations of the same 50-100 questions. After the first day, the cache warmed up and a large portion of queries were served from cache. Semantic matching caught the paraphrased versions that exact hash would miss.

Code generation assistant (moderate repetition): Lower hit rate than customer support, but still meaningful. Common patterns like "write a function to parse JSON" or "create a REST endpoint" showed up repeatedly with slight variations. Semantic caching caught many of these.

Open-ended research queries (low repetition): Lowest hit rate, as expected. Each query was unique enough that neither exact nor semantic matching triggered often. Caching still helped with follow-up questions that rephrased earlier queries.

Latency on cache hits: Near-instant. The Weaviate vector lookup adds milliseconds, but compared to a full LLM round trip (typically 500 ms to 3 s), cache hits felt instantaneous.

Gateway overhead: Bifrost's 11 microsecond latency overhead held up. The caching layer adds the Weaviate lookup time on hits and misses, but the gateway itself adds almost nothing.

The workloads where semantic caching pays off most are the ones with natural query repetition: customer support, internal knowledge bases, FAQ systems, onboarding assistants. If your users ask the same things in different ways, you are paying for the same answer multiple times.
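The windowed-key idea behind conversation_history_threshold, described above, is easy to sketch. This is my own illustration of the concept, not Bifrost's actual key format: only the last N messages contribute to the key, so the same question asked in unrelated earlier contexts still collides, while a different recent turn does not.

```python
import hashlib
import json


def windowed_cache_key(messages, model, history_threshold=3):
    # Only the last `history_threshold` messages contribute to the key,
    # mirroring the idea behind conversation_history_threshold.
    window = messages[-history_threshold:]
    payload = json.dumps({"model": model, "messages": window}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```

Raising the threshold makes the key more context-sensitive (fewer false matches across conversations) at the cost of fewer cache hits, which is the trade-off the setting exposes.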
For reference, here is what OpenAI charges per token and what Anthropic charges. On GPT-4o at current pricing, even a moderate cache hit rate translates to real savings on a monthly bill.

If you are running LLM workloads with any kind of query repetition, set up semantic caching before optimising anything else. It is the lowest-effort, highest-impact cost reduction I have found.

Bifrost GitHub | Docs | Website
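To put the pricing point in numbers, here is a back-of-the-envelope estimator. The function and the prices in the example call are placeholders, not current OpenAI or Anthropic rates; plug in the figures from the provider pricing pages.

```python
def monthly_cache_savings(requests_per_day, avg_input_tokens, avg_output_tokens,
                          cache_hit_rate, input_price_per_1m, output_price_per_1m,
                          days=30):
    # Spend avoided by requests served from cache instead of the provider.
    cached_requests = requests_per_day * days * cache_hit_rate
    cost_per_request = (avg_input_tokens * input_price_per_1m +
                        avg_output_tokens * output_price_per_1m) / 1_000_000
    return cached_requests * cost_per_request


# 10k requests/day at a 40% hit rate, with placeholder per-million-token prices
savings = monthly_cache_savings(10_000, 300, 500, 0.40, 2.50, 10.00)
print(f"${savings:.2f} avoided per month")
```

Even at a modest hit rate the avoided spend scales linearly with traffic, which is why warm-cache workloads like support bots see it first.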


Here are the complete commands, configuration files, and reference lists used in the steps above.

Start Bifrost:

```shell
# Install and run locally
npx -y @maximhq/bifrost

# Or use Docker
docker run -p 8080:8080 maximhq/bifrost

# Open the built-in web interface
open http://localhost:8080
```

Make your first API call through the gateway:

```shell
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
  }'
```

Request flow:

```
App -> Bifrost Gateway -> [Cache Check] -> Hit?  -> Return cached response (0 tokens)
                                        -> Miss? -> Forward to LLM provider -> Cache response -> Return
```

docker-compose.yml for Weaviate:

```yaml
version: '3.8'
services:
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:latest
    ports:
      - "8081:8080"
      - "50051:50051"
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'text2vec-transformers'
      ENABLE_MODULES: 'text2vec-transformers'
      TRANSFORMERS_INFERENCE_API: 'http://t2v-transformers:8080'
      CLUSTER_HOSTNAME: 'node1'
    volumes:
      - weaviate_data:/var/lib/weaviate
    restart: on-failure
  t2v-transformers:
    image: cr.weaviate.io/semitechnologies/transformers-inference:sentence-transformers-all-MiniLM-L6-v2
    environment:
      ENABLE_CUDA: '0'
    restart: on-failure
volumes:
  weaviate_data:
```

Start Weaviate and verify it is up:

```shell
docker compose up -d
curl http://localhost:8081/v1/meta | python3 -m json.tool
```

config.yaml for Bifrost:

```yaml
gateway:
  host: "0.0.0.0"
  port: 8080

cache:
  enabled: true
  type: "semantic"
  vector_store:
    provider: "weaviate"
    host: "http://localhost:8081"
  conversation_history_threshold: 3

accounts:
  - id: "production"
    providers:
      - id: "openai-main"
        type: "openai"
        api_key: "${OPENAI_API_KEY}"
        model: "gpt-4o"
        weight: 70
      - id: "anthropic-fallback"
        type: "anthropic"
        api_key: "${ANTHROPIC_API_KEY}"
        model: "claude-sonnet-4-20250514"
        weight: 30
```

Python (OpenAI SDK):

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="your-openai-api-key"
)

# First call - cache miss, hits the LLM provider
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "What are the benefits of microservices architecture?"}
    ]
)
print(response.choices[0].message.content)

# Second call - same query, exact cache hit, zero tokens
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "What are the benefits of microservices architecture?"}
    ]
)
print(response.choices[0].message.content)

# Third call - different wording, same intent, semantic cache hit
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Why should I use a microservices pattern?"}
    ]
)
print(response.choices[0].message.content)
```

Node.js (OpenAI SDK):

```javascript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:8080/v1',
  apiKey: 'your-openai-api-key',
});

const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    { role: 'user', content: 'Explain container orchestration in simple terms' }
  ],
});
console.log(response.choices[0].message.content);
```

Watch cache behaviour in the logs:

```shell
docker logs -f <bifrost-container-id>
```

Prerequisites:

- Docker and Docker Compose installed (docs)
- Weaviate as the vector store for semantic similarity matching
- Bifrost as the LLM gateway with caching enabled
- At least one LLM provider API key (OpenAI, Anthropic, etc.)

The two cache layers:

- Exact hash match - identical queries return cached responses instantly
- Semantic similarity - queries that mean the same thing but are worded differently also hit the cache

Key config options:

- cache.enabled: true turns on the dual-layer cache
- cache.type: "semantic" enables both exact hash and semantic similarity (not just exact match)
- vector_store.provider: "weaviate" points to your Weaviate instance
- conversation_history_threshold: 3 controls how much conversation context is used for cache key generation. Default is 3. Higher values mean more context-sensitive cache matching but fewer hits.

Metrics to track:

- Cache hit rate (exact vs semantic)
- Total requests vs routed requests (routed = cache misses that hit a provider)
- Token usage per provider

Example query pairs caught by semantic matching:

- "How do I deploy to Kubernetes?" and "What is the process for deploying on k8s?"
- "Explain OAuth 2.0" and "How does OAuth2 authentication work?"

Further Reading:

- Bifrost Semantic Caching Docs - full config reference
- Bifrost Setup Guide - getting started from scratch
- Weaviate Developer Docs - vector store configuration and modules
- Getting Started with Embeddings (HuggingFace) - how sentence embeddings work
- Redis Caching Patterns - general caching concepts for comparison