```bash
# Install and run locally
npx -y @maximhq/bifrost

# Or use Docker
docker run -p 8080:8080 maximhq/bifrost
```
```bash
# Open the built-in web interface
open http://localhost:8080
```
```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
  }'
```
```
App -> Bifrost Gateway -> [Cache Check]
  Hit?  -> Return cached response (0 tokens)
  Miss? -> Forward to LLM provider -> Cache response -> Return
```
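To make the flow concrete, here is a deliberately tiny sketch of a dual-layer lookup: an exact hash check first, then a similarity check over previously cached prompts. This is not Bifrost's implementation; `embed` is a crude bag-of-words stand-in for a real embedding model, and the 0.8 threshold is arbitrary. It only illustrates the order of the two checks.

```python
# Toy illustration of a dual-layer (exact hash + semantic) cache lookup.
import hashlib
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in 'embedding': a bag of lowercase words (a real system uses a model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class ToySemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.exact: dict[str, str] = {}                 # prompt hash -> cached response
        self.semantic: list[tuple[Counter, str]] = []   # (prompt embedding, cached response)
        self.threshold = threshold

    def lookup(self, prompt: str) -> str | None:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self.exact:                           # layer 1: exact hash hit
            return self.exact[key]
        query = embed(prompt)
        for vec, response in self.semantic:             # layer 2: semantic similarity
            if cosine(query, vec) >= self.threshold:
                return response
        return None                                     # miss: forward to the provider

    def store(self, prompt: str, response: str) -> None:
        self.exact[hashlib.sha256(prompt.encode()).hexdigest()] = response
        self.semantic.append((embed(prompt), response))

cache = ToySemanticCache()
cache.store("What are the benefits of microservices?", "<provider response>")
print(cache.lookup("What are the benefits of microservices?"))   # exact hit
print(cache.lookup("what are the benefits of microservices"))    # semantic hit
print(cache.lookup("Explain container orchestration"))           # miss -> None
```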
```yaml
version: '3.8'
services:
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:latest
    ports:
      - "8081:8080"
      - "50051:50051"
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'text2vec-transformers'
      ENABLE_MODULES: 'text2vec-transformers'
      TRANSFORMERS_INFERENCE_API: 'http://t2v-transformers:8080'
      CLUSTER_HOSTNAME: 'node1'
    volumes:
      - weaviate_data:/var/lib/weaviate
    restart: on-failure
  t2v-transformers:
    image: cr.weaviate.io/semitechnologies/transformers-inference:sentence-transformers-all-MiniLM-L6-v2
    environment:
      ENABLE_CUDA: '0'
    restart: on-failure
volumes:
  weaviate_data:
```
```bash
docker compose up -d
```
```bash
curl http://localhost:8081/v1/meta | python3 -m json.tool
```
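If you prefer checking from Python, the official weaviate-client (v4) can run the same readiness check. This is optional and assumes `pip install weaviate-client`; the ports match the compose mapping above (REST on 8081, gRPC on 50051).

```python
# Optional readiness check with the official Weaviate Python client (v4).
import weaviate

client = weaviate.connect_to_local(port=8081, grpc_port=50051)
try:
    print("Weaviate ready:", client.is_ready())
    print("Enabled modules:", list(client.get_meta().get("modules", {})))
finally:
    client.close()
```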
```bash
docker run -p 8080:8080 maximhq/bifrost
```
```bash
npx -y @maximhq/bifrost
```
```yaml
gateway:
  host: "0.0.0.0"
  port: 8080

cache:
  enabled: true
  type: "semantic"
  vector_store:
    provider: "weaviate"
    host: "http://localhost:8081"
  conversation_history_threshold: 3

accounts:
  - id: "production"
    providers:
      - id: "openai-main"
        type: "openai"
        api_key: "${OPENAI_API_KEY}"
        model: "gpt-4o"
        weight: 70
      - id: "anthropic-fallback"
        type: "anthropic"
        api_key: "${ANTHROPIC_API_KEY}"
        model: "claude-sonnet-4-20250514"
        weight: 30
```
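Because the config references API keys via `${...}`, a quick preflight saves a confusing startup failure. The sketch below assumes the file above is saved as `config.yaml` (the filename is an assumption, not something Bifrost requires) and that PyYAML is installed; it parses the YAML and confirms every referenced environment variable is set before you launch the gateway.

```python
# Preflight: parse the config and check the environment variables it references.
# Assumes the config is saved as config.yaml and PyYAML is installed (pip install pyyaml).
import os
import re
import yaml

with open("config.yaml") as f:
    raw = f.read()

yaml.safe_load(raw)  # fails loudly if the YAML is malformed

referenced = set(re.findall(r"\$\{(\w+)\}", raw))
missing = sorted(var for var in referenced if not os.environ.get(var))
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
print("Config parses and all referenced environment variables are set.")
```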
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="your-openai-api-key"
)

# First call - cache miss, hits the LLM provider
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "What are the benefits of microservices architecture?"}
    ]
)
print(response.choices[0].message.content)

# Second call - same query, exact cache hit, zero tokens
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "What are the benefits of microservices architecture?"}
    ]
)
print(response.choices[0].message.content)

# Third call - different wording, same intent, semantic cache hit
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Why should I use a microservices pattern?"}
    ]
)
print(response.choices[0].message.content)
```
```typescript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:8080/v1',
  apiKey: 'your-openai-api-key',
});

const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    { role: 'user', content: 'Explain container orchestration in simple terms' }
  ],
});

console.log(response.choices[0].message.content);
```
```bash
docker logs -f <bifrost-container-id>
```
- Docker and Docker Compose installed (docs)
- Weaviate as the vector store for semantic similarity matching
- Bifrost as the LLM gateway with caching enabled
- At least one LLM provider API key (OpenAI, Anthropic, etc.)

- Exact hash match - identical queries return cached responses instantly
- Semantic similarity - queries that mean the same thing but are worded differently also hit the cache

- cache.enabled: true turns on the dual-layer cache
- cache.type: "semantic" enables both exact hash and semantic similarity matching (not just exact match)
- vector_store.provider: "weaviate" points to your Weaviate instance
- conversation_history_threshold: 3 controls how much conversation context is used for cache key generation. The default is 3; higher values mean more context-sensitive cache matching but fewer hits.

- Cache hit rate (exact vs semantic)
- Total requests vs routed requests (routed = cache misses that hit a provider)
- Token usage per provider

- "How do I deploy to Kubernetes?" and "What is the process for deploying on k8s?"
- "Explain OAuth 2.0" and "How does OAuth2 authentication work?"
- Bifrost Semantic Caching Docs - full config reference
- Bifrost Setup Guide - getting started from scratch
- Weaviate Developer Docs - vector store configuration and modules
- Getting Started with Embeddings (HuggingFace) - how sentence embeddings work
- Redis Caching Patterns - general caching concepts for comparison