Semantic Caching Cut Our LLM Costs by 40%
2025-12-12
admin
## The Problem

Our agent answers questions about product documentation. Users ask the same questions differently:

- "How do I reset my password?"
- "What's the password reset process?"
- "I forgot my password, how do I change it?"

All three hit the LLM. Same answer, three separate API calls. Three separate charges.

Exact-match caching doesn't help because the queries aren't identical. We needed something smarter.

## Enter Semantic Caching

Instead of matching exact strings, semantic caching matches meaning:

- Generate an embedding for the incoming query
- Search for similar queries in the cache (cosine similarity)
- If similarity > threshold (e.g., 0.95), return the cached response
- Otherwise, call the LLM and cache the result

```
Query 1: "How do I reset my password?"
→ No cache hit
→ Call LLM ($0.002)
→ Cache: embedding + response

Query 2: "What's the password reset process?"
→ Similarity: 0.97 (cache hit!)
→ Return cached response ($0.000)
→ Saved: $0.002, 800ms latency

Query 3: "I forgot my password, help?"
→ Similarity: 0.96 (cache hit!)
→ Return cached response ($0.000)
→ Saved: $0.002, 800ms latency
```

## Real Results

Our production numbers (30 days):

```
Total requests: 45,000
Cache hits: 18,000 (40%)
Cost saved: $76
Latency saved: ~14,400 seconds

Average response time:
├─ Cache miss: 1.2s
└─ Cache hit: 0.05s (24x faster)
```

A 40% cache hit rate might not sound impressive, but that's 40% of requests that are instant and free.

## When Semantic Caching Works

It works well for:

- Documentation Q&A (same questions, different wording)
- Customer support (common issues asked repeatedly)
- Code explanation (similar code patterns)
- Translation tasks (same phrases)

It works poorly for:

- Queries requiring current data ("today's weather")
- Personalized responses (user-specific context)
- Creative generation (you want variety, not cached outputs)
- Low-traffic endpoints (not enough queries to benefit)
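The lookup flow can be sketched in a few lines. This is an illustrative toy, not Bifrost's implementation: `embed` here is a bag-of-words stand-in so the example runs without an API key, and `SemanticCache` is a hypothetical in-memory class.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding" so the sketch runs without an API key.
    # A real deployment would call a model like text-embedding-3-small.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached_response)

    def lookup(self, query):
        q = embed(query)
        for emb, response in self.entries:
            if cosine(q, emb) >= self.threshold:
                return response  # cache hit: skip the LLM call
        return None  # cache miss: caller invokes the LLM, then store()s the result

    def store(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache(threshold=0.9)
cache.store("how do i reset my password", "Go to Settings > Security > Reset.")
print(cache.lookup("how do i reset my password"))  # hit → cached answer
print(cache.lookup("what is today's weather"))     # miss → None
```

With real embeddings, paraphrases like "What's the password reset process?" would also score above the threshold; the bag-of-words stand-in only matches overlapping words, which is exactly the limitation semantic caching removes.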
## Implementation in Bifrost

Bifrost has semantic caching built in. Here's how to enable it.

### 1. Configure caching

```json
{
  "cache": {
    "enabled": true,
    "similarity_threshold": 0.95,
    "ttl_seconds": 3600,
    "embedding_model": "text-embedding-3-small"
  }
}
```

### 2. That's it

The gateway automatically:

- Generates embeddings for incoming queries
- Searches the cache for semantically similar queries
- Returns cached responses when similarity exceeds the threshold
- Updates the cache with new responses on cache misses

### Dashboard visibility

The Bifrost dashboard shows:

- Cache hit rate
- Cost savings
- Latency improvements

## The Tradeoffs

What you gain:

- Massive cost savings (~40% in our case)
- Much faster responses (~24× faster in our case)
- Zero application code changes required

What you pay:

- Embedding generation adds ~50 ms latency on cache misses
- Cache storage costs (minimal; embeddings are small)
- Potential for stale responses if underlying data changes frequently
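The staleness risk is typically mitigated with time-based expiry. A minimal sketch of TTL expiry, independent of Bifrost — the `now` parameter is injected purely so the behavior is easy to demonstrate without waiting an hour:

```python
import time

class TTLCache:
    """Exact-key cache with time-based expiry; each entry is (expires_at, value)."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.data = {}

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self.data.get(key)
        if entry and entry[0] > now:
            return entry[1]          # still fresh
        self.data.pop(key, None)     # expired (or absent): drop it
        return None

    def set(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        self.data[key] = (now + self.ttl, value)

cache = TTLCache(ttl_seconds=3600)
cache.set("reset-password", "Go to Settings > Security.", now=0.0)
print(cache.get("reset-password", now=1800.0))  # within the hour → cached answer
print(cache.get("reset-password", now=7200.0))  # past the TTL → None
```

A semantic cache adds the similarity lookup on top of this, but the expiry logic is the same: stale entries simply fall out after the TTL.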
## Cost Breakdown

At a 40% cache hit rate (18,000 hits):

```
Cost calculation:
├─ Embedding: $0.00002 per query
├─ LLM call: $0.00200 per query
└─ Net savings per cache hit: $0.00198
```

Savings on the 18,000 cache hits:

```
18,000 × $0.00198 = $35.64 saved
```

The $0.00198 figure already accounts for the embeddings on the hits themselves, so what's left to subtract is the embedding cost on the 27,000 cache misses:

```
27,000 × $0.00002 = $0.54
```

Even accounting for embedding costs, the savings are substantial:

```
$35.64 − $0.54 = $35.10 saved per 45k queries
```
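A quick script to sanity-check the arithmetic. The per-query prices are the post's own figures, not official rate-card numbers; note that the 18,000 hits' embedding cost is already inside the $0.00198 per-hit figure, so only the 27,000 misses' embeddings remain to be subtracted:

```python
embed_cost = 0.00002   # $ per query, paid on every request (hit or miss)
llm_cost   = 0.00200   # $ per query, avoided on every cache hit

requests, hits = 45_000, 18_000
misses = requests - hits

net_per_hit = llm_cost - embed_cost                   # $0.00198: hit embeds already netted out
savings = hits * net_per_hit - misses * embed_cost    # misses still pay for embeddings

print(f"${savings:.2f} saved per {requests:,} queries")  # $35.10 saved per 45,000 queries
```

Equivalently: gross LLM spend avoided (18,000 × $0.00200 = $36.00) minus embedding overhead on all 45,000 queries ($0.90) gives the same $35.10.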
## Tuning the Similarity Threshold

Similarity threshold selection is critical:

```
0.90 → High hit rate, higher risk of incorrect cache hits
0.95 → Balanced (our recommended default)
0.98 → Safer, but lower hit rate
```

Our test results:

```
0.90 → 52% hit rate, ~3% incorrect cache hits
0.95 → 40% hit rate, ~0.5% incorrect cache hits
0.98 → 28% hit rate, ~0.1% incorrect cache hits
```

Start at 0.95, then tune based on your accuracy and freshness requirements.
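One way to run such a sweep offline: collect logged queries, record each query's similarity to its nearest cached entry, hand-label whether the cached answer would actually have been correct, then measure hit rate and incorrect-hit rate at candidate thresholds. The similarity numbers below are invented purely for illustration:

```python
# (cosine similarity to nearest cached query, whether the cached answer was correct)
# Hand-labeled from logs; these eight pairs are made up for the example.
pairs = [
    (0.99, True), (0.97, True), (0.96, True), (0.95, True), (0.93, True),
    (0.94, False), (0.91, False), (0.88, False),
]

def evaluate(threshold):
    hits = [(s, ok) for s, ok in pairs if s >= threshold]
    incorrect = [s for s, ok in hits if not ok]
    hit_rate = len(hits) / len(pairs)
    incorrect_rate = len(incorrect) / len(hits) if hits else 0.0
    return hit_rate, incorrect_rate

for t in (0.90, 0.95, 0.98):
    hit_rate, bad = evaluate(t)
    print(f"{t:.2f} → {hit_rate:.0%} hit rate, {bad:.0%} incorrect hits")
```

Even with toy data, the shape matches the production numbers above: lowering the threshold buys hit rate at the cost of incorrect hits, and raising it does the reverse.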
## Cache Invalidation

**Time-based (TTL):** Set an expiration time. Good for data that changes predictably.

```
{
  "ttl_seconds": 3600  // 1 hour
}
```

**Manual invalidation:** Clear the cache when you update documentation or data sources.

```
curl -X POST http://localhost:8080/cache/clear
```

**Selective clearing:** Tag cache entries by topic, then clear specific topics when they're updated.

## Try It Yourself

Bifrost is open source and MIT licensed:

```
git clone https://github.com/maximhq/bifrost
cd bifrost
docker compose up
```

Enable semantic caching in the UI settings, then monitor the dashboard to see cache hit rates and cost savings in real time. Full implementation details are in the GitHub repo.

## The Bottom Line

Semantic caching is the easiest optimization we've implemented:

- Zero code changes
- 40% cost reduction
- 24x faster responses on cache hits

If you're making repeated LLM calls with similar queries, semantic caching pays for itself immediately.

Built by the team at Maxim AI. We also build evaluation and observability tools for AI agents.