# I Built a Semantic Cache That Cuts LLM API Costs by 72% - What Actually Worked and What Didn't
2026-03-04
## The Results First

100 real Anthropic API calls. Three architectures tested. One that actually worked.

V3 Hybrid Engine — 100-query live benchmark:

The warm-up curve is the real story. The cache starts cold at a 42.9% hit rate on the first 10 queries. By query 20: 90%. By query 31: every single query hits cache. Queries 31–40 cost $0.00 — not approximately zero, literally zero dollars.

The system is called Intent Atoms. It sits between your application and any LLM API, using FAISS vector search and MPNet embeddings to match incoming queries against cached responses. When it finds a match, it returns the cached response in ~97ms instead of waiting 8–25 seconds for a fresh generation.

But the 87.5% number is the end of the story. The beginning was much uglier.

## V1: The Elegant Idea That Cost 3x More

My original hypothesis: most LLM queries are compound. "How do I deploy a React app with Docker on AWS?" is really three questions — React builds, Docker containerization, AWS deployment. If I could decompose queries into these atomic intents, cache each fragment, and recompose them for new queries, I could reuse fragments across completely different questions.

I built the full pipeline: decompose with Haiku, embed atoms, match via FAISS, generate missing atoms with Sonnet, compose fragments into a response.

10-query live benchmark results: negative cost savings. The decomposition overhead — an extra Haiku call to break the query apart, another to compose the response — exceeded whatever the cache saved.
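To make that failure concrete, here is a back-of-the-envelope cost model. The per-call prices below are illustrative assumptions I picked for the sketch, not measured figures from the benchmark:

```python
# Illustrative cost model for V1's decompose -> generate -> compose pipeline.
# Prices are assumed round numbers for the sketch, not measured values.

HAIKU_CALL = 0.001    # assumed cost of one cheap Haiku call (decompose/compose)
SONNET_CALL = 0.015   # assumed cost of one full Sonnet generation

def v1_query_cost(atoms_missing: int) -> float:
    """Cost of serving one V1 query: decompose, regenerate misses, compose."""
    return HAIKU_CALL + atoms_missing * SONNET_CALL + HAIKU_CALL

def uncached_cost() -> float:
    """Baseline: one direct Sonnet call, no caching machinery at all."""
    return SONNET_CALL

# Even a perfect cache (zero missing atoms) pays two Haiku calls of overhead,
# and a single missing atom already makes V1 pricier than no cache:
# v1_query_cost(1) = 0.017 > 0.015 = uncached_cost()
```

Under these assumed prices, the cache has to eliminate more than two Haiku calls' worth of generation on every query just to break even, which a cold or partially warm cache cannot do.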
Every query was paying for three LLM calls instead of one.

The decomposer did catch overlaps that simpler systems missed. Query 5 ("Deploy Flask with Docker") matched atoms from Query 1 ("Deploy React with Docker") — the Docker and AWS atoms were reused. But the overhead of decomposing and composing ate the savings alive.

## V2: Simple FAISS — Fast but Blind

After V1 failed, I read five research papers: GPTCache, MeanCache, GPT Semantic Cache, and two 2025 papers on domain-specific embeddings and cache eviction. The universal finding: every successful system caches at the full-query level with FAISS vector search. None use sub-query decomposition.

V2 was radically simple: embed the full query with MPNet, search FAISS, return the cached response if similarity > 0.83, otherwise generate and cache.

10-query live benchmark results: only one hit in 10 queries — Query 10, an exact repeat of Query 1 (similarity = 1.000, cost = $0.000, time = 58ms). Every other query missed because the semantic variations were too different for the threshold to catch.

V2 proved the mechanism worked — that one hit was free and instant. But a 10% hit rate isn't useful.

## V3: The Hybrid That Actually Works

The answer was combining both approaches: use fast full-query matching as the primary strategy, and fall back to expensive atom decomposition only when the query is genuinely novel. Three tiers, cheapest first: direct hit, adapt, full miss (the full breakdown is in the tier list below).

The adaptation tier is where the real savings come from. When someone asks "How to deploy a Flask app with Docker on AWS?" and I already have a cached response for "How to deploy a React app with Docker on AWS?", the similarity is 0.739 — too low for a direct hit, but the answer is 80% the same. A cheap Haiku call adapts it for $0.002 instead of $0.015 for a full Sonnet generation. That's an 87% cost reduction on that single query.

10-query live comparison (all three engines, same queries): on 10 queries, V3 is cheapest and fastest. But the real proof is at scale: 10 topics, 10 paraphrases each, shuffled randomly, real Anthropic API calls with real money.
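The full-query matching underneath Tiers 1 and 2 is compact enough to sketch. This is an illustration, not the repo's code: the embedding function is injected (the real system uses MPNet via sentence-transformers), and a plain numpy inner product stands in for FAISS IndexFlatIP, which computes the same thing on L2-normalized vectors:

```python
import numpy as np

class FullQueryCache:
    """Minimal full-query semantic cache: embed, nearest neighbor, threshold.
    Illustrative sketch -- embed_fn is injected, numpy stands in for FAISS."""

    def __init__(self, embed_fn, threshold: float = 0.85):
        self.embed_fn = embed_fn          # maps query string -> 1-D vector
        self.threshold = threshold
        self.vectors: list[np.ndarray] = []
        self.responses: list[str] = []

    def _embed(self, query: str) -> np.ndarray:
        v = np.asarray(self.embed_fn(query), dtype=np.float64)
        return v / np.linalg.norm(v)      # normalize: inner product == cosine

    def lookup(self, query: str):
        """Return (response, similarity) on a hit, else (None, best_similarity)."""
        if not self.vectors:
            return None, 0.0
        q = self._embed(query)
        sims = np.stack(self.vectors) @ q  # inner products, like IndexFlatIP
        best = int(np.argmax(sims))
        if sims[best] >= self.threshold:
            return self.responses[best], float(sims[best])
        return None, float(sims[best])

    def store(self, query: str, response: str):
        self.vectors.append(self._embed(query))
        self.responses.append(response)
```

On a miss the caller generates with the LLM and calls `store()`; an exact repeat then comes back with similarity 1.0, which is exactly the one hit V2 scored on Query 10.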
## The 100-Query Benchmark

Cache warm-up curve (cost and hit rate per 10-query block): block 31–40 is the highlight. Ten queries, ten direct cache hits, zero dollars. By this point the cache has seen enough variations of each topic that every new paraphrase lands above the 0.85 similarity threshold.

The dip at block 41–50 (90% hit rate, $0.019) is real — one query introduced a new enough phrasing to trigger an adaptation instead of a direct hit. But even that adapted response costs only $0.002–0.004 vs $0.007–0.028 for a full generation.

Tier breakdown across all 100 queries: 91 out of 100 queries served from cache. Total: $0.244 vs $0.866 without caching.

## What the Numbers Don't Tell You

The benchmark is favorable. 10 topics × 10 paraphrases means high overlap by design. Real-world query distributions have a long tail — 60–70% of queries are unique in most production systems. The research shows that semantic caching works best on narrow, repetitive domains: customer support bots, educational platforms, internal knowledge bases. An EdTech study achieved a 45.1% hit rate on real student queries — not 87%. That's a more realistic number for production.

Atom-level matching barely fires. Only 2 of 100 queries reached the atom decomposition layer. The full-query matching with adaptation handles almost everything. Layer 2 adds complexity without proportional value in this benchmark.

The decomposer is inconsistent. Haiku sometimes produces different atom breakdowns for similar queries, which reduces atom-level cache hits. This was the core problem with V1, and it persists as a fallback limitation in V3.

The embedding model choice matters enormously. SHA-256 hashes (my V1 mistake) gave a 0% hit rate. MPNet gives 87.5%. Same architecture, completely different results.

## What I'd Build Differently

Skip atom decomposition entirely. The 2-hit contribution from Layer 2 doesn't justify the code complexity. A two-tier system (direct hit + adaptation) would achieve nearly identical results with half the codebase.

Add conversation context.
The current system treats each query independently. Follow-up questions like "What about using Kubernetes instead?" require the prior context to make sense. MeanCache addresses this with context chains — worth implementing for any production deployment.

Fine-tune the embedding model. The 2025 domain-specific embeddings paper showed that general-purpose models miss domain paraphrases. "Containerize a React frontend" and "Put React in a Docker image" are the same intent but look different to MPNet. Domain-specific fine-tuning would push the similarity scores higher and catch more near-misses.

## Try It

The code is open source under the MIT license: sub-query-level semantic caching for LLM APIs with FAISS vector search. It reduces API costs by up to 71.8% with a hybrid 3-tier engine that matches at the full-query, adapted, and atomic-intent levels. Tested on 100 real Anthropic API calls: 87.5% cache hit rate, $0.24 vs $0.87 without cache, 54 zero-cost direct hits. The cache starts cold and improves with every query; by query 30, the hit rate reaches 100% per block.

Live dashboard: intent-atoms.vercel.app

The repo includes all three engine versions, the complete benchmark suite with terminal screenshots, a FastAPI REST API, and a React analytics dashboard.
```shell
git clone https://github.com/vinaybudideti/intent-atoms.git
cd intent-atoms
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python tests/benchmark_100.py  # run the 100-query benchmark yourself
```
The three tiers of the V3 engine, cheapest first:

- Tier 1 — Direct hit (similarity > 0.85): Return cached response. Zero cost. ~97ms. This caught 54 of 100 queries.
- Tier 2 — Adapt (similarity 0.70–0.85): Take the closest cached response and use Haiku to tweak it for the new query. ~$0.002. This caught 35 queries.
- Tier 3 — Full miss (similarity < 0.70): Fall through to atom-level decomposition. Only 9 queries reached this tier.

## The Technical Stack

- Embeddings: sentence-transformers/all-mpnet-base-v2 (768-dim, runs locally — no API cost for embedding)
- Vector search: FAISS IndexFlatIP (cosine similarity via inner product on normalized vectors)
- LLM providers: Anthropic Claude — Haiku for cheap operations (decompose, adapt, compose), Sonnet for generation
- API: FastAPI with async support
- Dashboard: React + Recharts, deployable to Vercel
- Persistence: JSON metadata + binary FAISS index files
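The three-tier dispatch described above reduces to a threshold ladder over the best-match similarity score. A minimal sketch of that routing logic (the function and enum names are mine, not necessarily the repo's):

```python
from enum import Enum

class Tier(Enum):
    DIRECT = "direct"   # similarity > 0.85: return cached response, zero cost
    ADAPT = "adapt"     # 0.70-0.85: cheap Haiku call rewrites closest answer
    MISS = "miss"       # < 0.70: fall through to atom decomposition / full gen

def route(similarity: float, direct: float = 0.85, adapt: float = 0.70) -> Tier:
    """Map the best-match similarity score to the cheapest viable tier."""
    if similarity > direct:
        return Tier.DIRECT
    if similarity >= adapt:
        return Tier.ADAPT
    return Tier.MISS
```

The Flask-vs-React example from earlier (similarity 0.739) lands in the adapt tier: above the 0.70 floor, below the 0.85 direct-hit bar.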