Redis Caching in RAG: Normalized Queries, Semantic Traps & What Actually Worked
2025-12-28
When I first added Redis caching to my RAG API, the motivation was simple: latency was creeping up, costs were rising, and many questions looked repetitive. Caching felt like the obvious win.

But once I went beyond the happy path, I realized caching in RAG isn't about Redis at all. It's about what you choose to cache and how safely you decide two queries are "the same".

This post walks through:

- why Redis caching works for RAG
- what a normalized query really means
- why semantic caching is tempting but dangerous
- and how a proper normalization layer keeps correctness intact

## Why Redis Caching Makes Sense in RAG

RAG pipelines are expensive because they repeatedly do the same things:

- embedding generation
- vector retrieval
- context assembly
- LLM inference

For many user questions, especially in internal tools, the answer doesn't change between requests. Redis is a natural fit for this:

- sub-millisecond reads
- TTL-based eviction
- simple operational model
- predictable cost

So the first version of my cache looked like this:

```python
cache_key = hash(user_query)
```

## Text Equality Is Not Intent Equality

You already know why this doesn't work. These queries are clearly the same:

- "Explain docker networking"
- "Can you explain Docker networking?"
- "docker networking explained"

But Redis treats them as different keys. That's when the idea of a normalized query enters the picture.

## What Is a Normalized Query (Really)?

A normalized query is about stripping away presentation noise while preserving intent. The goal is to:

- improve cache hit rate
- without returning wrong answers

Safe normalizations:

- lowercasing
- trimming whitespace
- removing punctuation
- collapsing filler phrases

Dangerous normalizations:

- removing numbers
- collapsing versions
- replacing domain terms
- synonym substitution
- semantic guessing

In RAG, wrong cache hits are worse than cache misses.

## An Example of Normalization Function

```python
import re

FILLER_PHRASES = ["can you", "please", "tell me", "explain"]

def normalize_query(query: str) -> str:
    q = query.lower().strip()
    for phrase in FILLER_PHRASES:
        q = q.replace(phrase, "")
    q = re.sub(r"[^\w\s]", "", q)  # strip punctuation
    q = re.sub(r"\s+", " ", q)     # collapse whitespace
    return q.strip()
```

This intentionally avoids:

- NLP stopword lists
- synonym expansion

Boring. Predictable. Correct.

## A Better Cache Key

Text alone is still not enough. A correct cache key must capture how the answer was produced, not just the question:

```python
cache_key = hash(
    model_name +
    normalized_query +
    retrieval_config
)
```

This prevents:

- reusing answers across models
- mixing retrieval strategies
- silent correctness bugs

## Where Semantic Caching Tempted Me (& Why It's Risky)

At some point, I considered: "What if I reuse answers for similar questions?" This is semantic caching. Example:

```
"How does Redis caching work in RAG?"
"Explain caching strategy for RAG systems"
```

They feel similar. But semantic similarity is probabilistic, not deterministic. For production RAG, that's dangerous. It invites:

- incorrect reuse
- subtle hallucinations
- hard-to-debug failures
- broken trust

## Where Semantic Caching Can Work (Carefully)

Semantic caching is acceptable when:

- questions are FAQs
- answers are generic
- correctness tolerance is high
- a fallback to the exact cache exists

The safe pattern is two-tier caching:

- Exact cache (normalized query)
- Semantic cache (optional, guarded)
- Retrieval fallback

Never semantic-cache authoritative answers.

## The Normalization Layer (The Missing Piece)

The biggest realization for me was this: normalization is not a function; it's a layer. Especially when RAG involves:

- SQL / Athena
- metrics

In those cases, the "query" isn't text anymore. It's intent + constraints. Instead of caching raw SQL, normalize the logical query shape:

```json
{
  "source": "athena",
  "table": "deployments",
  "metrics": ["count"],
  "filters": {
    "status": "FAILED",
    "time_range": "LAST_7_DAYS"
  }
}
```

Then hash a canonical form. This makes caching deterministic.

## What Actually Worked in Practice

My final setup looked like this:

- Redis for the fast cache
- conservative text normalization
- intent-level normalization for structured queries
- no semantic caching for critical paths
- TTL aligned with data freshness

The results:

- ~40% cost reduction
- lower latency
- zero correctness regressions
- predictable behavior

Most importantly, I trusted my system again.

## Takeaways

- Redis caching is easy; correct caching is not
- Normalize form, not meaning
- Over-normalization silently breaks RAG
- Semantic caching should be optional, not default
- Structured queries need intent-level normalization
- Determinism beats cleverness

Caching in RAG isn't about saving tokens. It's about engineering discipline.

## Final Thoughts

If we get normalization right, Redis becomes a superpower. If we don't, caching becomes a liability.

Thanks for reading.

Mahak

p.s. This is a deceptively hard problem, and there's no one-size-fits-all solution. Different RAG setups demand different normalization strategies depending on how context is retrieved, structured, and validated. In my own project, this exact approach didn't work out of the box; the real implementation was far more constrained and nuanced. What I've shared here is the idea and way of thinking that helped me reason about the problem, not a drop-in solution. Production-grade systems inevitably require careful, system-specific trade-offs.
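As a small appendix: the "better cache key" and "hash a canonical form" ideas above can be combined into one deterministic key builder. This is a minimal sketch, not the implementation from my project; the `rag:answer:` prefix, the field names, and the choice of SHA-256 are illustrative assumptions.

```python
import hashlib
import json

def make_cache_key(normalized_query: str, model_name: str, retrieval_config: dict) -> str:
    """Build a deterministic key over everything that shaped the answer.

    json.dumps with sort_keys=True canonicalizes the retrieval config,
    so two logically identical configs produce the same key regardless
    of dict insertion order.
    """
    canonical = json.dumps(
        {"model": model_name, "query": normalized_query, "retrieval": retrieval_config},
        sort_keys=True,
        separators=(",", ":"),
    )
    # "rag:answer:" is an illustrative namespace prefix, not a convention
    return "rag:answer:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With redis-py, the answer would then be stored with something like `r.set(key, answer, ex=ttl_seconds)`, so the TTL tracks data freshness.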
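The tiered lookup itself, with the optional semantic tier deliberately left out, reduces to a very small control flow. A plain dict stands in for Redis so the sketch is self-contained; the function names are made up for illustration.

```python
def answer_with_cache(query, cache, normalize, retrieve_and_generate):
    """Exact-match tier first, full retrieval fallback second.

    `cache` is any mapping; in production it would be Redis with a TTL.
    A guarded semantic tier would slot in between the two steps.
    """
    key = normalize(query)
    cached = cache.get(key)
    if cached is not None:
        return cached            # cache hit: no retrieval, no LLM call
    answer = retrieve_and_generate(query)
    cache[key] = answer          # in Redis: SET with EX for freshness
    return answer
```

Note that the normalized query decides the hit: two surface-different phrasings of the same question trigger the pipeline only once.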