Tools
Redis Caching in RAG: Normalized Queries, Semantic Traps & What Actually Worked
2025-12-28
0 views
admin
Why Redis Caching Makes Sense in RAG ## Text Equality Is Not Intent Equality ## What Is a Normalized Query (Really)? ## An Example of Normalization Function ## A Better Cache Key ## Where Semantic Caching Tempted Me (& Why It’s Risky) ## Where Semantic Caching Can Work (Carefully) ## The Normalization Layer (The Missing Piece) ## What Actually Worked in Practice ## Takeaways ## Final Thoughts When I first added Redis caching to my RAG API, the motivation was simple: latency was creeping up, costs were rising and many questions looked repetitive. Caching felt like the obvious win.
But once I went beyond the happy path, I realized caching in RAG isn’t about Redis at all. It’s about what you choose to cache and how safely you decide two queries are “the same”. This post walks through: RAG pipelines are expensive because they repeatedly do the same things: For many user questions, especially in internal tools:
the answer doesn’t change between requests So the first version of my cache looked like this: Why this doesn't work. You know it. These queries are clearly the same: But Redis treats them as different keys.
That’s when the idea of a normalized query enters the picture. A normalized query about stripping away presentation noise while preserving intent. Dangerous normalizations: This intentionally avoids: Boring. Predictable. Correct. Text alone is still not enough.
A correct cache key must capture how the answer was produced, not just the question. At some point, I considered:
"What if I reuse answers for similar questions?"
This is semantic caching.
Example: They feel similar.
But semantic similarity is probabilistic, not deterministic. For production RAG, that’s dangerous. Semantic caching is acceptable when: The biggest realization for me was this:
Normalization is not a function; it’s a layer. Especially when RAG involves: Then hash a canonical form. My final setup looked like this: Caching in RAG isn’t about saving tokens.
It’s about engineering discipline. If we get normalization right, Redis becomes a superpower.
If we don’t, caching becomes a liability. Thanks for reading.
Mahak p.s. This is a deceptively hard problem, and there’s no one-size-fits-all solution. Different RAG setups demand different normalization strategies depending on how context is retrieved, structured & validated. In my own project, this exact approach didn’t work out of the box, the real implementation was far more constrained & nuanced. What I’ve shared here is the idea and way of thinking that helped me reason about the problem, not a drop-in solution. Production-grade systems inevitably require careful, system-specific trade-offs. Templates let you quickly answer FAQs or store snippets for re-use. Are you sure you want to hide this comment? It will become hidden in your post, but will still be visible via the comment's permalink. Hide child comments as well For further actions, you may consider blocking this person and/or reporting abuse CODE_BLOCK:
cache_key = hash(user_query) Enter fullscreen mode Exit fullscreen mode CODE_BLOCK:
cache_key = hash(user_query) CODE_BLOCK:
cache_key = hash(user_query) COMMAND_BLOCK:
import re FILLER_PHRASES = ["can you", "please", "tell me", "explain"] def normalize_query(query: str) -> str: q = query.lower().strip() for phrase in FILLER_PHRASES: q = q.replace(phrase, "") q = re.sub(r"[^\w\s]", "", q) q = re.sub(r"\s+", " ", q) return q.strip() Enter fullscreen mode Exit fullscreen mode COMMAND_BLOCK:
import re FILLER_PHRASES = ["can you", "please", "tell me", "explain"] def normalize_query(query: str) -> str: q = query.lower().strip() for phrase in FILLER_PHRASES: q = q.replace(phrase, "") q = re.sub(r"[^\w\s]", "", q) q = re.sub(r"\s+", " ", q) return q.strip() COMMAND_BLOCK:
import re FILLER_PHRASES = ["can you", "please", "tell me", "explain"] def normalize_query(query: str) -> str: q = query.lower().strip() for phrase in FILLER_PHRASES: q = q.replace(phrase, "") q = re.sub(r"[^\w\s]", "", q) q = re.sub(r"\s+", " ", q) return q.strip() CODE_BLOCK:
cache_key = hash( model_name + normalized_query + retrieval_config
) Enter fullscreen mode Exit fullscreen mode CODE_BLOCK:
cache_key = hash( model_name + normalized_query + retrieval_config
) CODE_BLOCK:
cache_key = hash( model_name + normalized_query + retrieval_config
) CODE_BLOCK:
"How does Redis caching work in RAG?"
"Explain caching strategy for RAG systems" Enter fullscreen mode Exit fullscreen mode CODE_BLOCK:
"How does Redis caching work in RAG?"
"Explain caching strategy for RAG systems" CODE_BLOCK:
"How does Redis caching work in RAG?"
"Explain caching strategy for RAG systems" CODE_BLOCK:
{ "source": "athena", "table": "deployments", "metrics": ["count"], "filters": { "status": "FAILED", "time_range": "LAST_7_DAYS" }
} Enter fullscreen mode Exit fullscreen mode CODE_BLOCK:
{ "source": "athena", "table": "deployments", "metrics": ["count"], "filters": { "status": "FAILED", "time_range": "LAST_7_DAYS" }
} CODE_BLOCK:
{ "source": "athena", "table": "deployments", "metrics": ["count"], "filters": { "status": "FAILED", "time_range": "LAST_7_DAYS" }
} - why Redis caching works for RAG
- what a normalized query really means
- why semantic caching is tempting but dangerous
- and how a proper normalization layer keeps correctness intact - embedding generation
- vector retrieval
- context assembly
- LLM inference - sub-millisecond reads
- TTL-based eviction
- simple operational model
- predictable cost - "Explain docker networking"
- "Can you explain Docker networking?"
- "docker networking explained" - improve cache hit rate
- without returning wrong answers - lowercasing
- trimming whitespaces
- removing punctuation
- collapsing filler phrases - removing numbers
- collapsing versions
- replacing domain terms
- synonym substitution
- semantic guessing
In RAG, wrong cache hits are worse than cache misses. - NLP stopword lists
- synonym expansion - reusing answers across models
- mixing retrieval strategies
- silent correctness bugs - incorrect reuse
- subtle hallucinations
- hard-to-debug failures
- broken trust - questions are FAQs
- answers are generic
- correctness tolerance is high
- fallback to exact cache exists
- The safe pattern is two-tier caching:
- Exact cache (normalized query)
- Semantic cache (optional, guarded)
- Retrieval fallback
Never semantic-cache authoritative answers. - SQL / Athena
- metrics
In those cases, the “query” isn’t text anymore.
It’s intent + constraints.
Instead of caching raw SQL, normalize the logical query shape: - This makes caching:
- deterministic - Redis for fast cache
- conservative text normalization
- intent-level normalization for structured queries
- no semantic caching for critical paths
- TTL aligned with data freshness - ~40% cost reduction
- lower latency
- zero correctness regressions
- predictable behavior
- Most importantly, I trusted my system again. - Redis caching is easy — correct caching is not
- Normalize form, not meaning
- Over-normalization silently breaks RAG
- Semantic caching should be optional, not default
- Structured queries need intent-level normalization
- Determinism beats cleverness
how-totutorialguidedev.toaillmnlpnetworknetworkingdocker