# I Spent $400/Month on a Reranker That Made My RAG Worse
2026-02-05
admin
## The Setup: A Classic RAG Mistake

I thought I was being clever. Our RAG system was hallucinating, and everyone on Twitter was raving about rerankers. Cohere's rerank endpoint looked perfect: connect to their API, make one call, and boom, better results! Three weeks and $400 later, my production metrics were worse than before. Here's what I learned the hard way about when rerankers actually help, and when they merely paper over a broken retrieval system.

Our customer support chatbot was giving increasingly bizarre answers. Questions like "How do I cancel my subscription?" were getting responses about our security features or pricing tiers: loosely related to subscriptions, but completely unhelpful. Our cosine similarity scores (0.87, 0.91, 0.93) suggested there were no issues, so we assumed the problem must be the ordering. Wrong.

## The $400 Band-Aid

I integrated Cohere's rerank endpoint between retrieval and generation. At $0.002 per query and ~15,000 queries per day, that's $30/day, or $900/month in steady state. I started with a smaller test set, but even my initial testing cost $400 before I realized the truth.

The results? Our nDCG@10 dropped from 0.72 to 0.68. Latency increased by 250ms. User satisfaction scores didn't budge. I was paying to make things worse.

Here's what I didn't understand: rerankers improve precision (ordering), not recall (coverage). When I finally measured it properly, my first-stage retrieval had a recall@50 of 0.61. That means 39% of the time, the correct answer wasn't even in my candidate pool.
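Recall@k is cheap to measure once you have a labeled eval set. Here's a minimal sketch; the IDs and retriever output below are toy data, not our production corpus:

```python
# Recall@k: the fraction of queries whose correct chunk shows up
# anywhere in the top-k candidates, regardless of its rank.

def recall_at_k(retrieved_ids, gold_ids, k=50):
    hits = sum(1 for ranked, gold in zip(retrieved_ids, gold_ids)
               if gold in ranked[:k])
    return hits / len(gold_ids)

# Toy data: three queries, each labeled with the ID of its correct chunk
retrieved = [["c1", "c9", "c4"], ["c2", "c7"], ["c5", "c6"]]
gold = ["c9", "c3", "c5"]
print(recall_at_k(retrieved, gold, k=3))  # 2 of 3 answers are in the pool
```

If this number is low, no amount of reordering can save you: the answer simply isn't there to be ranked.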
## The Real Problem: Polishing a Turd

The reranker was doing exactly what it was designed to do: picking the best chunks from the pool I gave it. The problem was that I was handing it a pool of crap. For the cancellation question, I was literally asking it to rank chunks about security features, pricing tiers, and account settings. None of these were about cancellation. The reranker dutifully picked "pricing tiers" as the best match, and our LLM hallucinated an answer about downgrading plans instead of canceling.

The rule I learned: a reranker on bad retrieval is like polishing a turd. You just end up with a shinier turd.

To fix it, I had to kill the reranker and get back to first principles.

## 1. Fixed Chunking (Cost: $0)

Our chunks were too large (800 tokens) and cut mid-sentence. I switched to semantic chunking that respected document structure. This alone improved recall@50 from 0.61 to 0.78.

## 2. Added Semantic Firewall (Cost: $0, Latency: +12ms)

I started measuring Semantic Stress (ΔS), the distance between query intent and chunk relevance. Chunks with ΔS > 0.60 were rejected before they could poison the context. Simple, fast, and actually effective.

## 3. Improved Hybrid Search (Cost: $0)

I had been relying on dense embeddings alone. Adding BM25 for keyword matching improved recall@50 to 0.87. Now my candidate pool actually contained the right answers.

## When Rerankers Actually Help

After fixing retrieval, I tried adding a reranker again. Having been burned by API reranking, though, I started with a self-hosted cross-encoder. Controlling the reranking stage myself would teach me far more, and if it wasn't enough I could always revisit Cohere. This time it worked, because I was reranking a pool where the right answer was actually present.

## The Real Costs Nobody Ever Talks About

Senior management struggles to see the hidden costs of trying to cut costs. Even if Cohere had solved our problem, we would still have gotten pushback once our low-volume tests cost $400 a month. The API wasn't cheap, but adding a single API call to a codebase is trivially simple in engineering terms. The following costs are far harder to quantify, and harder still to get some types of leadership to take seriously. The $400 in API costs was the cheapest part of this mistake.
## Want to Learn More?

This is just one module from my comprehensive RAG debugging course. Check out the full course: RAG Firewall Guide on Gumroad. Plus, grab the free GitHub repo with working code examples:
github.com/jongmoss/rag-firewall-examples

Have you made expensive mistakes optimizing RAG systems? I'd love to hear your war stories in the comments below. And if you're currently debugging a RAG system that's hallucinating despite high similarity scores: go measure your recall@50 and ΔS before you buy that reranker. Trust me. 😅
```python
# Seemed so simple...
candidates = retriever.search(query, k=50)
reranked = cohere.rerank(
    query=query,
    documents=[c['text'] for c in candidates],
    top_n=10,
    model='rerank-english-v2.0'
)
```
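The nDCG@10 numbers above came from a labeled eval set. For reference, the metric itself is easy to compute; a minimal sketch with binary relevance labels (1 = the chunk answers the query), not our exact evaluation code:

```python
import math

def ndcg_at_k(relevance, k=10):
    """nDCG@k for one query; `relevance` lists 0/1 gains in ranked order."""
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(relevance[:k]))
    ideal = sorted(relevance, reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# A perfect ranking scores 1.0; burying the relevant chunk lowers the score
print(ndcg_at_k([1, 0, 0, 0]))             # 1.0
print(round(ndcg_at_k([0, 0, 0, 1]), 3))   # 0.431
```

Note that nDCG only measures ordering within the retrieved list, which is exactly why it can't detect a recall problem.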
```python
# Before: arbitrary 800-token chunks
chunks = naive_split(doc, chunk_size=800)

# After: section-aware chunking
chunks = chunk_by_headers(doc, min_size=200, max_size=500)
```
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_firewall(query: str, chunks: list, threshold: float = 0.60):
    q_emb = model.encode(query, normalize_embeddings=True)
    filtered = []
    for chunk in chunks:
        c_emb = model.encode(chunk['text'], normalize_embeddings=True)
        cosine = float(util.cos_sim(c_emb, q_emb)[0][0])
        delta_s = 1 - cosine  # Lower ΔS = more relevant
        if delta_s < threshold:
            chunk['delta_s'] = delta_s
            filtered.append(chunk)
    return filtered

# Usage
candidates = retriever.search(query, k=50)
safe_chunks = semantic_firewall(query, candidates, threshold=0.60)
```
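A note on hitting that +12ms budget: encoding chunks one at a time in a Python loop is the slow path. The ΔS gate itself is just vector math, so you can batch the chunk embeddings into a single `model.encode` call and filter with NumPy. A sketch of the filtering step (the function name is mine, and it assumes unit-normalized embeddings):

```python
import numpy as np

def filter_by_delta_s(q_emb, c_embs, chunks, threshold=0.60):
    # ΔS = 1 - cosine similarity; valid because the embeddings are
    # unit-normalized (model.encode(..., normalize_embeddings=True)),
    # so a dot product IS the cosine similarity.
    delta_s = 1.0 - np.asarray(c_embs) @ np.asarray(q_emb)
    kept = []
    for chunk, ds in zip(chunks, delta_s):
        if ds < threshold:
            kept.append(dict(chunk, delta_s=float(ds)))
    return kept

# Toy unit vectors: the first chunk matches the query, the second doesn't
q = [1.0, 0.0]
embs = [[1.0, 0.0], [0.0, 1.0]]
kept = filter_by_delta_s(q, embs, [{"text": "match"}, {"text": "noise"}])
print([c["text"] for c in kept])  # ['match']
```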
```python
# Hybrid retrieval
dense_results = dense_retriever.search(query, k=100)
sparse_results = bm25_retriever.search(query, k=100)

# Reciprocal Rank Fusion
combined = reciprocal_rank_fusion([dense_results, sparse_results])
top_50 = combined[:50]
```
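The snippet above leans on a `reciprocal_rank_fusion` helper it never defines. Here's one possible implementation, operating on ranked lists of document IDs (real retrievers return richer result objects, so adapt accordingly); k=60 is the smoothing constant from the original RRF paper:

```python
# Reciprocal Rank Fusion: each list votes 1/(k + rank) for its documents,
# so items ranked highly by several retrievers float to the top.

def reciprocal_rank_fusion(result_lists, k=60):
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["a", "b", "c"]
sparse = ["b", "d", "a"]
print(reciprocal_rank_fusion([dense, sparse])[:2])  # ['b', 'a']
```

RRF is attractive here because it only needs ranks, not scores, so you never have to normalize BM25 scores against cosine similarities.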
```python
from FlagEmbedding import FlagReranker

reranker = FlagReranker('BAAI/bge-reranker-base', use_fp16=True)

def rerank_topk(query: str, candidates: list, out_k: int = 10):
    pairs = [(query, c['text']) for c in candidates]
    scores = reranker.compute_score(pairs, normalize=True)
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return [c for c, _ in ranked[:out_k]]
```
The candidate pool for the cancellation question:

- Chunk about security features (ΔS = 0.72)
- Chunk about pricing tiers (ΔS = 0.68)
- Chunk about account settings (ΔS = 0.71)

Self-hosted cross-encoder vs. the Cohere API:

- Cost: ~$30/month (self-hosted GPU) vs. $400/month (Cohere API)
- Latency: +35ms vs. +250ms
- Quality: nDCG@10 improved from 0.87 to 0.93 (vs. 0.72 → 0.68 before)

The hidden costs:

- Engineering time - 2 weeks debugging why quality dropped
- Opportunity cost - Could have fixed retrieval on day one
- Production incidents - 3 escalations from confused customer support
- Credibility - Having to explain to my VP why we spent money to make things worse

## Key Takeaways

- Measure recall first - If recall@50 < 0.85, don't even think about reranking
- Use ΔS as a gate - Filter out high-stress chunks before they poison your context
- Self-host when possible - Cross-encoders are 30x cheaper than LLM APIs
- Fix fundamentals first - Good chunking and hybrid search >>> expensive rerankers
- Verify with metrics - nDCG improvement should be at least 0.05 to justify the complexity

## The Numbers That Matter

Before the fixes:

- Recall@50: 0.61
- nDCG@10: 0.72
- Average ΔS: 0.68
- Monthly cost: $900
- User satisfaction: 3.2/5

After fixing retrieval (no reranker):

- Recall@50: 0.87
- nDCG@10: 0.87
- Average ΔS: 0.42
- Monthly cost: $30
- User satisfaction: 4.1/5

After adding the self-hosted reranker:

- Recall@50: 0.87 (unchanged)
- nDCG@10: 0.93
- Average ΔS: 0.38
- Monthly cost: $60
- User satisfaction: 4.4/5

The full course covers:

- How to measure and fix Semantic Stress (ΔS)
- Building semantic firewalls to prevent hallucinations
- When (and when not) to use rerankers
- Multi-stage retrieval pipelines that actually work
- Production-grade citation tracking
- A/B testing RAG systems properly