Tools: The lost-in-the-middle problem and why retrieval beats stuffing

Source: Dev.to

Your agent has a 200K token context window. So you dump everything in there: MEMORY.md, daily logs, project notes, old conversations, and figure the model will sort it out.

It won't.

## The research says your middle context is a dead zone

In 2023, researchers from Stanford, UC Berkeley, and Samaya AI published "Lost in the Middle: How Language Models Use Long Contexts." They tested models on tasks where the relevant information was placed at different positions in the input. The results were consistent: models performed best when key information appeared at the very beginning or the very end of the context. Information in the middle got ignored.

This wasn't a fluke finding. Nelson Liu and the team tested across multiple model families and context lengths. Performance degraded significantly, sometimes by 20% or more, when the answer was buried in the middle third of the input. Google DeepMind followed up with similar findings, as did Anthropic's own internal research on Claude's attention patterns. The pattern holds: long context doesn't mean good context.

## What this means for your agent

If you're loading 50KB of MEMORY.md into every session, here's what actually happens:

- The model reads the first few thousand tokens carefully
- Attention drops off through the middle
- It picks back up near the end, where your actual conversation starts

That preference you stored six months ago about using TypeScript? It's sitting in paragraph 47 of your memory file. The model probably won't notice it when it matters.

The math makes it worse. A 50KB MEMORY.md is roughly 12,500 tokens. At $3 per million input tokens (Claude Sonnet pricing), that's about $0.04 per session just to load memories your agent might not even use. Run 50 sessions a day and you're spending roughly $2/day on context that's partially invisible to the model.
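The per-session figures above are easy to check. A quick sketch, assuming the article's implied ~4 bytes per token (50KB to ~12,500 tokens) and $3 per million input tokens:

```python
# Back-of-envelope cost of stuffing a 50KB MEMORY.md into every session.
# Assumptions: ~4 bytes per token (matching the article's 50KB -> ~12,500
# tokens figure) and $3 per million input tokens (Claude Sonnet input rate).

MEMORY_BYTES = 50 * 1000          # 50KB memory file
BYTES_PER_TOKEN = 4               # rough heuristic, not an exact tokenizer
PRICE_PER_MTOK = 3.00             # dollars per million input tokens

tokens = MEMORY_BYTES / BYTES_PER_TOKEN            # 12,500 tokens
cost_per_session = tokens * PRICE_PER_MTOK / 1e6   # ~$0.0375, i.e. "about $0.04"
cost_per_day = 50 * cost_per_session               # ~$1.88 at 50 sessions/day

print(f"{tokens:.0f} tokens, ${cost_per_session:.4f}/session, ${cost_per_day:.2f}/day")
```

The real per-session cost depends on the tokenizer and model pricing, but the order of magnitude holds: you pay for every loaded token whether or not the model attends to it.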
## Stuffing vs. retrieval: a real comparison

Stuffing approach (MEMORY.md):

- Load everything every session: ~12,500 tokens
- Model sees all memories but attends unevenly to them
- Cost: $0.04 per session regardless of relevance
- Old memories compete with new ones for attention

Retrieval approach (MemoClaw recall):

- Query for relevant memories: 5-10 results, ~500-1,000 tokens
- Model sees only what's relevant to the current conversation
- Cost: $0.005 per recall + ~$0.003 in input tokens
- Important memories surface when they're actually needed

The retrieval approach uses roughly 8% of the tokens and puts them where the model actually pays attention: right before the conversation starts.

## Why "just use a bigger context window" doesn't fix this

Every few months, someone announces a longer context window. Gemini hit 1M tokens. Claude went to 200K. GPT-4 Turbo did 128K. And every time, people assume the memory problem is solved.

It isn't. Longer windows don't change the attention distribution; they make the middle-zone problem worse, because there's more middle to lose things in. A 1M token context with your answer at position 500K is worse than a 4K context with your answer at position 2K.

The lost-in-the-middle researchers tested this explicitly. Extending context length didn't improve retrieval from the middle. It just gave models more text to skim past.

The fix isn't bigger contexts. It's smaller, targeted contexts with the right information.

## What actually works

With MemoClaw, instead of loading everything, you recall what's relevant:

```shell
memoclaw recall "user's TypeScript preferences"
```

You get back 5-10 semantically matched memories, and you inject them at the start of your prompt. The model sees exactly what it needs, right where it pays the most attention.

For an OpenClaw agent, this looks like:

- Session starts
- Agent calls recall with a query about the current task
- Gets back relevant memories (preferences, past decisions, corrections)
- Those go into the system prompt, before the conversation
- Agent works with full context on what matters, zero noise from six months of irrelevant notes

The token cost drops from ~12,500 to ~800. The relevant information moves from "somewhere in the middle" to "right at the top." The model stops missing things.

## The numbers

Here's a side-by-side for an agent running 30 sessions per day over a month (900 sessions):

| | Stuffing | Retrieval |
|---|---|---|
| Tokens per session | ~12,500 | ~800 |
| Cost per session | ~$0.04 | ~$0.008 |
| Cost per month | ~$34 | ~$7 |

You save about $27/month per agent, and your agent actually remembers the things that matter.

## Start with the expensive memories first

You don't have to migrate everything at once. Start with the memories your agent keeps forgetting:

- User corrections ("I prefer tabs over spaces" stored with importance 0.9)
- Project-specific context that only matters for one workspace
- Preferences that were set months ago and keep getting lost in the file

Move those to MemoClaw, keep the rest in MEMORY.md for now, and see if your agent starts getting things right more often.

If you've got an OpenClaw agent running, install the skill and run a migration:

```shell
memoclaw migrate ~/path/to/MEMORY.md --namespace my-project
```

Your context window is expensive real estate. Stop filling it with things the model won't read.
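The recall-then-inject loop described above can be sketched in a few lines. The `recall` helper here is a hypothetical stand-in for MemoClaw's semantic search (it just scores stored memories by keyword overlap with the query), but it shows the part that matters: recalled memories land at the top of the prompt, where the model attends most.

```python
# Sketch of recall-then-inject. `recall` is NOT MemoClaw's real API;
# it's a toy stand-in that ranks memories by word overlap with the query.

def recall(memories: list[str], query: str, k: int = 5) -> list[str]:
    """Return the k stored memories that best match the query."""
    q = set(query.lower().split())
    scored = sorted(memories, key=lambda m: -len(q & set(m.lower().split())))
    return scored[:k]

def build_prompt(memories: list[str], query: str, task: str) -> str:
    """Put the recalled memories first, before the conversation starts."""
    relevant = recall(memories, query)
    header = "\n".join(f"- {m}" for m in relevant)
    return f"Relevant memories:\n{header}\n\nTask: {task}"

memories = [
    "User prefers TypeScript over JavaScript for new projects",
    "User prefers tabs over spaces",
    "Deploy target is Cloudflare Workers",
]
print(build_prompt(memories, "TypeScript preferences", "scaffold a new service"))
```

In production you would replace the overlap scoring with an embedding search (which is what "semantically matched" implies), but the prompt layout, a short relevant-memories block ahead of the task, is the whole trick.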
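The monthly comparison is straightforward arithmetic from the per-session figures. A sketch, assuming the stated rates ($3/MTok input for stuffing, $0.005 per recall plus ~$0.003 of input tokens for retrieval) and 30 sessions a day for 30 days:

```python
# Monthly cost: stuffing 12,500 tokens every session vs. recalling ~800.
# All rates are the article's figures, not current published pricing.

SESSIONS = 30 * 30                        # 30 sessions/day over a 30-day month
stuffing = SESSIONS * (12_500 * 3 / 1e6)  # ~$33.75/month
retrieval = SESSIONS * (0.005 + 0.003)    # ~$7.20/month
savings = stuffing - retrieval            # ~$26.55, i.e. "about $27/month"

print(f"stuffing ${stuffing:.2f}, retrieval ${retrieval:.2f}, saved ${savings:.2f}")
```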
- Liu et al., "Lost in the Middle: How Language Models Use Long Contexts" (2023). arXiv:2307.03172
- Pricing based on Anthropic Claude 3.5 Sonnet rates as of early 2026.