Stop Feeding "Junk" Tokens to Your LLM (I Built a Proxy to Fix It)
2026-01-18
I recently built an agent to handle some SRE tasks—fetching logs, querying databases, searching code. It worked, but when I looked at the traces, I was annoyed. It wasn't just that it was expensive (though the bill was climbing). It was the sheer inefficiency. I looked at a single tool output—a search for Python files. It was 40,000 tokens. About 35,000 of those tokens were just `"type": "file"` and `"language": "python"` repeated 2,000 times. We are paying premium compute prices to force state-of-the-art models to read standard JSON boilerplate.

I couldn't find a tool that solved this without breaking the agent, so I wrote one. It's called Headroom: a context optimization layer that sits between your app and your LLM. It compresses context by ~85% without losing semantic meaning. It's open source (Apache-2.0). If you just want the code: github.com/chopratejas/headroom

## Why Truncation and Summarization Don't Work

When your context window fills up, the standard industry solution is truncation: chopping off the oldest messages or the middle of the document. But for an agent, truncation is dangerous:

- If you chop the middle of a log file, you might lose the one error line that explains the crash.
- If you chop a file list, you might lose the exact config file the user asked for.

I tried summarization (using a cheaper model to summarize the data first), but that introduced hallucination. I had a summarizer tell me a deployment "looked fine" because it ignored specific error codes in the raw log. I needed a third option: lossless compression. Or at least, "intent-lossless."

## The Core Idea: Statistical Analysis, Not Blind Truncation

I realized that 90% of the data in a tool output is just schema scaffolding. The LLM doesn't need to see `status: active` repeated a thousand times. It needs the anomalies. Headroom's SmartCrusher runs statistical analysis before touching your data:

1. Constant factoring: If every item in an array has `"type": "file"`, it doesn't repeat that 2,000 times. It extracts constants once.
2. Outlier detection: It calculates the standard deviation of numerical fields and preserves the spikes—the values that are >2σ from the mean. Those are usually what matters.
3. Error preservation: Hard rule: never discard strings that look like stack traces, error messages, or failures. Errors are sacred.
4. Relevance scoring: If you searched for "auth", items containing "auth" get preserved. SmartCrusher uses BM25 + semantic embeddings (hybrid scoring) to match items against the user's query context.
5. First/last retention: Always keeps the first few and last few items. The LLM expects to see some examples, and recency matters.

The result: 40,000 tokens → 4,000 tokens. Same information density. No hallucination risk.

## CCR: Making Compression Reversible

Here's the insight that changed everything: compression should be reversible. I call the architecture CCR (Compress-Cache-Retrieve).

## 1. Compress

SmartCrusher compresses the tool output from 2,000 items to 20.

## 2. Cache

The original 2,000 items are cached locally (5-minute TTL, LRU eviction).

## 3. Retrieve

Headroom injects a tool called `headroom_retrieve()` into the LLM's context. If the model looks at the compressed summary and decides it needs more data—maybe the user asked a follow-up question—it can call that tool. Headroom fetches from the cache and returns the relevant items.

This changes the risk calculus. You can compress aggressively (90%+) because nothing is ever truly lost. The model can always "unzip" what it needs. I've had conversations like this:

```text
Turn 1: "Search for all Python files"
  → 1000 files returned, compressed to 15
Turn 5: "Actually, what was that file handling JWT tokens?"
  → LLM calls headroom_retrieve("jwt")
  → Returns jwt_handler.py from cached data
```

No extra API calls. No "sorry, I don't have that information anymore."

## TOIN: The Network Effect

Here's where it gets interesting. Headroom learns from compression patterns. TOIN (Tool Output Intelligence Network) tracks—anonymously—what happens after compression:

- Which fields get retrieved most often?
- Which tool types have high retrieval rates?
- What query patterns trigger retrievals?

This data feeds back into compression recommendations. If TOIN learns that users frequently retrieve `error_code` fields after compression, it tells SmartCrusher to preserve `error_code` more aggressively next time. The telemetry is anonymous by construction:

- No actual data values stored
- Tool names are structure hashes
- Field names are SHA256[:8] hashes
- No user identifiers

The network effect: more users → more compression events → better recommendations for everyone.

## Memory: Cross-Conversation Learning

Agents often need to remember things across conversations. "I prefer dark mode." "My timezone is PST." "I'm working on the auth refactor." Headroom has a memory system that extracts and stores these facts automatically.

Fast Memory (Recommended): zero extra latency. The LLM outputs a `<memory>` block inline with its response. Headroom parses it out and stores the memory.

```python
from headroom.memory import with_fast_memory

client = with_fast_memory(OpenAI(), user_id="alice")
# Memories extracted automatically from responses
# Injected automatically into future requests
```

Background Memory: a separate LLM call extracts memories asynchronously. More accurate, but adds latency.

```python
from headroom import with_memory

client = with_memory(OpenAI(), user_id="alice")
```

Memories are stored locally (SQLite) and injected into future conversations. The model remembers that Bob prefers dark mode without you managing state.

## The Transform Pipeline

Headroom runs four transforms on each request.

## 1. CacheAligner

LLM providers offer cached-token pricing (Anthropic: 90% off, OpenAI: 50% off). But caching only works if your prompt prefix is stable. Problem: your system prompt probably has a timestamp, like `Current time: 2024-01-15 10:32:45`. That breaks caching. CacheAligner extracts dynamic content and moves it to the end, stabilizing the prefix. Same information, better cache hits.

## 2. SmartCrusher

The statistical compression engine. Analyzes arrays, detects patterns, preserves anomalies, factors constants.

## 3. ContentRouter

Different content needs different compression. Code isn't JSON isn't logs isn't prose. ContentRouter uses ML-based content detection to route data to specialized compressors:

- Code → AST-aware compression (tree-sitter)
- JSON → SmartCrusher
- Logs → LogCompressor (clusters similar messages)
- Text → optional LLMLingua integration (20x compression, adds latency)

## 4. RollingWindow

When context exceeds the model limit, something has to go. RollingWindow drops the oldest tool calls and responses together (it never orphans data) and preserves the system prompt and recent turns.

## Three Ways to Use It

## Option 1: Proxy Server (Zero Code Changes)

```bash
pip install headroom-ai
headroom proxy --port 8787
```

Point your OpenAI client to http://localhost:8787/v1. Done. Works with Claude Code, Cursor, any OpenAI-compatible client.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8787/v1")
# No other changes
```

## Option 2: SDK Wrapper

```python
from headroom import HeadroomClient
from openai import OpenAI

client = HeadroomClient(OpenAI())

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    headroom_mode="optimize"  # or "audit" or "simulate"
)
```

The three modes:

- audit: Observe only. Logs what would be optimized, doesn't change anything.
- optimize: Apply compression. This is what saves tokens.
- simulate: Dry run. Returns the optimized messages without calling the API.

Start with audit to see potential savings, then flip to optimize when you're confident.

## Option 3: Framework Integrations

LangChain:

```python
from langchain_openai import ChatOpenAI
from headroom.integrations.langchain import HeadroomChatModel

base_model = ChatOpenAI(model="gpt-4o")
model = HeadroomChatModel(base_model, mode="optimize")

# Use in any chain or agent
chain = prompt | model | parser
```

Agno:

```python
from agno.agent import Agent
from headroom.integrations.agno import HeadroomAgnoModel

model = HeadroomAgnoModel(original_model, mode="optimize")
agent = Agent(model=model, tools=[...])
```

MCP (Model Context Protocol):

```python
from headroom.integrations.mcp import compress_tool_result

# Compress any tool result before returning to LLM
compressed = compress_tool_result(tool_name, result_data)
```

## Real Numbers

I've been running this in production for months. On my workloads, the token reduction holds at roughly the ~85% described above (for example, that 40,000-token file search compressing to 4,000 tokens).

## What's Coming Next

This is actively maintained. On the roadmap:

- CrewAI integration
- AutoGen integration
- Semantic Kernel integration
- Cloud-hosted TOIN backend (opt-in)
- Cross-device memory sync
- Team-shared compression patterns
- Domain-specific profiles (SRE, coding, data analysis)
- Custom compressor plugins
- Streaming compression for real-time tools

## Why I Built This

I'm a believer that we're in the "optimization phase" of the AI hype cycle. Getting things to work is table stakes; getting them to work cheaply and reliably is the actual engineering work. Headroom is my attempt to fix the "context bloat" problem properly. Not with heuristics or truncation, but with statistical analysis and reversible compression. It runs entirely locally. No data leaves your machine (except to OpenAI/Anthropic as usual). Apache-2.0 licensed.

Repo: github.com/chopratejas/headroom

If you find bugs or have ideas, open an issue. I'm actively maintaining this.
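To make the constant-factoring and outlier rules concrete, here's a toy sketch of the idea. This is my illustration, not Headroom's actual SmartCrusher: `crush`, its 2σ threshold, and the first/last-3 retention are invented for the example.

```python
from statistics import mean, pstdev

def crush(items, keep_edges=3):
    """Toy sketch: factor out constant fields, keep numeric outliers
    (> 2 sigma from the mean) plus the first and last few items."""
    if not items:
        return {"constants": {}, "items": []}

    # 1. Constant factoring: fields identical in every item are stated once.
    constants = {
        k: v for k, v in items[0].items()
        if all(it.get(k) == v for it in items)
    }

    # 5. First/last retention: always keep the edges of the array.
    kept = set(range(keep_edges)) | set(range(len(items) - keep_edges, len(items)))

    # 2. Outlier detection: preserve numeric values > 2 sigma from the mean.
    numeric_keys = {
        k for k in items[0]
        if k not in constants
        and all(isinstance(it.get(k), (int, float)) for it in items)
    }
    for k in numeric_keys:
        vals = [it[k] for it in items]
        mu, sigma = mean(vals), pstdev(vals)
        for i, v in enumerate(vals):
            if sigma and abs(v - mu) > 2 * sigma:
                kept.add(i)

    compressed = [
        {k: v for k, v in items[i].items() if k not in constants}
        for i in sorted(kept)
    ]
    return {"constants": constants, "items": compressed}
```

On 2,000 near-identical file entries, the constants dict replaces thousands of repeated key/value pairs, while any anomalous size or count survives as an outlier.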
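The Compress-Cache-Retrieve loop can likewise be sketched with a small TTL + LRU cache. The class and method names here are assumptions for illustration, and a naive substring match stands in for the BM25/embedding scoring; this is not Headroom's real cache.

```python
import time
from collections import OrderedDict

class RetrievalCache:
    """Toy CCR cache: originals go in at compression time; a retrieve
    tool call can pull back items the compressed view dropped."""

    def __init__(self, max_entries=128, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self._store = OrderedDict()  # call_id -> (stored_at, items)

    def put(self, call_id, items):
        self._store[call_id] = (time.time(), items)
        self._store.move_to_end(call_id)
        while len(self._store) > self.max_entries:
            # LRU eviction: drop the least recently used entry.
            self._store.popitem(last=False)

    def retrieve(self, call_id, query):
        entry = self._store.get(call_id)
        if entry is None or time.time() - entry[0] > self.ttl:
            return []  # expired (TTL) or never cached
        self._store.move_to_end(call_id)
        # Stand-in for hybrid BM25 + embedding relevance scoring.
        return [it for it in entry[1] if query.lower() in str(it).lower()]
```

The point of the sketch is the risk calculus: because `put` runs before compression, an aggressive 90%+ compression never destroys information within the TTL window.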
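The prefix-stabilization trick behind CacheAligner is easy to demonstrate. This sketch assumes a timestamp of the form shown earlier; the regex, function name, and message layout are illustrative, not Headroom's implementation.

```python
import re

# Volatile substrings like timestamps break provider prompt caching,
# because the cached prefix must be byte-identical across requests.
TIMESTAMP = re.compile(r"Current time: \S+ \S+")

def align_for_cache(messages):
    """Move dynamic content out of the system prompt to the end of the
    conversation, so the prompt prefix stays stable across requests."""
    system = messages[0]["content"]
    dynamic = TIMESTAMP.findall(system)
    stable = TIMESTAMP.sub("", system).rstrip()
    aligned = [{"role": "system", "content": stable}] + messages[1:]
    if dynamic:
        # Same information, but it no longer invalidates the cached prefix.
        aligned.append({"role": "user", "content": " ".join(dynamic)})
    return aligned
```

Two requests minutes apart now share an identical system message, which is the precondition for Anthropic's 90%-off and OpenAI's 50%-off cached-token pricing to apply.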
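Finally, the fast-memory flow amounts to parsing an inline tagged block out of the model's reply. The `<memory>...</memory>` tag format and function name here are assumptions for illustration, not Headroom's exact wire format.

```python
import re

MEMORY_BLOCK = re.compile(r"<memory>(.*?)</memory>", re.DOTALL)

def extract_memories(response_text):
    """Split a model reply into (clean_text, memories): the visible
    reply with memory blocks stripped, plus the extracted facts.
    Zero extra latency, since no second LLM call is needed."""
    memories = [m.strip() for m in MEMORY_BLOCK.findall(response_text)]
    clean = MEMORY_BLOCK.sub("", response_text).strip()
    return clean, memories
```

The proxy would store the extracted strings (e.g., in SQLite, keyed by user) and prepend them to future requests, which is how "Bob prefers dark mode" survives across conversations.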