# Optimizing AI Agent Memory: Tiered Context and Aggressive Compaction

Source: Dev.to

## The Problem: Context Window Bloat

Running an AI assistant in long-running sessions creates a context management problem that most implementations don't really solve. The model's context window fills up with conversation history, you hit the token limit, and then you either truncate aggressively and lose continuity, or you keep everything and pay for a massive cached context on every turn.

I'm running OpenClaw for an AI assistant that handles long sessions, and the default conversation compaction settings weren't aggressive enough. The agent was hitting compaction after hours of conversation and racking up costs from tens of thousands of cached tokens on every turn, most of which weren't relevant to what was actually being asked. Here's what I changed and why it works.

Most AI agent setups load a base set of instructions into every prompt: personality, operating rules, tool documentation, memory, whatever else you want the AI to remember. OpenClaw calls these "workspace files" and injects them automatically at the start of every conversation.

This works fine for short sessions. It breaks down when you're running the same agent for hours or days at a time, because you end up with a growing pile of context:

- Workspace files (instructions, personality, rules)
- Conversation history (every message, every tool call, every result)
- Memory files (if you're loading them all up front)

The conversation history is the real killer. After a few hours of back and forth, you can easily have 50k+ tokens of history sitting there. Claude caches aggressively, so you're not paying full price for those tokens on every turn, but you're still paying cache read costs, and they still count toward the 200k limit. When you finally hit the limit, OpenClaw triggers compaction.
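The compaction step itself can be sketched roughly like this. This is an illustrative sketch, not OpenClaw's actual implementation; `summarize` stands in for a model call that condenses the older turns:

```python
def compact(messages, summarize, keep_recent=10):
    """Replace older history with a single summary block, keeping recent turns verbatim."""
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    if not old:
        return messages  # not enough history to be worth compacting
    summary = summarize(old)  # one model call that condenses the old turns
    summary_msg = {"role": "system",
                   "content": "[Summary of earlier conversation]\n" + summary}
    return [summary_msg] + recent
```

The original messages are discarded in the process: anything `summarize` fails to capture is gone, which is why writing details to durable memory before compaction matters.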
It summarizes the conversation history into a shorter block and replaces the original messages with the summary. This works, but if you only compact when you're about to hit 200k tokens, you've been dragging around a huge context for far longer than necessary.

## The Solution: Tiered Memory and Early Compaction

I rebuilt the agent's context management around two changes.

## 1. Move Most Context Out of Auto-Loaded Files

I went through every workspace file and moved detailed content into separate memory files that only get loaded on demand via semantic search. The workspace files now total about 7KB combined. They contain:

- Core operating rules (AGENTS.md)
- Tool notes specific to my setup (TOOLS.md)
- Identity and personality basics (SOUL.md, IDENTITY.md)
- User preferences (USER.md)
- Current heartbeat tasks (HEARTBEAT.md)

Everything else went into the memory/ directory:

- Detailed memories from past sessions
- Writing style guides
- Operating principles and delegation patterns
- Daily session logs
- Project-specific context
- Technical documentation

When the agent needs detailed information, it runs a semantic search across the memory files and loads only the relevant chunks. This keeps the base context small and pulls in additional context only when it's actually needed. The workspace files explicitly enforce search-before-answer discipline:

```markdown
## Memory Strategy
- **Daily:** `memory/YYYY-MM-DD.md` for session logs
- **Long-term:** `memory/MEMORY.md` via `memory_search` (NOT auto-loaded)
- **Write it down** - memory doesn't persist between sessions
- **Write it down NOW** - don't wait for compaction
- **Search before answering** - if a question touches anything discussed earlier, do a memory_search first
```

This forced a discipline shift. Instead of relying on conversation history being available, the agent writes important details to memory files immediately and searches them when needed.

## 2. Tighten Compaction Settings

OpenClaw has a compaction configuration block that controls when and how conversation history gets summarized. Here's what I changed:

```json
{
  "compaction": {
    "mode": "safeguard",
    "reserveTokensFloor": 120000,
    "memoryFlush": {
      "enabled": true,
      "softThresholdTokens": 50000
    }
  }
}
```

`mode: "safeguard"` uses chunked summarization instead of truncating. It breaks the conversation into segments, summarizes each one, and reassembles them. This preserves more continuity than just dropping old messages.

`reserveTokensFloor: 120000` is the big one. This sets how many tokens to keep free, which determines when compaction triggers. The default was 20k, which meant compaction only kicked in when you were nearly at the 200k limit. Setting it to 120k means compaction fires at around 80k tokens used, keeping the active context window much smaller.

`softThresholdTokens: 50000` triggers a memory flush when context hits 50k tokens. This is a softer checkpoint - I write any pending details to durable memory files before the context gets any bigger.
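Numerically, `reserveTokensFloor` works out like this (a quick back-of-the-envelope check, assuming the trigger point is simply the context limit minus the reserve):

```python
CONTEXT_LIMIT = 200_000  # Claude's context window

def compaction_trigger(reserve_tokens_floor: int) -> int:
    # Compaction fires once used tokens reach the limit minus the reserve.
    return CONTEXT_LIMIT - reserve_tokens_floor

print(compaction_trigger(20_000))   # default reserve: fires near 180k tokens used
print(compaction_trigger(120_000))  # this setup: fires around 80k tokens used
```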
This prevents losing details that were mentioned in conversation but not yet committed to storage. `memoryFlush.enabled: true` ensures memory gets flushed before compaction runs, as a safety net.

## The Tradeoff: Shorter History, Better Discipline

More frequent compaction means less conversational continuity in context. If something was discussed an hour ago, it's probably been summarized by now. The agent can't just scroll back through the conversation to find details; it has to search memory files.

This is the tradeoff: lower per-turn token costs and faster responses, but the AI has to be more deliberate about what it remembers. It can't rely on passive recall from conversation history; it has to actively write things down and search for them later.

In practice this works better than expected. The explicit instructions to write details to memory immediately create a forcing function. The agent doesn't wait for compaction to decide what's important; it writes things down as we go. When I ask about something from earlier in the day or from a past session, it runs a memory search and pulls the relevant context.

The failure mode is when the agent forgets to write something down and then conversation history gets compacted. That detail is gone unless it made it into the summary. But this hasn't been a major problem, because the instructions are clear: write it down now, search before answering.

## The Numbers

Here's what the setup looks like in practice:

- Workspace files: ~7KB total (AGENTS.md, SOUL.md, TOOLS.md, IDENTITY.md, USER.md, HEARTBEAT.md)
- Memory files: loaded on demand via semantic search, not counted in base context
- Context window: 200k tokens (Claude Opus)
- Memory flush threshold: 50k tokens (soft checkpoint to write durable memories)
- Compaction threshold: 120k reserved tokens (triggers when context hits ~80k used)
- Result: compaction happens roughly every 30-60 minutes of active conversation instead of once every few hours

Per-turn token costs dropped significantly after these changes. The cached context is smaller, compaction happens more frequently so history doesn't pile up, and memory files only get loaded when relevant. Response latency improved slightly because there's less context to process on each turn. Not a huge difference, but noticeable when you're using it all day.
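Taken together, the soft threshold and the reserve floor amount to a two-stage check per turn. Here's a hedged sketch of that logic (my reading of the behavior described above, not OpenClaw's actual code; `flush_to_memory` and `compact` are hypothetical callbacks):

```python
CONTEXT_LIMIT = 200_000
SOFT_THRESHOLD = 50_000   # softThresholdTokens
RESERVE_FLOOR = 120_000   # reserveTokensFloor

def manage_context(used_tokens, pending_notes, flush_to_memory, compact):
    """Run after each turn: flush memory early, compact before the limit is near."""
    if used_tokens >= SOFT_THRESHOLD and pending_notes:
        # Soft checkpoint: persist pending details to durable memory files.
        flush_to_memory(list(pending_notes))
        pending_notes.clear()
    if used_tokens >= CONTEXT_LIMIT - RESERVE_FLOOR:
        # Hard checkpoint: flush anything left, then summarize the history.
        if pending_notes:
            flush_to_memory(list(pending_notes))
            pending_notes.clear()
        compact()
```

Two safety nets instead of one: details get written down at 50k, and compaction fires at roughly 80k used.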
## What This Doesn't Solve

This setup works well for this use case (long-running sessions, lots of back and forth, a need to reference past conversations). It doesn't solve every context management problem.

If you need perfect conversational continuity across hours of dialogue, this isn't it. Compaction loses nuance. The summaries are good, but they're still summaries. If you're doing something where every detail of the conversation matters, you probably want to keep more history in context and pay the token costs.

If your agent setup is mostly short sessions (a few minutes each), this is overkill. The default settings are fine when you're not hitting compaction regularly.

If you don't have a good semantic search system for memory files, the on-demand loading doesn't work as well. OpenClaw has `memory_search` built in, so the agent can just search and load relevant chunks. If you're building this yourself, you need to implement something similar, or the AI won't know how to find the information it wrote down.

## Configuration Example

Here's the full OpenClaw compaction config we're using:

```json
{
  "compaction": {
    "mode": "safeguard",
    "reserveTokensFloor": 120000,
    "memoryFlush": {
      "enabled": true,
      "softThresholdTokens": 50000
    }
  }
}
```

`reserveTokensFloor: 120000` means compaction triggers after about 80k tokens of use. `softThresholdTokens: 50000` adds an earlier checkpoint where I flush important context to durable memory before compaction even runs. Two safety nets instead of one.

## Key Takeaways

If you're running an AI agent in long sessions and paying attention to token costs, here's what worked:

- Split context into always-loaded and on-demand. Keep workspace files minimal, and move detailed content into searchable memory files.
- Trigger compaction earlier. Don't wait until you're at the context limit; compact more frequently to keep the active context window smaller.
- Flush memory before compaction. Make sure anything important gets written to durable storage before conversation history gets summarized.
- Force memory discipline. Give explicit instructions to write details down immediately and search before answering, rather than relying on passive recall from conversation history.

The tradeoff is shorter conversational continuity in exchange for lower token costs and better long-term recall. For this use case, that's the right trade. Your setup might be different.

If you're running OpenClaw or building something similar, this configuration might be worth trying. If you're using a different platform, the principles should translate: keep base context small, compact aggressively, write to durable memory early, search when you need details.