Tools: I Tested 100 SOUL.md Configurations — Here's What Actually Works

Tools: I Tested 100 SOUL.md Configurations — Here's What Actually Works

Source: Dev.to

The Experiment Setup ## Finding #1: Optimal SOUL.md Length Is 800-1,200 Words ## Finding #2: The Five Sections That Matter Most ## Finding #3: Specific Examples Beat Abstract Rules ## Finding #4: Modal Instructions Dramatically Improve Versatility ## Finding #5: Memory Integration Is a Force Multiplier ## Finding #6: Negative Instructions Are More Effective Than Positive Ones ## Finding #7: The "Personality Tax" Is Real But Small ## Finding #8: Update Frequency Matters ## The Top 5 Configurations That Performed Best ## 1. The Specialist (Best for focused technical work) ## 2. The Adaptive Expert (Best for varied workflows) ## 3. The Pair Programmer (Best for collaborative coding) ## 4. The Ops Guardian (Best for infrastructure/DevOps) ## 5. The Research Analyst (Best for data and analysis) ## Practical Takeaways ## Get Started Fast Over the past three months, I've been running a systematic experiment. I created, tested, and refined 100 different SOUL.md configurations for OpenClaw agents across a range of use cases — from solo dev workflows to team-based project management. I tracked response quality, task completion rates, error frequency, and how often I had to correct the agent. The results were surprising, sometimes counterintuitive, and genuinely useful. Here's what the data says about building effective AI agents. I'm not claiming this is a peer-reviewed study. But 2,000 task evaluations across 100 configurations gives us real patterns to work with. This was the clearest signal in the data. Too short and the agent lacks context. Too long and critical instructions get diluted in the noise. The sweet spot is 800-1,200 words — enough to be comprehensive without overwhelming the context window. The drop-off above 2,000 words was notable. Longer SOUL.md files often contained contradictory instructions that confused the agent. Not all SOUL.md sections are created equal. I tested configurations with different section combinations and measured the impact of each. Impact ranking by section (measured by improvement in task completion): The tech stack section being #1 surprised me. But it makes sense — when your agent knows your exact tools, it stops suggesting irrelevant alternatives. Every suggestion is immediately actionable. The decision framework at #2 was the real revelation. Most people skip this section entirely, but it had the second-highest impact. When agents have clear principles for handling ambiguity, they make dramatically fewer wrong calls. Configurations that included concrete examples outperformed abstract-only instructions by 23% on consistency scores. The abstract version is technically correct but gives the agent too much room for interpretation. The example-backed version creates a shared understanding of what "clean code" actually means in your context. Configurations with mode-specific instructions (different behavior for code review vs. debugging vs. brainstorming) scored 31% higher on relevance compared to single-mode configurations. The best-performing pattern: This works because different tasks genuinely require different approaches. You don't want your agent to brainstorm with the same caution it uses for production deployments. Configurations that referenced a memory system (MEMORY.md, daily notes) showed 28% fewer repeated corrections across sessions. The pattern that worked best: Without this, every session starts from zero. With it, your agent accumulates knowledge and gets better over time. This is the difference between a tool and a partner. This was counterintuitive. "Don't do X" outperformed "Do Y" for boundary-setting by a significant margin. The best approach uses both, but if you're choosing one, negative instructions are more reliable for safety-critical boundaries. Adding personality traits (humor, warmth, directness) to your SOUL.md costs about 2-3% in raw task completion but increases user satisfaction significantly. In my testing, I found myself working longer and more productively with agents that had personality. The key is keeping personality lightweight: Four lines. That's all you need. Don't write a character sheet. Configurations that were updated weekly outperformed static ones by 19% after the first month. Your workflow evolves, your preferences change, and your SOUL.md should reflect that. The best practice I found: The agents with regularly updated SOUL.md files felt noticeably more aligned with their users over time. Across all 100 configurations, these five patterns consistently scored highest: If you're starting from scratch, here's the formula that works: Building a SOUL.md from scratch is time-consuming, especially when you're trying to get the structure right. I've compiled everything I learned from this experiment into ready-to-use templates. If you want to see what a well-structured SOUL.md looks like in practice, I put together a Mega Pack of 100 SOUL.md Templates covering every use case I tested — from backend development to content creation to DevOps. Each template is based on the configurations that actually performed well in this experiment. These aren't generic fill-in-the-blank templates. Each one is built on the patterns that scored highest in testing: optimal length, all five critical sections, concrete examples, modal instructions, and memory integration built in. For a lighter starting point, there's also a 20 Template Pack with the top performers from each category, or a Free Starter Pack if you just want to see the format and start experimenting. The difference between a mediocre agent and a great one isn't the model — it's the SOUL.md. The data is clear on that. What patterns have you found work well in your SOUL.md? I'd love to compare notes — drop a comment below. Templates let you quickly answer FAQs or store snippets for re-use. Are you sure you want to hide this comment? It will become hidden in your post, but will still be visible via the comment's permalink. Hide child comments as well For further actions, you may consider blocking this person and/or reporting abuse CODE_BLOCK: Write clean, maintainable code. Enter fullscreen mode Exit fullscreen mode CODE_BLOCK: Write clean, maintainable code. CODE_BLOCK: Write clean, maintainable code. CODE_BLOCK: Write clean, maintainable code. For example: - Functions under 20 lines - Descriptive variable names (userEmail, not ue) - Early returns over nested conditionals - Comments explain "why," not "what" Enter fullscreen mode Exit fullscreen mode CODE_BLOCK: Write clean, maintainable code. For example: - Functions under 20 lines - Descriptive variable names (userEmail, not ue) - Early returns over nested conditionals - Comments explain "why," not "what" CODE_BLOCK: Write clean, maintainable code. For example: - Functions under 20 lines - Descriptive variable names (userEmail, not ue) - Early returns over nested conditionals - Comments explain "why," not "what" COMMAND_BLOCK: ## Default Mode [baseline behavior] ## When Reviewing Code [specific review behavior] ## When Debugging [specific debug behavior] ## When Writing Documentation [specific docs behavior] Enter fullscreen mode Exit fullscreen mode COMMAND_BLOCK: ## Default Mode [baseline behavior] ## When Reviewing Code [specific review behavior] ## When Debugging [specific debug behavior] ## When Writing Documentation [specific docs behavior] COMMAND_BLOCK: ## Default Mode [baseline behavior] ## When Reviewing Code [specific review behavior] ## When Debugging [specific debug behavior] ## When Writing Documentation [specific docs behavior] COMMAND_BLOCK: ## Memory Protocol - Read MEMORY.md at session start for long-term context - Read today's daily note for recent decisions - Record important decisions and preferences to memory - When a correction is made, note it to prevent recurrence Enter fullscreen mode Exit fullscreen mode COMMAND_BLOCK: ## Memory Protocol - Read MEMORY.md at session start for long-term context - Read today's daily note for recent decisions - Record important decisions and preferences to memory - When a correction is made, note it to prevent recurrence COMMAND_BLOCK: ## Memory Protocol - Read MEMORY.md at session start for long-term context - Read today's daily note for recent decisions - Record important decisions and preferences to memory - When a correction is made, note it to prevent recurrence COMMAND_BLOCK: ## Personality - Direct and pragmatic - Dry humor when appropriate - Admits uncertainty honestly - Doesn't over-explain obvious things Enter fullscreen mode Exit fullscreen mode COMMAND_BLOCK: ## Personality - Direct and pragmatic - Dry humor when appropriate - Admits uncertainty honestly - Doesn't over-explain obvious things COMMAND_BLOCK: ## Personality - Direct and pragmatic - Dry humor when appropriate - Admits uncertainty honestly - Doesn't over-explain obvious things - 100 unique SOUL.md configurations - 12 different use case categories (backend dev, frontend dev, DevOps, data analysis, content writing, code review, debugging, project management, research, API design, testing, documentation) - Each configuration ran through 20 standardized tasks - Scored on: accuracy, relevance, consistency, and "correction rate" (how often I had to fix or redirect the agent) - Task completion without intervention (%) - Response relevance score (1-5) - Consistency across sessions (1-5) - Average corrections per task - Tech Stack Definition — +18% task completion - Decision Framework — +15% task completion - Communication Style — +12% task completion - Identity/Role — +11% task completion - Boundaries/Safety — +9% task completion - Week 1-2: Update SOUL.md after every session based on corrections - Week 3-4: Update weekly with accumulated learnings - Month 2+: Update bi-weekly or when workflows change - Strong identity with specific expertise - Detailed tech stack - Strict boundaries - Minimal personality - Score: 91% task completion, 0.8 corrections/task - Moderate identity - Modal instructions for different tasks - Decision framework - Memory integration - Score: 89% task completion, 0.9 corrections/task - Peer-level identity - Strong communication style section - Code-first response preference - Proactive suggestion behavior - Score: 88% task completion, 1.0 corrections/task - Conservative decision framework - Extensive boundary definitions - Checklist-driven approach - Confirmation requirements for risky actions - Score: 87% task completion, 0.7 corrections/task - Structured output preferences - Source citation requirements - Uncertainty quantification - Iterative refinement protocol - Score: 85% task completion, 1.1 corrections/task - Keep it 800-1,200 words - Always include: Tech stack, decision framework, communication style, identity, boundaries - Use concrete examples alongside abstract rules - Add modal instructions for your top 3 task types - Set up memory integration from day one - Use negative instructions for safety boundaries - Keep personality to 4 lines or fewer - Update regularly — weekly for the first month, then bi-weekly