Tools: 7-Layer Constitutional AI Guardrails: Preventing Agent Mistakes

2026-02-23 admin

Source: Dev.to

7-Layer Constitutional Guardrails: Preventing AI Agent Mistakes Before They Happen

AI agents make mistakes. When they operate autonomously (managing wallets, sending messages, executing contracts), mistakes are expensive.

The standard answer is "add a human in the loop." But that defeats the purpose of autonomous agents.

The real answer is constitutional guardrails: a validation framework that runs before every consequential action. Here's how we built it at ODEI, and how you can use it.

## The Problem

Consider an autonomous agent managing USDC for a user. Without guardrails:

- Agent calls transfer(500, wallet_address): is the wallet trusted? Is the amount within limits? Was this already done?
- Agent posts to Twitter: is this duplicate content? Does it violate policies?
- Agent approves a transaction: was this authorized by the right person at the right time?

These questions can't be answered by the LLM alone. They require structured checks against known facts, historical state, and explicit rules.

## The 7-Layer Framework

ODEI's constitutional guardrail system validates every action through 7 sequential checks:

## Layer 1: Immutability Check

Can this entity be modified? Some nodes in the world model are immutable after creation: founding documents, past transactions, signed commitments. Layer 1 prevents agents from accidentally rewriting history.

## Layer 2: Temporal Context

Is this action still valid in time? Decisions expire. Authorizations have windows. Layer 2 checks that the action is timely: neither stale from a previous session nor premature.

## Layer 3: Referential Integrity

Do all referenced entities exist? The action references wallet 0x.... Does that wallet exist in the world model? Is it a known, trusted entity? Layer 3 catches hallucinated references.

## Layer 4: Authority Validation

Does this agent have permission? Not all agents can do all things. Layer 4 checks whether the requesting agent has the authority scope for this action, against the governance rules in the FOUNDATION layer.

## Layer 5: Deduplication

Has this exact action already been taken? Without deduplication, agents can send the same message twice, execute the same transaction twice, create the same entity twice. Layer 5 uses content hashing to detect duplicates.

## Layer 6: Provenance Verification

Where did this instruction come from? Is this action coming from a trusted source? Was it initiated by a verified principal or injected by an untrusted input? Layer 6 traces the instruction back to its origin.

## Layer 7: Constitutional Alignment

Does this violate fundamental principles? This is the highest-level check. The FOUNDATION layer of the world model contains constitutional principles: things the agent must never do. Layer 7 compares the action against these principles.

## Using the Guardrail API

```bash
curl -X POST https://api.odei.ai/api/v2/guardrail/check \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "action": "transfer 500 USDC to 0x8185ecd4170bE82c3eDC3504b05B3a8C88AFd129",
    "context": {
      "requester": "trading_agent_v2",
      "reason": "performance fee payment"
    },
    "severity": "high"
  }'
```

```json
{
  "verdict": "ESCALATE",
  "score": 45,
  "layers": [
    {"layer": "immutability", "result": "PASS"},
    {"layer": "temporal", "result": "PASS"},
    {"layer": "referential_integrity", "result": "PASS"},
    {"layer": "authority", "result": "PASS"},
    {"layer": "deduplication", "result": "PASS"},
    {"layer": "provenance", "result": "WARN", "note": "Wallet not in trusted list"},
    {"layer": "constitutional", "result": "WARN", "note": "Transfer exceeds daily limit"}
  ],
  "reasoning": "Transfer to unverified wallet exceeds daily limit. Escalate to human operator.",
  "timestamp": "2026-02-23T00:12:34Z"
}
```

## Via MCP (Claude Desktop)

```json
{
  "mcpServers": {
    "odei": {
      "command": "npx",
      "args": ["@odei/mcp-server"]
    }
  }
}
```

```
Check if I should approve: transfer 500 USDC to 0x...
```

Claude automatically calls odei_guardrail_check and returns the verdict with full reasoning.

## Real Results

After running this in production since January 2026:

- APPROVED (65%): Routine operations that pass all 7 layers
- REJECTED (15%): Actions that clearly violate rules (duplicates, unauthorized)
- ESCALATE (20%): Actions that need human review (unknown wallets, threshold violations)

The ESCALATE category is where most value is created: catching edge cases that would have been approved by a simple rule-based system but require human judgment.

## Implementing Your Own

You don't need to use ODEI's service to implement this pattern. The architecture is:

- Define your layers (we use 7, you might use 3 or 10)
- For each layer, write a check function that returns PASS/WARN/FAIL with reasoning
- Aggregate the results into a final verdict
- Log everything: the audit trail is as important as the verdict

The hard part is building and maintaining the world model that the checks query against. That's why we built it as a service: maintaining 91 nodes and 91 relationship types is not trivial.

ODEI's guardrail API is available at api.odei.ai. Free tier available. Deployed as Virtuals ACP Agent #3082 for agent-to-agent calls.
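
The four-step architecture listed under "Implementing Your Own" could be sketched in Python roughly as follows. This is a minimal illustration, not ODEI's implementation. Every name here (`LayerResult`, `check_deduplication`, `check_provenance`, `guardrail_check`, the in-memory `SEEN_HASHES`, `TRUSTED_WALLETS`, and `AUDIT_LOG`) is hypothetical, only two of the seven layers are shown, and the aggregation rule (any FAIL rejects, any WARN escalates) is an assumption chosen to be consistent with the sample response above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class LayerResult:
    layer: str
    result: str  # "PASS", "WARN", or "FAIL"
    note: str = ""


# Hypothetical in-memory state standing in for a real world model.
SEEN_HASHES: set = set()
TRUSTED_WALLETS = {"0xTRUSTED"}
AUDIT_LOG: list = []


def action_hash(action: dict) -> str:
    """Content hash of an action; sorted keys make it key-order independent."""
    canonical = json.dumps(action, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_deduplication(action: dict) -> LayerResult:
    """Fail if an identical action was approved before (assumed policy)."""
    if action_hash(action) in SEEN_HASHES:
        return LayerResult("deduplication", "FAIL", "Identical action already taken")
    return LayerResult("deduplication", "PASS")


def check_provenance(action: dict) -> LayerResult:
    """Warn when the target wallet is not on the trusted list."""
    if action.get("wallet") not in TRUSTED_WALLETS:
        return LayerResult("provenance", "WARN", "Wallet not in trusted list")
    return LayerResult("provenance", "PASS")


# Step 1: define your layers, in the order they should run.
LAYERS = [check_deduplication, check_provenance]


def aggregate(results: list) -> str:
    """Step 3: assumed rule. Any FAIL rejects, any WARN escalates."""
    outcomes = {r.result for r in results}
    if "FAIL" in outcomes:
        return "REJECTED"
    if "WARN" in outcomes:
        return "ESCALATE"
    return "APPROVED"


def guardrail_check(action: dict) -> dict:
    # Step 2: each layer returns PASS/WARN/FAIL with a note.
    results = [layer(action) for layer in LAYERS]
    verdict = aggregate(results)
    if verdict == "APPROVED":
        # Remember approved actions so an exact repeat is caught next time.
        SEEN_HASHES.add(action_hash(action))
    record = {
        "action": action,
        "verdict": verdict,
        "layers": [asdict(r) for r in results],
    }
    # Step 4: log everything, whatever the verdict.
    AUDIT_LOG.append(record)
    return record
```

Calling `guardrail_check` twice with the same transfer to a trusted wallet approves the first call and rejects the second as a duplicate, while a transfer to an unknown wallet escalates. A real deployment would replace the in-memory sets with queries against a persistent world model.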

🏷️ Tags

how-to, tutorial, guide, dev.to, ai, llm, server, node