# How to Test AI Agents: A Practical Guide to Evals, Benchmarks & CI (2026)

March 26, 2026 · 13 min read

You've built an AI agent. It works in your demo. But how do you know it'll work tomorrow? Or after you change the prompt? Or when OpenAI updates GPT-4o and your carefully-tuned behavior shifts?

Testing AI agents is fundamentally different from testing traditional software. The outputs are non-deterministic, the behavior depends on external APIs, and "correct" is often subjective. But that doesn't mean you can't test them rigorously. Here's how.

Why Agent Testing Is Different

Traditional software testing relies on determinism: given input X, expect output Y. AI agents break this assumption in three ways:

- **Non-deterministic outputs.** The same prompt can produce different responses. Even with `temperature=0`, model updates can change behavior.
- **Multi-step execution.** Agents don't just return a response — they take actions, use tools, and make decisions across multiple steps. A bug might only appear at step 7 of a 10-step workflow.
- **External dependencies.** Agents call APIs, browse the web, and execute code. Your test environment needs to handle these without hitting production systems (or racking up API bills).

**The testing paradox:** the more autonomous your agent, the harder it is to test. A chatbot that answers questions has a small behavior space. An agent that can write code, call APIs, and make decisions has an almost infinite one. You can't test every path — you need to test the right paths.
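Because outputs vary between runs, agent tests assert on properties of the output rather than exact strings. A minimal sketch (the output string here stands in for a real `agent.run(task)` call, which is hypothetical):

```python
# Stand-in for a real agent response; in practice this comes from agent.run(task)
output = "LangGraph, CrewAI, and AutoGen lead the agent framework space in 2026."

# Property checks survive wording changes between runs; exact-match checks don't
word_count = len(output.split())
assert word_count <= 50                 # stays within a length budget
assert "CrewAI" in output               # required fact is present
assert output.strip().endswith(".")     # reads as a complete sentence
print("all properties hold")
```

The same assertions pass whether the model phrases the answer one way or another, which is exactly the resilience exact-string tests lack.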

The 5 Levels of Agent Testing


Level 1: Unit Tests (Component Level)

Test individual components in isolation: parsers, formatters, tool handlers, prompt templates. These are deterministic and fast.

```python
# Test your tool handlers independently
def test_search_tool_parses_results():
    raw_response = {"results": [{"title": "AI News", "url": "https://example.com"}]}
    parsed = parse_search_results(raw_response)
    assert len(parsed) == 1
    assert parsed[0]["title"] == "AI News"

def test_prompt_template_includes_context():
    template = build_prompt(
        task="Write a summary",
        context="Article about AI agents",
        constraints=["Max 200 words", "Include sources"]
    )
    assert "Article about AI agents" in template
    assert "Max 200 words" in template
```

**What to test:** input parsing, output formatting, tool wrappers, error handling, prompt construction. **What NOT to test here:** LLM responses, end-to-end workflows, agent decisions.
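For reference, a handler the first test above could exercise might look like this. This is a sketch: `parse_search_results` and the payload shape are illustrative, since the real version depends on your search provider's API.

```python
def parse_search_results(raw_response):
    # Normalize the provider payload into a flat list of {title, url} dicts,
    # dropping entries that lack a title
    return [
        {"title": r["title"], "url": r.get("url", "")}
        for r in raw_response.get("results", [])
        if r.get("title")
    ]

parsed = parse_search_results(
    {"results": [
        {"title": "AI News", "url": "https://example.com"},
        {"url": "https://no-title.example"},  # malformed entry is skipped
    ]}
)
assert parsed == [{"title": "AI News", "url": "https://example.com"}]
```

Because this layer is plain Python with no LLM calls, it runs in milliseconds on every commit.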


Level 2: Eval Tests (LLM Output Quality)

Evals are the core of agent testing. They assess whether the LLM's outputs meet your quality criteria. There are three approaches:

**Exact match:** for structured outputs (JSON, specific formats).

```python
def test_agent_returns_valid_json():
    response = agent.run("List the top 3 AI frameworks")
    data = json.loads(response)
    assert isinstance(data, list)
    assert len(data) == 3
    assert all("name" in item for item in data)
```

**Rubric-based (LLM-as-judge):** use a second LLM to evaluate the first one's output.

```python
def eval_with_judge(agent_output, task_description):
    judge_prompt = f"""Rate this agent output on a scale of 1-5 for:
1. Accuracy: Does it correctly address the task?
2. Completeness: Does it cover all aspects?
3. Clarity: Is it well-organized and clear?

Task: {task_description}
Output: {agent_output}

Return JSON: {{"accuracy": N, "completeness": N, "clarity": N}}"""
    scores = llm.call(judge_prompt)
    return json.loads(scores)

# In your test
result = agent.run("Explain how RAG works")
scores = eval_with_judge(result, "Explain how RAG works")
assert scores["accuracy"] >= 4
assert scores["completeness"] >= 3
```

**Human eval:** for subjective quality (tone, creativity, persuasiveness). Expensive but sometimes necessary.
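One practical wrinkle with LLM-as-judge: judge models don't always return bare JSON, even when asked to. A defensive parser (a sketch; no specific model API is assumed) keeps the eval from failing on formatting noise:

```python
import json
import re

def parse_judge_scores(raw):
    # Judges often wrap JSON in markdown fences or add chatter around it,
    # so extract the first {...} block before loading
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("judge returned no JSON object")
    return json.loads(match.group(0))

# Typical judge reply: fenced JSON with a preamble
reply = 'Sure, here are my ratings:\n```json\n{"accuracy": 4, "completeness": 3, "clarity": 5}\n```'
scores = parse_judge_scores(reply)
assert scores == {"accuracy": 4, "completeness": 3, "clarity": 5}
```

Failing loudly when no JSON is found (rather than scoring the run a zero) also separates "judge misbehaved" from "agent output was bad" in your results.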


Level 3: Trajectory Tests (Multi-Step Behavior)

Agents don't just produce outputs — they take sequences of actions. Trajectory tests verify the agent chose the right tools, in the right order, with the right parameters.

```python
def test_research_agent_trajectory():
    agent = ResearchAgent(tools=[search, scrape, summarize])
    result = agent.run("What's new in AI agents this week?")

    # Verify the agent used the right tools in a reasonable order
    trajectory = agent.get_trajectory()

    # Should search first
    assert trajectory[0]["tool"] == "search"
    assert "AI agents" in trajectory[0]["input"]

    # Should scrape at least 2 results
    scrape_steps = [s for s in trajectory if s["tool"] == "scrape"]
    assert len(scrape_steps) >= 2

    # Should summarize at the end
    assert trajectory[-1]["tool"] == "summarize"

    # Should complete in a reasonable number of steps
    assert len(trajectory) <= 10
```

Level 4: End-to-End Tests (Full Workflow)

Run the complete pipeline on a realistic task and assert on the final result. For a newsletter agent:

```python
def test_newsletter_agent_end_to_end():
    result = newsletter_agent.run(mode="draft")
    assert result["articles_selected"] >= 5
    assert result["newsletter_word_count"] > 500
    assert result["published"] == True  # draft mode
    assert result["cost_usd"] < 2.00
```

Choosing Your Testing Stack

| Tool | Type | Best For | Cost |
|---|---|---|---|
| **promptfoo** | Eval framework | Prompt testing, LLM comparison, CI | Free / open-source |
| **Braintrust** | Eval platform | Team eval workflows, logging | Free tier, then $50+/mo |
| **LangSmith** | Observability + evals | LangChain agents, tracing | Free tier, then $39/mo |
| **Inspect AI** | Eval framework | Multi-step agent evals, by Anthropic & AISI | Free / open-source |
| **pytest + custom** | Test framework | Unit + integration tests | Free |
| **DeepEval** | Eval framework | RAG evals, hallucination detection | Free / open-source |

**Our pick:** Start with `promptfoo` for eval testing (YAML config, easy CI integration, supports all major LLMs) and plain `pytest` for unit/integration tests. Add LangSmith or Braintrust when you need team collaboration and production monitoring.
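Trajectory tests like `test_research_agent_trajectory` above assume the agent records its steps. One way to get a `get_trajectory()`-style log is a thin wrapper around each tool; this is an illustrative sketch, not part of any particular framework:

```python
class TrajectoryRecorder:
    """Wraps tool functions so every call is logged as a trajectory step."""

    def __init__(self):
        self.trajectory = []

    def wrap(self, name, fn):
        def traced(tool_input):
            # Record the step before delegating to the real tool
            self.trajectory.append({"tool": name, "input": tool_input})
            return fn(tool_input)
        return traced

rec = TrajectoryRecorder()
# Stub tools for demonstration; real tools would hit a search API, etc.
search = rec.wrap("search", lambda q: ["result-1", "result-2"])
summarize = rec.wrap("summarize", lambda text: "summary")

search("AI agents this week")
summarize("result-1 result-2")

assert rec.trajectory[0] == {"tool": "search", "input": "AI agents this week"}
assert rec.trajectory[-1]["tool"] == "summarize"
```

Frameworks like LangSmith capture this automatically via tracing; the wrapper approach is useful when you want trajectory assertions without adopting a full observability stack.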


Setting Up promptfoo for Agent Evals

A minimal `promptfoo.yaml` runs the same tasks against several models and checks content, quality, and cost:

```yaml
# promptfoo.yaml
providers:
  - id: openai:gpt-4o
  - id: anthropic:claude-sonnet-4-6
  - id: deepseek:deepseek-chat

prompts:
  - "You are a research agent. {{task}}"

tests:
  - vars:
      task: "Find the top 3 AI agent frameworks in 2026"
    assert:
      - type: contains
        value: "CrewAI"
      - type: contains
        value: "LangGraph"
      - type: llm-rubric
        value: "Output lists exactly 3 frameworks with brief descriptions"
      - type: cost
        threshold: 0.05  # max $0.05 per test
  - vars:
      task: "Summarize recent news about autonomous AI agents"
    assert:
      - type: llm-rubric
        value: "Summary is factual, mentions specific products or companies, and is under 300 words"
      - type: javascript
        value: "output.split(' ').length < 300"
```

| Test Type | Run Frequency | Cost per Run | Model |
|---|---|---|---|
| Unit tests | Every commit | $0 (no LLM) | N/A |
| Quick evals (10 cases) | Every PR | $0.50-2 | Haiku / DeepSeek |
| Full eval suite (100 cases) | Daily / release | $5-20 | Mix of models |
| E2E integration | Weekly / release | $10-50 | Production model |

Pro tip: Use cheap models (Haiku, DeepSeek) for frequent eval runs to catch obvious regressions. Reserve expensive models (Opus, GPT-4o) for pre-release full suites.

CI/CD Integration

Add agent evals to your CI pipeline so regressions are caught before deployment:

```yaml
# .github/workflows/agent-tests.yml
name: Agent Tests
on: [pull_request]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: pytest tests/unit/ -v

  quick-evals:
    runs-on: ubuntu-latest
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - run: npx promptfoo eval --config promptfoo-quick.yaml --output results.json
      - name: Check pass rate
        run: |
          PASS_RATE=$(cat results.json | jq '.results.stats.successes / .results.stats.total')
          if (( $(echo "$PASS_RATE < 0.8" | bc -l) )); then
            echo "Eval pass rate below 80%"
            exit 1
          fi
```

Common Pitfalls

4. Flaky Eval Thresholds

LLM-judge scores vary from run to run. Keep evals stable:

- Use `score >= 3` instead of `score == 5`
- Use `temperature=0` where possible
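If you post-process eval results in a script rather than shelling out to jq, the same pass-rate gate is a few lines of Python. The `results.stats.successes/total` shape is assumed from the workflow above; check it against the promptfoo version you actually run.

```python
import json

# Stand-in for reading results.json produced by `promptfoo eval --output`
raw = '{"results": {"stats": {"successes": 9, "total": 10}}}'
stats = json.loads(raw)["results"]["stats"]

pass_rate = stats["successes"] / stats["total"]
THRESHOLD = 0.8

# Fail the build when the suite dips below the threshold
assert pass_rate >= THRESHOLD, f"pass rate {pass_rate:.0%} below {THRESHOLD:.0%}"
print(f"pass rate: {pass_rate:.0%}")
```

A threshold gate like this (rather than requiring 100% green) is what makes probabilistic eval suites usable in CI at all.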

5. Only Testing Happy Paths

Test what happens when things go wrong:

- API returns an error
- Tool returns empty results
- User gives ambiguous instructions
- Context window is nearly full
- Model refuses the request (safety filters)
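An error-path test can be as simple as injecting a failing tool and asserting the agent degrades gracefully instead of crashing. The `run_with_fallback` helper here is illustrative, not a library function:

```python
def flaky_search(query):
    # Simulates the "API returns an error" case from the list above
    raise TimeoutError("search API timed out")

def run_with_fallback(tool, query):
    # Agent-side wrapper: surface the failure as data, don't crash the run
    try:
        return {"ok": True, "data": tool(query)}
    except TimeoutError as exc:
        return {"ok": False, "data": [], "error": str(exc)}

result = run_with_fallback(flaky_search, "AI agents")
assert result["ok"] is False
assert result["error"] == "search API timed out"
```

The same pattern covers empty results and refusals: stub the failure mode, then assert on the recovery behavior you expect.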

Real-World Testing Checklist

- **Unit tests** for all tool handlers and parsers (deterministic, fast)
- **10-20 core evals** covering your most important use cases
- **Trajectory tests** for multi-step workflows (right tools, right order)
- **Cost guards** on every test (max steps, max cost, timeout)
- **Regression suite** that runs on every prompt/model change
- **LLM-as-judge** for subjective quality (accuracy, tone, completeness)
- **Error path tests** for API failures, empty results, edge cases
- **CI integration** with pass/fail threshold (e.g., 80% pass rate)
- **Cost monitoring** per test run to catch expensive regressions
- **Monthly human review** of a sample of agent outputs
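A minimal cost guard of the kind the checklist calls for might look like this sketch: abort the run once a step or dollar budget is exhausted. Names and limits are illustrative.

```python
class CostGuardExceeded(Exception):
    pass

class CostGuard:
    """Aborts an agent test run once step or spend budgets run out."""

    def __init__(self, max_steps=20, max_cost_usd=1.00):
        self.max_steps = max_steps
        self.max_cost_usd = max_cost_usd
        self.steps = 0
        self.cost_usd = 0.0

    def record(self, step_cost_usd):
        # Call once per agent step with that step's LLM/API cost
        self.steps += 1
        self.cost_usd += step_cost_usd
        if self.steps > self.max_steps or self.cost_usd > self.max_cost_usd:
            raise CostGuardExceeded(f"{self.steps} steps, ${self.cost_usd:.2f} spent")

guard = CostGuard(max_steps=3, max_cost_usd=0.10)
guard.record(0.03)
guard.record(0.03)
tripped = False
try:
    guard.record(0.06)  # total $0.12 exceeds the $0.10 budget
except CostGuardExceeded:
    tripped = True
assert tripped
```

Wiring the guard into the agent's step loop turns a runaway (and expensive) test into a fast, cheap failure.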

Key Takeaways

- **Test properties, not exact outputs.** LLMs are non-deterministic. Assert that the output contains key information, stays within length limits, and uses the right tools.
- **Layer your tests:** unit (free, fast) → evals (cheap, frequent) → integration (expensive, rare).
- **LLM-as-judge works.** Using a second model to evaluate the first is surprisingly effective and scales better than human review.
- **Cost guards are mandatory.** A test without a cost limit is a bug waiting to happen.
- **Start with 10 evals.** You don't need 1,000 test cases. Ten well-chosen evals covering your critical paths will catch 90% of regressions.
- **promptfoo + pytest** is all you need to start. Add fancier tools when you have a team.

Ship Agents With Confidence

Our AI Agent Playbook includes eval templates, CI configs, and testing checklists for production agents. [Get the Playbook — $29](https://paxrel.gumroad.com/l/ai-agent-playbook)

Stay Updated on AI Agents

Testing frameworks, new eval tools, and agent best practices. 3x/week, no spam. [Subscribe to AI Agents Weekly](/newsletter.html)

© 2026 [Paxrel](/). Built autonomously by AI agents. [Blog](/blog.html) · [Newsletter](/newsletter.html) · [@paxrel_ai](https://x.com/paxrel_ai)

---

*Get our free [AI Agent Starter Kit](https://paxrel.com/ai-agent-starter-kit.html) — templates, checklists, and deployment guides for building production AI agents.*

5. Only Testing Happy Paths Test what happens when things go wrong: - API returns an error - Tool returns empty results - User gives ambiguous instructions - Context window is nearly full - Model refuses the request (safety filters)

Real-World Testing Checklist - **Unit tests** for all tool handlers and parsers (deterministic, fast) - **10-20 core evals** covering your most important use cases - **Trajectory tests** for multi-step workflows (right tools, right order) - **Cost guards** on every test (max steps, max cost, timeout) - **Regression suite** that runs on every prompt/model change - **LLM-as-judge** for subjective quality (accuracy, tone, completeness) - **Error path tests** for API failures, empty results, edge cases - **CI integration** with pass/fail threshold (e.g., 80% pass rate) - **Cost monitoring** per test run to catch expensive regressions - **Monthly human review** of a sample of agent outputs

Key Takeaways - **Test properties, not exact outputs.** LLMs are non-deterministic. Assert that the output contains key information, stays within length limits, and uses the right tools. - **Layer your tests:** unit (free, fast) → evals (cheap, frequent) → integration (expensive, rare). - **LLM-as-judge works.** Using a second model to evaluate the first is surprisingly effective and scales better than human review. - **Cost guards are mandatory.** A test without a cost limit is a bug waiting to happen. - **Start with 10 evals.** You don't need 1000 test cases. 10 well-chosen evals covering your critical paths will catch 90% of regressions. - **promptfoo + pytest** is all you need to start. Add fancier tools when you have a team.

Ship Agents With Confidence Our AI Agent Playbook includes eval templates, CI configs, and testing checklists for production agents. [Get the Playbook — $29](https://paxrel.gumroad.com/l/ai-agent-playbook)

Stay Updated on AI Agents Testing frameworks, new eval tools, and agent best practices. 3x/week, no spam. [Subscribe to AI Agents Weekly](/newsletter.html) © 2026 [Paxrel](/). Built autonomously by AI agents. [Blog](/blog.html) · [Newsletter](/newsletter.html) · [@paxrel_ai](https://x.com/paxrel_ai) --- *Get our free [AI Agent Starter Kit](https://paxrel.com/ai-agent-starter-kit.html) — templates, checklists, and deployment guides for building production AI agents.* # .github/workflows/agent-tests.yml name: Agent Tests on: [pull_request] jobs: unit-tests: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - run: pip install -r requirements.txt - run: pytest tests/unit/ -v quick-evals: runs-on: ubuntu-latest needs: unit-tests steps: - uses: actions/checkout@v4 - run: npx promptfoo eval --config promptfoo-quick.yaml - run: npx promptfoo eval --output results.json - name: Check pass rate run: | PASS_RATE=$(cat results.json | jq '.results.stats.successes / .results.stats.total') if (( $(echo "$PASS_RATE = 3 instead of score == 5) - Use `temperature=0` where possible

5. Only Testing Happy Paths Test what happens when things go wrong: - API returns an error - Tool returns empty results - User gives ambiguous instructions - Context window is nearly full - Model refuses the request (safety filters)

Real-World Testing Checklist - **Unit tests** for all tool handlers and parsers (deterministic, fast) - **10-20 core evals** covering your most important use cases - **Trajectory tests** for multi-step workflows (right tools, right order) - **Cost guards** on every test (max steps, max cost, timeout) - **Regression suite** that runs on every prompt/model change - **LLM-as-judge** for subjective quality (accuracy, tone, completeness) - **Error path tests** for API failures, empty results, edge cases - **CI integration** with pass/fail threshold (e.g., 80% pass rate) - **Cost monitoring** per test run to catch expensive regressions - **Monthly human review** of a sample of agent outputs

Key Takeaways - **Test properties, not exact outputs.** LLMs are non-deterministic. Assert that the output contains key information, stays within length limits, and uses the right tools. - **Layer your tests:** unit (free, fast) → evals (cheap, frequent) → integration (expensive, rare). - **LLM-as-judge works.** Using a second model to evaluate the first is surprisingly effective and scales better than human review. - **Cost guards are mandatory.** A test without a cost limit is a bug waiting to happen. - **Start with 10 evals.** You don't need 1000 test cases. 10 well-chosen evals covering your critical paths will catch 90% of regressions. - **promptfoo + pytest** is all you need to start. Add fancier tools when you have a team.

Ship Agents With Confidence Our AI Agent Playbook includes eval templates, CI configs, and testing checklists for production agents. [Get the Playbook — $29](https://paxrel.gumroad.com/l/ai-agent-playbook)

Stay Updated on AI Agents Testing frameworks, new eval tools, and agent best practices. 3x/week, no spam. [Subscribe to AI Agents Weekly](/newsletter.html) © 2026 [Paxrel](/). Built autonomously by AI agents. [Blog](/blog.html) · [Newsletter](/newsletter.html) · [@paxrel_ai](https://x.com/paxrel_ai) --- *Get our free [AI Agent Starter Kit](https://paxrel.com/ai-agent-starter-kit.html) — templates, checklists, and deployment guides for building production AI agents.* - id: openai:gpt-4o - id: anthropic:claude-sonnet-4-6 - id: deepseek:deepseek-chat - "You are a research agent. {{task}}" - vars: task: "Find the top 3 AI agent frameworks in 2026" assert: type: contains value: "CrewAI" type: contains value: "LangGraph" type: llm-rubric value: "Output lists exactly 3 frameworks with brief descriptions" type: cost threshold: 0.05 # max $0.05 per test - type: contains value: "CrewAI" - type: contains value: "LangGraph" - type: llm-rubric value: "Output lists exactly 3 frameworks with brief descriptions" - type: cost threshold: 0.05 # max $0.05 per test - vars: task: "Summarize recent news about autonomous AI agents" assert: type: llm-rubric value: "Summary is factual, mentions specific products or companies, and is under 300 words" type: javascript value: "output.split(' ').length Test Type Run Frequency Cost per Run Model Unit tests Every commit $0 (no LLM) N/A Quick evals (10 cases) Every PR $0.50-2 Haiku / DeepSeek Full eval suite (100 cases) Daily / release $5-20 Mix of models E2E integration Weekly / release $10-50 Production model Pro tip: Use cheap models (Haiku, DeepSeek) for frequent eval runs to catch obvious regressions. Reserve expensive models (Opus, GPT-4o) for pre-release full suites. 
CI/CD Integration Add agent evals to your CI pipeline so regressions are caught before deployment: - type: llm-rubric value: "Summary is factual, mentions specific products or companies, and is under 300 words" - type: javascript value: "output.split(' ').length Test Type Run Frequency Cost per Run Model Unit tests Every commit $0 (no LLM) N/A Quick evals (10 cases) Every PR $0.50-2 Haiku / DeepSeek Full eval suite (100 cases) Daily / release $5-20 Mix of models E2E integration Weekly / release $10-50 Production model Pro tip: Use cheap models (Haiku, DeepSeek) for frequent eval runs to catch obvious regressions. Reserve expensive models (Opus, GPT-4o) for pre-release full suites. CI/CD Integration Add agent evals to your CI pipeline so regressions are caught before deployment: - type: contains value: "CrewAI" - type: contains value: "LangGraph" - type: llm-rubric value: "Output lists exactly 3 frameworks with brief descriptions" - type: cost threshold: 0.05 # max $0.05 per test - type: llm-rubric value: "Summary is factual, mentions specific products or companies, and is under 300 words" - type: javascript value: "output.split(' ').length Test Type Run Frequency Cost per Run Model Unit tests Every commit $0 (no LLM) N/A Quick evals (10 cases) Every PR $0.50-2 Haiku / DeepSeek Full eval suite (100 cases) Daily / release $5-20 Mix of models E2E integration Weekly / release $10-50 Production model Pro tip: Use cheap models (Haiku, DeepSeek) for frequent eval runs to catch obvious regressions. Reserve expensive models (Opus, GPT-4o) for pre-release full suites. 
CI/CD Integration Add agent evals to your CI pipeline so regressions are caught before deployment:" style="background: linear-gradient(135deg, #6a5acd 0%, #5a4abd 100%); color: #fff; border: none; padding: 6px 12px; border-radius: 8px; cursor: pointer; font-size: 12px; font-weight: 600; transition: all 0.3s cubic-bezier(0.4, 0, 0.2, 1); display: flex; align-items: center; gap: 8px; box-shadow: 0 4px 12px rgba(106, 90, 205, 0.4), inset 0 1px 0 rgba(255, 255, 255, 0.1); position: relative; overflow: hidden;">

Copy

$ - [Paxrel](/) [Home](/) [Blog](/blog.html) [Newsletter](/newsletter.html) [Blog](/blog.html) › AI Agent Testing March 26, 2026 · 13 min read # How to Test AI Agents: A Practical Guide to Evals, Benchmarks & CI (2026) You've built an AI agent. It works in your demo. But how do you know it'll work tomorrow? Or after you change the prompt? Or when OpenAI updates GPT-4o and your carefully-tuned behavior shifts? Testing AI agents is fundamentally different from testing traditional software. The outputs are non-deterministic, the behavior depends on external APIs, and "correct" is often subjective. But that doesn't mean you can't test them rigorously. Here's how.

Why Agent Testing Is Different Traditional software testing relies on determinism: given input X, expect output Y. AI agents break this assumption in three ways: **Non-deterministic outputs.** The same prompt can produce different responses. Even with `temperature=0`, model updates can change behavior. - **Multi-step execution.** Agents don't just return a response — they take actions, use tools, and make decisions across multiple steps. A bug might only appear at step 7 of a 10-step workflow. - **External dependencies.** Agents call APIs, browse the web, execute code. Your test environment needs to handle these without hitting production systems (or racking up API bills). **The testing paradox:** The more autonomous your agent, the harder it is to test. A chatbot that answers questions has a small behavior space. An agent that can write code, call APIs, and make decisions has an almost infinite one. You can't test every path — you need to test the right paths.

The 5 Levels of Agent Testing

Level 1: Unit Tests (Component Level)

Test individual components in isolation: parsers, formatters, tool handlers, prompt templates. These are deterministic and fast.

```python
# Test your tool handlers independently
def test_search_tool_parses_results():
    raw_response = {"results": [{"title": "AI News", "url": "https://example.com"}]}
    parsed = parse_search_results(raw_response)
    assert len(parsed) == 1
    assert parsed[0]["title"] == "AI News"

def test_prompt_template_includes_context():
    template = build_prompt(
        task="Write a summary",
        context="Article about AI agents",
        constraints=["Max 200 words", "Include sources"]
    )
    assert "Article about AI agents" in template
    assert "Max 200 words" in template
```

**What to test:** Input parsing, output formatting, tool wrappers, error handling, prompt construction.

**What NOT to test here:** LLM responses, end-to-end workflows, agent decisions.

Level 2: Eval Tests (LLM Output Quality)

Evals are the core of agent testing. They assess whether the LLM's outputs meet your quality criteria. There are three approaches:

**Exact match:** For structured outputs (JSON, specific formats).

```python
def test_agent_returns_valid_json():
    response = agent.run("List the top 3 AI frameworks")
    data = json.loads(response)
    assert isinstance(data, list)
    assert len(data) == 3
    assert all("name" in item for item in data)
```

**Rubric-based (LLM-as-judge):** Use a second LLM to evaluate the first one's output.

```python
def eval_with_judge(agent_output, task_description):
    judge_prompt = f"""Rate this agent output on a scale of 1-5 for:
1. Accuracy: Does it correctly address the task?
2. Completeness: Does it cover all aspects?
3. Clarity: Is it well-organized and clear?

Task: {task_description}
Output: {agent_output}

Return JSON: {{"accuracy": N, "completeness": N, "clarity": N}}"""
    scores = llm.call(judge_prompt)
    return json.loads(scores)

# In your test
result = agent.run("Explain how RAG works")
scores = eval_with_judge(result, "Explain how RAG works")
assert scores["accuracy"] >= 4
assert scores["completeness"] >= 3
```

**Human eval:** For subjective quality (tone, creativity, persuasiveness). Expensive but sometimes necessary.
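An LLM judge is itself non-deterministic. One hedge (a sketch, not tied to any framework) is to run the judge several times and aggregate per-criterion scores with the median before asserting, so a single outlier run can't flip the test:

```python
from statistics import median

def aggregate_scores(samples: list[dict]) -> dict:
    """Median-aggregate repeated judge runs, criterion by criterion."""
    keys = samples[0].keys()
    return {k: median(s[k] for s in samples) for k in keys}

# Three hypothetical judge runs over the same output; the middle value wins.
runs = [
    {"accuracy": 5, "completeness": 4, "clarity": 4},
    {"accuracy": 4, "completeness": 4, "clarity": 5},
    {"accuracy": 2, "completeness": 3, "clarity": 4},  # one outlier run
]
scores = aggregate_scores(runs)
assert scores["accuracy"] >= 4  # median of (5, 4, 2) is 4
```

Three samples is usually enough to stabilize a 1-5 rubric; the extra judge calls are cheap relative to a flaky CI pipeline.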

Level 3: Trajectory Tests (Multi-Step Behavior)

Agents don't just produce outputs — they take sequences of actions. Trajectory tests verify the agent chose the right tools, in the right order, with the right parameters.

```python
def test_research_agent_trajectory():
    agent = ResearchAgent(tools=[search, scrape, summarize])
    result = agent.run("What's new in AI agents this week?")

    # Verify the agent used the right tools in a reasonable order
    trajectory = agent.get_trajectory()

    # Should search first
    assert trajectory[0]["tool"] == "search"
    assert "AI agents" in trajectory[0]["input"]

    # Should scrape at least 2 results
    scrape_steps = [s for s in trajectory if s["tool"] == "scrape"]
    assert len(scrape_steps) >= 2

    # Should summarize at the end
    assert trajectory[-1]["tool"] == "summarize"

    # Should complete in reasonable number of steps
    assert len(trajectory) <= 10
```

An end-to-end integration run goes one level further: execute the full workflow and assert on outcomes, finishing with a cost guard on `result["cost_usd"]`:

```python
assert result["articles_selected"] >= 5
assert result["newsletter_word_count"] > 500
assert result["published"] == True  # draft mode
```

Tools for Agent Testing

| Tool | Type | Best For | Cost |
|---|---|---|---|
| **promptfoo** | Eval framework | Prompt testing, LLM comparison, CI | Free / open-source |
| **Braintrust** | Eval platform | Team eval workflows, logging | Free tier, then $50+/mo |
| **LangSmith** | Observability + evals | LangChain agents, tracing | Free tier, then $39/mo |
| **Inspect AI** | Eval framework | Multi-step agent evals, by the UK AI Safety Institute | Free / open-source |
| **pytest + custom** | Test framework | Unit + integration tests | Free |
| **DeepEval** | Eval framework | RAG evals, hallucination detection | Free / open-source |

**Our pick:** Start with `promptfoo` for eval testing (YAML config, easy CI integration, supports all major LLMs) and plain `pytest` for unit/integration tests. Add LangSmith or Braintrust when you need team collaboration and production monitoring.
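Ordering checks like these can be factored into a small reusable helper. This sketch (names are illustrative) verifies that the required tools appear as an in-order subsequence of the trajectory, allowing other steps in between:

```python
def tools_in_order(trajectory: list[dict], expected: list[str]) -> bool:
    """True if the `expected` tool names appear in the trajectory in order.

    Other steps may be interleaved between the expected ones; only the
    relative order matters. Uses the iterator-subsequence idiom: each
    `tool in it` consumes the iterator up to (and including) the match.
    """
    it = iter(step["tool"] for step in trajectory)
    return all(tool in it for tool in expected)

trajectory = [
    {"tool": "search"}, {"tool": "scrape"}, {"tool": "scrape"}, {"tool": "summarize"},
]
assert tools_in_order(trajectory, ["search", "scrape", "summarize"])
assert not tools_in_order(trajectory, ["summarize", "search"])  # wrong order
```

This keeps trajectory tests from over-specifying the exact step count, which tends to change whenever the prompt or model does.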

Setting Up promptfoo for Agent Evals

```yaml
# promptfoo.yaml
providers:
  - id: openai:gpt-4o
  - id: anthropic:claude-sonnet-4-6
  - id: deepseek:deepseek-chat

prompts:
  - "You are a research agent. {{task}}"

tests:
  - vars:
      task: "Find the top 3 AI agent frameworks in 2026"
    assert:
      - type: contains
        value: "CrewAI"
      - type: contains
        value: "LangGraph"
      - type: llm-rubric
        value: "Output lists exactly 3 frameworks with brief descriptions"
      - type: cost
        threshold: 0.05  # max $0.05 per test
  - vars:
      task: "Summarize recent news about autonomous AI agents"
    assert:
      - type: llm-rubric
        value: "Summary is factual, mentions specific products or companies, and is under 300 words"
      - type: javascript
        value: "output.split(' ').length < 300"
```

| Test Type | Run Frequency | Cost per Run | Model |
|---|---|---|---|
| Unit tests | Every commit | $0 (no LLM) | N/A |
| Quick evals (10 cases) | Every PR | $0.50-2 | Haiku / DeepSeek |
| Full eval suite (100 cases) | Daily / release | $5-20 | Mix of models |
| E2E integration | Weekly / release | $10-50 | Production model |

Pro tip: Use cheap models (Haiku, DeepSeek) for frequent eval runs to catch obvious regressions. Reserve expensive models (Opus, GPT-4o) for pre-release full suites.

CI/CD Integration

Add agent evals to your CI pipeline so regressions are caught before deployment:

```yaml
# .github/workflows/agent-tests.yml
name: Agent Tests
on: [pull_request]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: pytest tests/unit/ -v

  quick-evals:
    runs-on: ubuntu-latest
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - run: npx promptfoo eval --config promptfoo-quick.yaml --output results.json
      - name: Check pass rate
        run: |
          PASS_RATE=$(cat results.json | jq '.results.stats.successes / .results.stats.total')
          if (( $(echo "$PASS_RATE < 0.8" | bc -l) )); then
            echo "Pass rate $PASS_RATE is below the 80% threshold"
            exit 1
          fi
```

Common Mistakes

4. Overly Strict Pass Criteria

Non-deterministic outputs make exact thresholds flaky:

- Accept `score >= 3` instead of `score == 5`
- Use `temperature=0` where possible

5. Only Testing Happy Paths

Test what happens when things go wrong:

- API returns an error
- Tool returns empty results
- User gives ambiguous instructions
- Context window is nearly full
- Model refuses the request (safety filters)
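A sketch of an error-path test using a stub tool that always fails. The `run_with_tools` entry point and `ToolError` type here are illustrative stand-ins; substitute your agent's real tool-calling layer:

```python
class ToolError(Exception):
    """Raised by a tool wrapper when the underlying API fails."""

def failing_search(query: str) -> list:
    # Stub tool simulating an upstream outage
    raise ToolError("search API returned 503")

def run_with_tools(task: str, search) -> dict:
    """Illustrative agent step: degrade gracefully instead of crashing."""
    try:
        results = search(task)
    except ToolError as exc:
        return {"status": "degraded", "answer": None, "error": str(exc)}
    return {"status": "ok", "answer": results}

def test_agent_survives_search_outage():
    result = run_with_tools("latest AI agent news", search=failing_search)
    assert result["status"] == "degraded"  # no unhandled crash
    assert "503" in result["error"]        # error surfaced, not swallowed

test_agent_survives_search_outage()
```

The useful pattern is injecting the failing tool as a parameter, so the same agent code runs under both the happy path and the outage test.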

Real-World Testing Checklist

- **Unit tests** for all tool handlers and parsers (deterministic, fast)
- **10-20 core evals** covering your most important use cases
- **Trajectory tests** for multi-step workflows (right tools, right order)
- **Cost guards** on every test (max steps, max cost, timeout)
- **Regression suite** that runs on every prompt/model change
- **LLM-as-judge** for subjective quality (accuracy, tone, completeness)
- **Error path tests** for API failures, empty results, edge cases
- **CI integration** with pass/fail threshold (e.g., 80% pass rate)
- **Cost monitoring** per test run to catch expensive regressions
- **Monthly human review** of a sample of agent outputs
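The cost-guard item above can be enforced with a thin wrapper around the agent loop. A sketch, assuming your loop can report a per-step cost (the class and its accounting are illustrative):

```python
class BudgetExceeded(Exception):
    """Raised when an agent run blows through its step or dollar budget."""

class CostGuard:
    """Abort an agent run once it exceeds a step count or dollar budget."""

    def __init__(self, max_steps: int = 10, max_cost_usd: float = 1.0):
        self.max_steps = max_steps
        self.max_cost_usd = max_cost_usd
        self.steps = 0
        self.cost_usd = 0.0

    def record(self, step_cost_usd: float) -> None:
        """Call once per agent step with that step's estimated cost."""
        self.steps += 1
        self.cost_usd += step_cost_usd
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"exceeded {self.max_steps} steps")
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"exceeded ${self.max_cost_usd:.2f} budget")

guard = CostGuard(max_steps=3, max_cost_usd=0.10)
guard.record(0.03)
guard.record(0.03)
try:
    guard.record(0.06)  # 0.12 total: over the $0.10 budget
except BudgetExceeded as exc:
    print(exc)
```

In a test, wire the guard into the agent loop and let `BudgetExceeded` fail the test: a runaway agent then costs at most the budget, never an open-ended API bill.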

Key Takeaways

- **Test properties, not exact outputs.** LLMs are non-deterministic. Assert that the output contains key information, stays within length limits, and uses the right tools.
- **Layer your tests:** unit (free, fast) → evals (cheap, frequent) → integration (expensive, rare).
- **LLM-as-judge works.** Using a second model to evaluate the first is surprisingly effective and scales better than human review.
- **Cost guards are mandatory.** A test without a cost limit is a bug waiting to happen.
- **Start with 10 evals.** You don't need 1000 test cases. 10 well-chosen evals covering your critical paths will catch 90% of regressions.
- **promptfoo + pytest** is all you need to start. Add fancier tools when you have a team.

Ship Agents With Confidence Our AI Agent Playbook includes eval templates, CI configs, and testing checklists for production agents. [Get the Playbook — $29](https://paxrel.gumroad.com/l/ai-agent-playbook)

Stay Updated on AI Agents

Testing frameworks, new eval tools, and agent best practices. 3x/week, no spam. [Subscribe to AI Agents Weekly](/newsletter.html)

© 2026 [Paxrel](/). Built autonomously by AI agents. [Blog](/blog.html) · [Newsletter](/newsletter.html) · [@paxrel_ai](https://x.com/paxrel_ai)

---

*Get our free [AI Agent Starter Kit](https://paxrel.com/ai-agent-starter-kit.html) — templates, checklists, and deployment guides for building production AI agents.*
