How to Evaluate and Test AI Agents
Evaluating AI agents requires different metrics than evaluating LLMs. Here's how to measure task completion, trajectory quality, tool-use accuracy, and regression across agent systems.
Evaluating an LLM is hard enough. Evaluating an AI agent is harder, because the output isn’t just text. It’s a sequence of decisions: which tools to call, in what order, with what arguments, and whether the final result actually solves the user’s problem. A correct final answer reached through a wasteful 15-step path is a different kind of failure than a wrong answer reached in two steps.
If you’re building agents with LangChain, CrewAI, or LlamaIndex, or wiring your own with function calling, you need evaluation that covers the full agent behavior, not just the last line of output.
Why Agent Eval Is Different
Standard LLM evaluation asks: “Is this response correct and useful?” Agent evaluation asks several additional questions:
- Did the agent complete the task?
- Did it use the right tools?
- Did it pass correct arguments to those tools?
- Was the execution path efficient?
- Did it handle errors and edge cases?
- Would it produce the same result on the same input tomorrow?
A customer support agent that resolves a refund request by calling the right API with the right parameters is succeeding. One that calls the right API but passes the wrong order ID is failing, even if its final message to the user sounds helpful. The final message quality matters, but the tool execution accuracy matters more.
Task Completion Rate
The most important metric: did the agent accomplish what the user asked?
Define a set of test tasks with clear success criteria. Run the agent against them. Measure the binary pass rate.
```python
test_cases = [
    {
        "input": "Book a meeting with Sarah for tomorrow at 2pm",
        "success_criteria": lambda result: (
            result.meeting_created
            and result.attendee == "sarah@company.com"
            and result.time.hour == 14
        )
    },
    {
        "input": "What were our Q1 sales numbers?",
        "success_criteria": lambda result: (
            "Q1" in result.response
            and result.data_source == "sales_dashboard"
        )
    }
]
```
Start with 50-100 test cases that cover your agent’s core use cases. Run them after every prompt change, tool update, or model swap. A drop in task completion rate is the clearest signal that something broke.
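A runner for cases like these can be a few lines. Here's a minimal sketch; the `AgentResult` dataclass and `stub_agent` are stand-ins for whatever your agent actually returns, so one case deliberately fails:

```python
from dataclasses import dataclass

@dataclass
class AgentResult:
    # Minimal stand-in for whatever structured result your agent returns.
    meeting_created: bool = False
    response: str = ""
    data_source: str = ""

def run_suite(suite, run_agent):
    """Run each test case through the agent; return (pass_rate, failed_inputs)."""
    passed, failures = 0, []
    for case in suite:
        result = run_agent(case["input"])
        if case["success_criteria"](result):
            passed += 1
        else:
            failures.append(case["input"])
    return passed / len(suite), failures

suite = [
    {
        "input": "What were our Q1 sales numbers?",
        "success_criteria": lambda r: "Q1" in r.response and r.data_source == "sales_dashboard",
    },
    {
        "input": "Book a meeting with Sarah for tomorrow at 2pm",
        "success_criteria": lambda r: r.meeting_created,
    },
]

def stub_agent(user_input):
    # Stub that only handles the sales question, so the meeting case fails.
    if "Q1" in user_input:
        return AgentResult(response="Q1 sales were $1.2M", data_source="sales_dashboard")
    return AgentResult()

rate, failures = run_suite(suite, stub_agent)
print(f"pass rate: {rate:.0%}")  # 50%: the meeting case fails and is listed
```

The failures list is what you triage after each run; the pass rate is what you trend over time.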
Trajectory Evaluation
Task completion tells you if the agent succeeded. Trajectory evaluation tells you how.
A trajectory is the sequence of steps the agent took: the tools it called, the arguments it passed, the intermediate results it received, and the reasoning it produced between steps.
Evaluate trajectories on:
Efficiency. Did the agent take the shortest reasonable path, or did it wander? An agent that checks three irrelevant databases before finding the right one is completing the task but wasting time and money.
Correctness at each step. Even if the final result is right, intermediate errors matter. A tool call with the wrong parameters that happens to return a correct result (by coincidence) is a latent bug.
Unnecessary steps. Does the agent call tools it doesn’t need? Does it re-query information it already has? Unnecessary tool calls add latency and cost.
Compare agent trajectories against reference trajectories (the ideal step sequence for each test case) to score path quality.
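One lightweight way to score this, assuming you log tool names per step: check that the reference steps appear in order within the actual trajectory, and count extra calls. The tool names below are hypothetical:

```python
def score_trajectory(actual, reference):
    """Compare an agent's tool-call sequence against a reference trajectory.

    Returns (in_order, extra_steps): whether every reference step appears
    in order within the actual trajectory, and how many calls the agent
    made beyond the reference length.
    """
    it = iter(actual)
    # Subsequence check: each reference step must appear, in order,
    # somewhere in the actual trajectory (the iterator is consumed as we go).
    in_order = all(step in it for step in reference)
    extra_steps = max(0, len(actual) - len(reference))
    return in_order, extra_steps

reference = ["search_orders", "get_order_details", "issue_refund"]
actual = ["search_customers", "search_orders", "get_order_details", "issue_refund"]

in_order, extra = score_trajectory(actual, reference)
print(in_order, extra)  # True 1: right path reached, with one unnecessary detour
```

Stricter variants (exact match, weighted penalties per extra call) are easy to layer on once the logging is in place.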
Tool-Use Accuracy
For agents built on function calling, tool-use accuracy measures three things:
- Tool selection: Did the agent pick the right tool? If it should have called `search_orders` but called `search_customers`, that's a selection error.
- Argument accuracy: Did it pass the correct arguments? Wrong dates, malformed IDs, and missing required fields all count as argument errors.
- Execution order: For multi-step tasks, did it call tools in the right sequence? Some operations depend on the results of others.
Track these separately. An agent with 95% tool selection accuracy but 70% argument accuracy has a very specific problem you can fix with better tool descriptions or structured output constraints.
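Tracking the three separately can look like this sketch, where each logged tool call is a `(tool_name, args)` pair and the tool names are hypothetical:

```python
def tool_use_metrics(runs):
    """Aggregate tool selection, argument, and ordering accuracy separately.

    Each run is a dict with "expected" and "actual" lists of
    (tool_name, args_dict) pairs.
    """
    selection_hits = argument_hits = order_hits = total_calls = 0
    for run in runs:
        expected, actual = run["expected"], run["actual"]
        for (exp_tool, exp_args), (act_tool, act_args) in zip(expected, actual):
            total_calls += 1
            if act_tool == exp_tool:
                selection_hits += 1
                if act_args == exp_args:  # only count args when the tool was right
                    argument_hits += 1
        # Ordering: does the actual tool sequence match the expected one?
        if [t for t, _ in actual] == [t for t, _ in expected]:
            order_hits += 1
    return {
        "selection": selection_hits / total_calls,
        "arguments": argument_hits / total_calls,
        "order": order_hits / len(runs),
    }

runs = [
    {
        "expected": [("search_orders", {"order_id": "A123"})],
        "actual":   [("search_orders", {"order_id": "A124"})],  # right tool, wrong ID
    },
    {
        "expected": [("search_orders", {"order_id": "B456"})],
        "actual":   [("search_orders", {"order_id": "B456"})],
    },
]
print(tool_use_metrics(runs))
# selection 1.0, arguments 0.5, order 1.0: a pure argument-accuracy problem
```

The split immediately tells you where to intervene: selection errors point at tool descriptions, argument errors at schemas and output constraints, ordering errors at the orchestration prompt.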
LLM-as-Judge for Agent Responses
For the natural language parts of agent output (explanations, summaries, messages to users), use LLM-as-judge evaluation. Have a separate model score the agent’s final response on criteria like:
- Relevance: Does the response address what the user asked?
- Accuracy: Are the facts in the response correct given the data retrieved?
- Completeness: Did the agent cover all aspects of the request?
- Tone: Is the response appropriate for the context?
LLM-as-judge is imperfect but scalable. It catches obvious failures (hallucinated data, off-topic responses) reliably, and catches subtle issues (slightly misleading summaries) with moderate accuracy.
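A judge can be wired up with a rubric prompt and a JSON-scored response. Here's a sketch; `call_model` stands in for whatever function sends a prompt to your judge LLM, and the stub below fakes its output for demonstration:

```python
import json

JUDGE_PROMPT = """You are evaluating an AI agent's final response.

User request: {request}
Data the agent retrieved: {retrieved}
Agent response: {response}

Score each criterion from 1 to 5, then output JSON only:
{{"relevance": _, "accuracy": _, "completeness": _, "tone": _}}"""

def judge_response(request, retrieved, response, call_model):
    """Score a response on the four criteria using a separate judge model."""
    prompt = JUDGE_PROMPT.format(
        request=request, retrieved=retrieved, response=response
    )
    scores = json.loads(call_model(prompt))
    # Flag for human review rather than hard-failing on one low score.
    needs_review = any(v <= 2 for v in scores.values())
    return scores, needs_review

# Stubbed judge output for demonstration:
fake_judge = lambda prompt: '{"relevance": 5, "accuracy": 2, "completeness": 4, "tone": 5}'
scores, needs_review = judge_response(
    "What were our Q1 sales numbers?",
    "sales_dashboard: Q1 = $1.2M",
    "Q1 sales were $2.1M",
    fake_judge,
)
print(scores, needs_review)  # the accuracy score of 2 flags this response
```

Giving the judge the retrieved data alongside the response is what lets it catch the "slightly misleading summary" class of failure.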
Regression Testing
Agent behavior is fragile. A small prompt change can cascade through tool selection, argument formatting, and response quality. Model updates from providers can shift behavior without warning.
Build a regression suite:
- Golden test cases: 50-100 cases with known-good trajectories and outputs. Run after every change.
- Snapshot tests: Record the agent’s trajectory and output for a large set of inputs. On subsequent runs, flag any deviation from the snapshot. Not every change is a regression, but every change should be reviewed.
- A/B comparison: When changing prompts or models, run both versions on the same test set and compare metrics side by side before deploying.
Automate this. Run the regression suite in CI on every pull request that touches agent prompts, tool definitions, or orchestration logic. A 5% drop in task completion is easy to miss in manual testing but obvious in automated metrics.
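The snapshot comparison itself is a dict diff; persistence to a JSON file in your repo is the part CI adds. A minimal sketch, with hypothetical test inputs as keys:

```python
def diff_snapshots(baseline, current):
    """Return the test inputs whose trajectory or output deviates from the snapshot.

    Both arguments map a test input to {"trajectory": [...], "output": "..."}.
    Any deviation is flagged; a human decides whether it's a regression.
    """
    return sorted(key for key in current if baseline.get(key) != current[key])

baseline = {
    "refund request": {
        "trajectory": ["search_orders", "issue_refund"],
        "output": "Refund issued.",
    },
}
current = {
    "refund request": {
        "trajectory": ["search_customers", "search_orders", "issue_refund"],
        "output": "Refund issued.",
    },
}
print(diff_snapshots(baseline, current))  # ['refund request']: extra step flagged
```

In CI, load the baseline from a committed JSON file, fail the build on a non-empty diff, and update the snapshot only through an explicit, reviewed commit.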
Human-in-the-Loop Evaluation
Automated evaluation catches most issues, but some failures are only visible to humans. An agent that technically completes a task but does so in a confusing or frustrating way won’t show up in pass/fail metrics.
Sample 1-5% of production interactions for human review. Rate them on dimensions that automated metrics miss: user effort, clarity of communication, and appropriate escalation (did the agent hand off to a human when it should have?).
Human evaluation is expensive and slow. Use it to calibrate your automated metrics, not as your primary feedback loop. If human reviewers consistently disagree with your LLM-as-judge scores, adjust the judge’s criteria.
Build an Eval Harness
Tie all of this together in a reusable evaluation harness:
- Define test cases with inputs, expected tools, expected arguments, and success criteria.
- Run the agent in a sandboxed environment where tool calls are intercepted and logged (not executed against production systems).
- Score each run on task completion, trajectory efficiency, tool-use accuracy, and response quality.
- Output a summary report with pass rates, regressions, and flagged cases.
The harness becomes the source of truth for agent quality. Every prompt change, model upgrade, and architectural decision gets evaluated against it. Without it, you’re shipping agent changes on intuition. With it, you’re shipping on evidence.
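The four steps above can be sketched as a small harness skeleton. This assumes a `run_agent_sandboxed` function that intercepts tool calls instead of executing them; the stub and tool names here are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    input: str
    expected_tools: list      # tool names, in expected order
    success_criteria: object  # callable(result) -> bool

@dataclass
class EvalReport:
    total: int = 0
    passed: int = 0
    flagged: list = field(default_factory=list)  # inputs needing review

def run_harness(cases, run_agent_sandboxed):
    """Run every case through a sandboxed agent and build a summary report.

    run_agent_sandboxed(input) should return (result, tool_calls) with
    tool calls logged but not executed against production systems.
    """
    report = EvalReport(total=len(cases))
    for case in cases:
        result, tool_calls = run_agent_sandboxed(case.input)
        task_ok = case.success_criteria(result)
        tools_ok = tool_calls == case.expected_tools
        if task_ok and tools_ok:
            report.passed += 1
        else:
            report.flagged.append(case.input)
    return report

case = EvalCase(
    input="Book a meeting with Sarah for tomorrow at 2pm",
    expected_tools=["check_calendar", "create_event"],
    success_criteria=lambda r: r == "meeting created",
)

def sandboxed_stub(user_input):
    # Stand-in for an agent run with intercepted tool calls.
    return "meeting created", ["check_calendar", "create_event"]

report = run_harness([case], sandboxed_stub)
print(f"{report.passed}/{report.total} passed, flagged: {report.flagged}")
```

Trajectory efficiency and LLM-as-judge scores slot in as additional fields on `EvalReport` once the basic pass/flag loop is running.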
For the broader context of LLM evaluation approaches, including judge prompt design and scoring rubrics, see How to Evaluate AI Output (LLM-as-Judge Explained). For understanding how different frameworks handle agent orchestration (which affects what you need to evaluate), see Multi-Agent Systems Explained.