Chapter 15: Evaluation
Evaluation in Building Agentic AI Systems.
Learning Objectives
By the end of this chapter, you will be able to:
- Explain why agent evaluation must cover trajectories as well as final outputs.
- Build an evaluation harness that measures task success, steps, and cost for production-grade agent systems.
- Recognize operational trade-offs in tool use, orchestration, safety, and cost.
Trajectory-level vs. endpoint evaluation, benchmarks, LLM-as-judge, and building an eval harness
Why Standard NLP Metrics Fail for Agents
BLEU and ROUGE score n-gram overlap between a generated text and a reference, and perplexity measures how well a model predicts a token sequence. All three assess a single piece of text. None of them can capture whether an agent (1) called the right tools in the right order, (2) reached the correct goal state, (3) made efficient use of available information, or (4) failed gracefully under adversarial conditions.
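To make the gap concrete, here is a minimal sketch of a text-overlap score rating a failed task highly. The `unigram_f1` function is a simplified stand-in for ROUGE-1, not the official implementation, and the booking scenario and the `goal_reached` check are hypothetical.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1: a simplified stand-in for ROUGE-1 (assumption, not the official metric)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection counts shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical task: the user asked for a 3pm slot, the agent booked 5pm.
reference = "The meeting was booked for Tuesday at 3pm"
agent_output = "The meeting was booked for Tuesday at 5pm"

score = unigram_f1(agent_output, reference)
# Hypothetical goal-state check: did the booking land on the requested slot?
goal_reached = "3pm" in agent_output.lower().split()

print(f"overlap F1 = {score:.3f}, goal reached = {goal_reached}")
```

Seven of the eight tokens match, so the overlap score is high even though the agent missed the goal state.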
Endpoint Evaluation
- Did the agent reach the correct final state?
- Is the final answer correct or useful?
- Simple to measure; coarse-grained
- Doesn't reveal why the agent failed
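An endpoint check can be as small as comparing the final state against the fields the task cares about. The sketch below assumes the environment exposes its final state as a dictionary. The function name and the ticket example are hypothetical.

```python
def endpoint_success(final_state: dict, expected_state: dict) -> bool:
    """Return True if every expected key is present with the expected value.

    Extra keys in final_state are ignored, so the check constrains only
    the parts of the state that the task specifies.
    """
    return all(
        key in final_state and final_state[key] == value
        for key, value in expected_state.items()
    )


# Hypothetical task: close a support ticket and assign it to alice.
expected = {"ticket_status": "closed", "assignee": "alice"}
final_ok = {"ticket_status": "closed", "assignee": "alice", "comments": 3}
final_bad = {"ticket_status": "open", "assignee": "alice"}

print(endpoint_success(final_ok, expected))
print(endpoint_success(final_bad, expected))
```

The check reports pass or fail only. It says nothing about which step left the ticket open, which is the limitation noted in the last bullet.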
Trajectory Evaluation
- Were the intermediate steps correct?
- Was each tool call necessary and well-formed?
- More expensive; requires annotated reference trajectories
- Reveals failure points — enables targeted improvement
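One possible way to score a trajectory against an annotated reference is to compare the sequences of tool names. The sketch below uses the longest common subsequence to compute step precision and recall and reports the first step where the agent diverged. The metric names and the flight-booking tools are assumptions for illustration, not a standard.

```python
from typing import Optional


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence (classic dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def first_divergence(actual: list[str], reference: list[str]) -> Optional[int]:
    """Index of the first step that differs, or None if the sequences are identical."""
    for i, (a, r) in enumerate(zip(actual, reference)):
        if a != r:
            return i
    if len(actual) != len(reference):
        return min(len(actual), len(reference))
    return None


def trajectory_scores(actual: list[str], reference: list[str]) -> dict:
    matched = lcs_length(actual, reference)
    return {
        "exact_match": actual == reference,
        # Share of reference steps the agent performed, in order.
        "step_recall": matched / len(reference) if reference else 1.0,
        # Share of the agent's steps that were part of the reference.
        "step_precision": matched / len(actual) if actual else 0.0,
        "first_divergence": first_divergence(actual, reference),
    }


# Hypothetical tool names: the agent repeated a search and skipped the calendar check.
reference = ["search_flights", "check_calendar", "book_flight", "send_confirmation"]
actual = ["search_flights", "search_flights", "book_flight", "send_confirmation"]

scores = trajectory_scores(actual, reference)
print(scores)
```

Comparing tool names alone ignores arguments. A stricter variant would compare (name, arguments) pairs, at the cost of needing more detailed reference annotations.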
Key Agent Metrics
Agent Benchmarks
| Benchmark | What it tests | Format | Why it matters |
|---|---|---|---|
| GAIA | General AI assistant tasks: real-world questions requiring tool use | ~450 questions, 3 difficulty levels | Closest to real user tasks; covers web search, file reading, math |
| AgentBench | 8 real-world environments (web, database, OS, code) | Multi-step tasks per environment | Breadth across agent types and tool categories |
| SWE-bench | Resolve real GitHub issues in open-source Python repos | 2,294 issues in the full set (300 in SWE-bench Lite); measure % resolved | Gold standard for code agents; hard and grounded |
| WebArena | Navigate and complete tasks on realistic web environments | 812 tasks across 5 websites | Evaluates computer-use agents end-to-end |
| ATBench | Long-horizon safety evaluation over multi-turn trajectories | 1,000 trajectories, avg 9 turns | Catches risks that only emerge across multiple steps |
LLM-as-judge: where it works and where it lies
Where it works well: scoring free-form text quality (relevance, clarity, helpfulness), comparing two responses side by side, and checking output for safety violations. Where it fails: verifying factual correctness (the judge LLM can be wrong about the same facts), evaluating code correctness (run the tests instead), and producing consistent scores across sessions (sampling at nonzero temperature makes repeated judgments vary).
Always validate your LLM judge against human annotations on a sample set before using it for automated evaluation.
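A minimal sketch of that validation step, assuming both the judge and the human annotators give pass/fail verdicts on the same sample. It reports raw agreement and Cohen's kappa, which corrects agreement for chance. The labels below are made-up illustration data, not results.

```python
def agreement_rate(judge: list[bool], human: list[bool]) -> float:
    """Fraction of items where the judge and the human gave the same verdict."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[bool], human: list[bool]) -> float:
    """Chance-corrected agreement for two raters with binary labels."""
    observed = agreement_rate(judge, human)
    n = len(judge)
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    # Probability that both raters agree by chance, given their pass rates.
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        return 1.0  # convention: both raters gave one identical label to every item
    return (observed - expected) / (1 - expected)


# Made-up verdicts on a 10-item sample (True = pass).
human_labels = [True, True, True, False, False, True, False, True, True, False]
judge_labels = [True, True, True, True, False, True, False, True, False, False]

print(f"agreement = {agreement_rate(judge_labels, human_labels):.2f}")
print(f"kappa = {cohens_kappa(judge_labels, human_labels):.2f}")
```

What counts as acceptable agreement depends on the task. Decide the threshold before running the comparison, and inspect the disagreements by hand, since they often reveal a flaw in the judge prompt.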
Building an Eval Harness
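from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalTask:
    task_id: str
    input: str
    expected_output: str | None  # None for open-ended tasks with no single reference answer
    verify_fn: Callable[[str, str], bool] | None = None  # optional programmatic verifier


@dataclass
class EvalResult:
    task_id: str
    success: bool
    steps_taken: int
    cost_usd: float
    final_output: str
    failure_mode: str | None = None


class AgentEvalHarness:
    def __init__(self, agent_fn: Callable[[str], str], judge_llm=None) -> None:
        self.agent = agent_fn
        self.judge = judge_llm  # assumed to be a callable: prompt str -> reply str

    def run(self, tasks: list[EvalTask]) -> dict:
        if not tasks:
            raise ValueError("tasks must not be empty")
        results = [self._run_one(task) for task in tasks]
        successes = [r for r in results if r.success]
        return {
            "task_success_rate": len(successes) / len(results),
            # Averaged over successful runs only, to match the metric name.
            "avg_steps_to_success": (
                sum(r.steps_taken for r in successes) / len(successes) if successes else None
            ),
            "avg_cost_per_task_usd": sum(r.cost_usd for r in results) / len(results),
            "failure_modes": self._count_failure_modes(results),
        }

    def _run_one(self, task: EvalTask) -> EvalResult:
        output, steps, cost = self._run_agent_with_instrumentation(task.input)
        if task.verify_fn and task.expected_output is not None:
            success = task.verify_fn(output, task.expected_output)
        elif task.expected_output is not None:
            success = self._llm_judge(task.input, output, task.expected_output)
        else:
            success = True  # no expected output: assume success if the agent completed
        return EvalResult(
            task_id=task.task_id,
            success=success,
            steps_taken=steps,
            cost_usd=cost,
            final_output=output,
            # Coarse placeholder label (assumption); refine with your own taxonomy.
            failure_mode=None if success else "wrong_output",
        )

    def _run_agent_with_instrumentation(self, task_input: str) -> tuple[str, int, float]:
        # Placeholder (assumption): agent_fn returns only text, so steps and cost
        # are reported as 1 and 0.0. Replace with your framework's tracing hooks.
        return self.agent(task_input), 1, 0.0

    def _llm_judge(self, task_input: str, output: str, expected: str) -> bool:
        # Sketch (assumption): the judge replies with text starting with PASS or FAIL.
        if self.judge is None:
            raise ValueError("judge_llm is required when a task has no verify_fn")
        prompt = (
            f"Task: {task_input}\nExpected: {expected}\nActual: {output}\n"
            "Reply PASS if the actual output is correct, otherwise FAIL."
        )
        return self.judge(prompt).strip().upper().startswith("PASS")

    def _count_failure_modes(self, results: list[EvalResult]) -> dict[str, int]:
        # Tally the failure labels of unsuccessful runs.
        return dict(Counter(r.failure_mode for r in results if r.failure_mode))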
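The `verify_fn` field on `EvalTask` accepts any function that takes the agent output and the expected output and returns a boolean. Two possible verifiers are sketched below. The function names, the normalization rules, and the tolerance are assumptions chosen for illustration.

```python
import math
import re


def normalized_exact_match(output: str, expected: str) -> bool:
    """Exact match after lowercasing and collapsing whitespace."""
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    return normalize(output) == normalize(expected)


def numeric_match(output: str, expected: str, rel_tol: float = 1e-6) -> bool:
    """Compare the last number in the output with the expected number.

    Thousands separators are stripped first. Using the last number is a
    heuristic for answers phrased as "... so the total is 42".
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", output.replace(",", ""))
    if not numbers:
        return False
    return math.isclose(float(numbers[-1]), float(expected), rel_tol=rel_tol)


print(normalized_exact_match("  Paris ", "paris"))
print(numeric_match("The total is 1,250.50 USD", "1250.5"))
print(numeric_match("About 42 items remain.", "43"))
```

Programmatic verifiers like these are deterministic and free to run, so they are a reasonable first choice whenever the task has a checkable answer, with the LLM judge reserved for outputs that cannot be checked mechanically.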
Golden trajectories
A golden trajectory is a human-annotated sequence of (Thought, Action, Observation) turns that represents the correct way to solve a task. Golden trajectories are expensive to create but invaluable: they enable trajectory-level evaluation, supply few-shot examples for prompts, and support regression testing. Start with 20–50 golden trajectories for your most critical task types.
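A golden trajectory can be stored as a small immutable record and reused for two of the purposes above. The sketch below is one possible representation: the class names, the prompt layout, and the example task are assumptions, and `actions_match` is a deliberately strict regression check that requires the same actions in the same order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    thought: str
    action: str
    observation: str


@dataclass(frozen=True)
class GoldenTrajectory:
    task: str
    turns: tuple[Turn, ...]
    final_answer: str


def to_few_shot(trajectory: GoldenTrajectory) -> str:
    """Render a golden trajectory as a few-shot example (layout is an assumption)."""
    lines = [f"Task: {trajectory.task}"]
    for turn in trajectory.turns:
        lines.append(f"Thought: {turn.thought}")
        lines.append(f"Action: {turn.action}")
        lines.append(f"Observation: {turn.observation}")
    lines.append(f"Final answer: {trajectory.final_answer}")
    return "\n".join(lines)


def actions_match(trajectory: GoldenTrajectory, agent_actions: list[str]) -> bool:
    """Strict regression check: same actions, same order."""
    return [turn.action for turn in trajectory.turns] == agent_actions


# Hypothetical golden trajectory for a two-step lookup task.
golden = GoldenTrajectory(
    task="What is the population of the capital of France?",
    turns=(
        Turn("I need the capital first.", "search('capital of France')", "Paris"),
        Turn("Now look up its population.", "search('population of Paris')", "about 2.1 million"),
    ),
    final_answer="About 2.1 million",
)

print(to_few_shot(golden))
print(actions_match(golden, ["search('capital of France')", "search('population of Paris')"]))
```

Exact action matching will flag any valid alternative path as a regression. If several solution paths are acceptable, store multiple golden trajectories per task or relax the comparison.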
Chapter 15 Quiz
1. Why is trajectory evaluation more valuable than endpoint evaluation for improving an agent?
2. For which task should you use programmatic testing instead of LLM-as-judge?
3. What is the primary purpose of "golden trajectories" in agent evaluation?