Learning Objectives

By the end of this chapter, you will be able to:

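Explain why evaluating agents differs from evaluating single-turn text generation, and how endpoint and trajectory-level evaluation complement each other.
Apply agent metrics, benchmarks, and an eval harness to design reliable, production-grade agent systems.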
Recognize operational trade-offs in tool use, orchestration, safety, and cost.

Section 4 — Production Engineering

Chapter 15: Evaluation

Trajectory-level vs. endpoint evaluation, benchmarks, LLM-as-judge, and building an eval harness

Why Standard NLP Metrics Fail for Agents

BLEU and ROUGE score the n-gram overlap between generated text and a reference, and perplexity measures how well a model predicts a token sequence. All three judge text in a single-turn setting. They cannot capture whether an agent: (1) called the right tools in the right order, (2) reached the correct goal state, (3) made efficient use of available information, or (4) failed gracefully under adversarial conditions.
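To make the gap concrete, here is a minimal sketch that uses a hand-rolled unigram-overlap F1 score. It is a simplified stand-in for ROUGE-1, not the official implementation, and the unigram_f1 helper and the example sentences are illustrative assumptions. The run that books the wrong date scores higher than the run that books the right one.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1 over lowercase whitespace tokens (illustrative only)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "booked flight UA100 to Boston on 12 March"

# Wrong date: the goal state was NOT reached, yet almost every token matches.
wrong_run = "booked flight UA100 to Boston on 13 March"
# Correct booking, phrased differently from the reference.
right_run = "Your seat on UA100 to Boston is confirmed for 12 March"

score_wrong = unigram_f1(wrong_run, reference)
score_right = unigram_f1(right_run, reference)

print(f"wrong run: {score_wrong:.3f}  right run: {score_right:.3f}")
```

An overlap metric ranks these two runs in the opposite order from task success, which is why agent evaluation needs to check the goal state directly.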

Endpoint Evaluation

Did the agent reach the correct final state?
Is the final answer correct or useful?
Simple to measure; coarse-grained
Doesn't reveal why the agent failed

Trajectory Evaluation

Were the intermediate steps correct?
Was each tool call necessary and well-formed?
More expensive; requires annotated reference trajectories
Reveals failure points — enables targeted improvement

Key Agent Metrics

1. Task Success Rate (TSR): Fraction of tasks completed correctly. This is the primary metric. It must be measured on a held-out task set, not the training tasks.
2. Steps to Success: Average number of (Thought, Action, Observation) turns required to complete a task. Lower = more efficient. An agent that needs 30 steps for a task that takes a human 5 steps has a problem.
3. Error Rate: Fraction of steps where the agent takes a clearly wrong action: wrong tool, malformed arguments, or unnecessary call. Measured via reference trajectory comparison.
4. Cost per Task: Total LLM tokens + tool call costs per successful task completion. Critical for production viability; a task that costs $2 to complete is often not economically viable.
5. Failure Mode Distribution: Categorize failures: wrong tool, hallucinated tool result, infinite loop, context overflow, refusal. Different categories require different fixes.

Agent Benchmarks

Benchmark	What it tests	Format	Why it matters
GAIA	General AI assistant tasks: real-world questions requiring tool use	~450 questions, 3 difficulty levels	Closest to real user tasks; covers web search, file reading, math
AgentBench	8 real-world environments (web, database, OS, code)	Multi-step tasks per environment	Breadth across agent types and tool categories
SWE-bench	Resolve real GitHub issues in open-source Python repos	2,294 issues from 12 repos (300 in the SWE-bench Lite subset); measure % resolved	Gold standard for code agents; hard and grounded
WebArena	Navigate and complete tasks on realistic web environments	812 tasks across 5 websites	Evaluates computer-use agents end-to-end
ATBench	Long-horizon safety evaluation over multi-turn trajectories	1,000 trajectories, avg 9 turns	Catches risks that only emerge across multiple steps

LLM-as-judge: where it works and where it lies

Where it works well: scoring free-form text quality (relevance, clarity, helpfulness), comparing two responses side by side, and checking output for safety violations. Where it fails: verifying factual correctness (the judge LLM can be wrong about the same facts as the agent), evaluating code correctness (run tests instead), and producing consistent scores across sessions (sampling at a nonzero temperature makes repeated judgments vary).

Always validate your LLM judge against human annotations on a sample set before using it for automated evaluation.
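One way to run that validation is to compare the judge's pass/fail verdicts with human labels on the same sample and report both raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The sketch below assumes binary labels; the function name and the example labels are made up for illustration, and what counts as an acceptable kappa is a decision for your team.

```python
import math


def agreement_and_kappa(judge: list[bool], human: list[bool]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two binary raters."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge = sum(judge) / n   # fraction the judge marked as pass
    p_human = sum(human) / n   # fraction humans marked as pass
    # Probability that both raters agree purely by chance.
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        # Both raters used a single label throughout; kappa is undefined.
        return observed, math.nan
    return observed, (observed - expected) / (1 - expected)


human_labels = [True, True, True, False, False, False, True, False, True, False]
judge_labels = [True, True, False, False, False, True, True, False, True, False]

agreement, kappa = agreement_and_kappa(judge_labels, human_labels)
print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

Raw agreement alone can look high when most tasks pass, so kappa is the more informative of the two numbers on imbalanced samples.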

Building an Eval Harness

python — minimal agent eval harness

from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalTask:
    task_id: str
    input: str
    expected_output: str | None   # None for tasks with open-ended correct answers
    verify_fn: Callable[[str, str], bool] | None = None  # optional programmatic verifier

@dataclass
class EvalResult:
    task_id: str
    success: bool
    steps_taken: int
    cost_usd: float
    final_output: str
    failure_mode: str | None = None


class AgentEvalHarness:
    def __init__(self, agent_fn: Callable[[str], str], judge_llm=None) -> None:
        self.agent = agent_fn
        self.judge = judge_llm

    def run(self, tasks: list[EvalTask]) -> dict:
        results = []
        for task in tasks:
            result = self._run_one(task)
            results.append(result)

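        if not results:
            raise ValueError("tasks must contain at least one EvalTask")

        successes = [r for r in results if r.success]
        tsr = len(successes) / len(results)
        # Steps are averaged over successful runs only, as the metric name says;
        # None signals that no task succeeded.
        avg_steps = (
            sum(r.steps_taken for r in successes) / len(successes) if successes else None
        )
        # Cost is averaged over every attempted task, including failures.
        avg_cost = sum(r.cost_usd for r in results) / len(results)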

        return {
            "task_success_rate": tsr,
            "avg_steps_to_success": avg_steps,
            "avg_cost_per_task_usd": avg_cost,
            "failure_modes": self._count_failure_modes(results),
        }

    def _run_one(self, task: EvalTask) -> EvalResult:
        output, steps, cost = self._run_agent_with_instrumentation(task.input)

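        if task.verify_fn is not None:
            # A programmatic verifier takes precedence; it receives "" when the
            # task has no reference output.
            success = task.verify_fn(output, task.expected_output or "")
        elif task.expected_output is not None:
            success = self._llm_judge(task.input, output, task.expected_output)
        else:
            # No verifier and no reference: this only records that the agent
            # completed, so such tasks inflate the task success rate.
            success = True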

        return EvalResult(
            task_id=task.task_id,
            success=success,
            steps_taken=steps,
            cost_usd=cost,
            final_output=output,
        )

Golden trajectories

A golden trajectory is a human-annotated sequence of (Thought, Action, Observation) turns that represents the correct way to solve a task. Golden trajectories are expensive to create but invaluable: they enable trajectory-level evaluation, supply few-shot examples for prompts, and support regression testing. Start with 20–50 golden trajectories for your most critical task types.
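As a rough illustration of two of those uses, the sketch below stores a golden trajectory as a small data structure, renders it as a few-shot example, and runs an exact-match regression check on the action sequence. The Turn and GoldenTrajectory types, the rendered prompt layout, and the example task are assumptions; exact matching is deliberately strict and may need loosening when several action orders are acceptable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    thought: str
    action: str        # e.g. 'search("refund policy")'
    observation: str


@dataclass(frozen=True)
class GoldenTrajectory:
    task: str
    turns: tuple[Turn, ...]
    final_answer: str


def to_few_shot(traj: GoldenTrajectory) -> str:
    """Render the trajectory as a text block suitable for a few-shot prompt."""
    lines = [f"Task: {traj.task}"]
    for t in traj.turns:
        lines += [
            f"Thought: {t.thought}",
            f"Action: {t.action}",
            f"Observation: {t.observation}",
        ]
    lines.append(f"Final Answer: {traj.final_answer}")
    return "\n".join(lines)


def matches_golden(traj: GoldenTrajectory, actual_actions: list[str]) -> bool:
    """Regression check: the agent must reproduce the golden action sequence exactly."""
    return [t.action for t in traj.turns] == actual_actions


golden = GoldenTrajectory(
    task="What is the refund window?",
    turns=(
        Turn("I should look up the policy.", 'search("refund policy")', "Refunds within 30 days."),
    ),
    final_answer="30 days",
)

prompt_block = to_few_shot(golden)
print(prompt_block)
```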