Course: Building Agentic AI Systems | Chapter: 15 | Difficulty: advanced | Estimated time: 600 min

Chapter 15: Evaluation

Evaluation in Building Agentic AI Systems.

Learning Objectives

By the end of this chapter, you will be able to:

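  • Explain how endpoint and trajectory-level evaluation differ, and why single-turn NLP metrics fall short for agents.
  • Apply agent metrics, benchmarks, and an eval harness to design reliable, production-grade agent systems.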
  • Recognize operational trade-offs in tool use, orchestration, safety, and cost.

Trajectory-level vs. endpoint evaluation, benchmarks, LLM-as-judge, and building an eval harness

Why Standard NLP Metrics Fail for Agents

BLEU, ROUGE, and perplexity measure the quality of text generation in a single-turn setting. They cannot capture whether an agent: (1) called the right tools in the right order, (2) reached the correct goal state, (3) made efficient use of available information, or (4) failed gracefully under adversarial conditions.

Endpoint Evaluation

  • Did the agent reach the correct final state?
  • Is the final answer correct or useful?
  • Simple to measure; coarse-grained
  • Doesn't reveal why the agent failed

Trajectory Evaluation

  • Were the intermediate steps correct?
  • Was each tool call necessary and well-formed?
  • More expensive; requires annotated reference trajectories
  • Reveals failure points — enables targeted improvement
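A minimal sketch of trajectory-level scoring, assuming each step is logged as a tool name plus arguments and that a reference trajectory exists for the task. The `Step` type and the `score_trajectory` function are hypothetical names for illustration. The alignment uses longest common subsequence over tool names, which is one simple choice among several.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    tool: str
    args: tuple = ()  # hashable stand-in for the tool arguments


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two tool-name lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def score_trajectory(actual: list[Step], reference: list[Step]) -> dict:
    actual_tools = [s.tool for s in actual]
    ref_tools = [s.tool for s in reference]
    matched = lcs_length(actual_tools, ref_tools)
    return {
        "exact_match": actual == reference,
        # Share of reference steps the agent reproduced in order.
        "recall": matched / len(ref_tools) if ref_tools else 1.0,
        # Share of the agent's steps that align with the reference.
        "precision": matched / len(actual_tools) if actual_tools else 0.0,
        "extra_steps": len(actual_tools) - matched,
    }


reference = [Step("search"), Step("open_page"), Step("answer")]
actual = [Step("search"), Step("search"), Step("open_page"), Step("answer")]
scores = score_trajectory(actual, reference)
```

Here the agent reaches the same end state as the reference but repeats a search. Endpoint evaluation would score this run as a plain success, while the trajectory score surfaces the redundant call.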

Key Agent Metrics

1. Task Success Rate (TSR): Fraction of tasks completed correctly. This is the primary metric. It must be measured on a held-out task set, not the training tasks.
2. Steps to Success: Average number of (Thought, Action, Observation) turns required to complete a task. Lower is more efficient. An agent that needs 30 steps for a task that takes a human 5 steps has a problem.
3. Error Rate: Fraction of steps where the agent takes a clearly wrong action: wrong tool, malformed arguments, or unnecessary call. Measured by comparison against a reference trajectory.
4. Cost per Task: Total LLM token cost plus tool call costs per successful task completion. Critical for production viability: a task that costs $2 to complete is often not economically viable unless each completion is worth more than that.
5. Failure Mode Distribution: Categorize failures as wrong tool, hallucinated tool result, infinite loop, context overflow, or refusal. Different categories require different fixes.

Agent Benchmarks

| Benchmark | What it tests | Format | Why it matters |
|---|---|---|---|
| GAIA | General AI assistant tasks: real-world questions requiring tool use | ~450 questions, 3 difficulty levels | Closest to real user tasks; covers web search, file reading, math |
| AgentBench | 8 real-world environments (web, database, OS, code) | Multi-step tasks per environment | Breadth across agent types and tool categories |
| SWE-bench | Resolve real GitHub issues in open-source Python repos | 300+ issues; measure % resolved | Gold standard for code agents; hard and grounded |
| WebArena | Navigate and complete tasks on realistic web environments | 812 tasks across 5 websites | Evaluates computer-use agents end-to-end |
| ATBench | Long-horizon safety evaluation over multi-turn trajectories | 1,000 trajectories, avg 9 turns | Catches risks that only emerge across multiple steps |

LLM-as-judge: where it works and where it lies

Where it works well: scoring free-form text quality (relevance, clarity, helpfulness), comparing two responses side by side, and checking output for safety violations. Where it fails: verifying factual correctness (the judge LLM can be wrong about the same facts as the agent), evaluating code correctness (run tests instead), and producing consistent scores across sessions (sampling temperature makes scores vary between runs).

Always validate your LLM judge against human annotations on a sample set before using it for automated evaluation.
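One way to run that validation is to compare judge verdicts with human labels on the same sample and compute raw agreement plus Cohen's kappa, which corrects for the agreement expected by chance. The function name and the pass/fail boolean encoding below are assumptions for illustration.

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict:
    """Compare LLM-judge verdicts against human pass/fail labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Chance agreement: both say pass or both say fail, if labels were independent.
    p_human = sum(human) / n
    p_judge = sum(judge) / n
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)

    # Kappa is undefined when expected agreement is 1; report 1.0 in that case.
    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)
    return {"agreement": observed, "cohens_kappa": kappa}


human_labels = [True, True, False, False, True, False, True, True, False, True]
judge_labels = [True, True, False, True, True, False, True, False, False, True]
stats = judge_agreement(human_labels, judge_labels)
```

In this toy sample the judge agrees with the human on 8 of 10 items, but kappa is noticeably lower than 0.8 because much of that agreement would occur by chance. What threshold counts as acceptable depends on the task and is left to the reader.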

Building an Eval Harness

python — minimal agent eval harness
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalTask:
    task_id: str
    input: str
    expected_output: str | None   # None for tasks with open-ended correct answers
    verify_fn: Callable[[str, str], bool] | None = None  # optional programmatic verifier

@dataclass
class EvalResult:
    task_id: str
    success: bool
    steps_taken: int
    cost_usd: float
    final_output: str
    failure_mode: str | None = None


class AgentEvalHarness:
    def __init__(self, agent_fn: Callable[[str], str], judge_llm=None) -> None:
        self.agent = agent_fn
        self.judge = judge_llm

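    def run(self, tasks: list[EvalTask]) -> dict:
        if not tasks:
            raise ValueError("tasks must be non-empty")

        results = [self._run_one(task) for task in tasks]
        successes = [r for r in results if r.success]

        tsr = len(successes) / len(results)
        # Steps to success is averaged over successful tasks only;
        # it is None when no task succeeded.
        avg_steps = (
            sum(r.steps_taken for r in successes) / len(successes) if successes else None
        )
        # Cost is averaged over all tasks, including failed runs.
        avg_cost = sum(r.cost_usd for r in results) / len(results)

        return {
            "task_success_rate": tsr,
            "avg_steps_to_success": avg_steps,
            "avg_cost_per_task_usd": avg_cost,
            "failure_modes": self._count_failure_modes(results),
        }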

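    def _run_one(self, task: EvalTask) -> EvalResult:
        output, steps, cost = self._run_agent_with_instrumentation(task.input)

        if task.verify_fn is not None:
            # A programmatic verifier takes priority, even without an expected output.
            success = task.verify_fn(output, task.expected_output or "")
        elif task.expected_output is not None:
            success = self._llm_judge(task.input, output, task.expected_output)
        else:
            # No reference and no verifier: this records completion, not correctness.
            success = True

        return EvalResult(
            task_id=task.task_id,
            success=success,
            steps_taken=steps,
            cost_usd=cost,
            final_output=output,
            failure_mode=None if success else "unclassified",
        )

    # The helpers below are illustrative placeholders (assumptions), not part of
    # any specific framework. Replace them with your own instrumentation and judge.

    def _run_agent_with_instrumentation(self, task_input: str) -> tuple[str, int, float]:
        # Assumes agent_fn returns only the final text, so steps and cost are
        # reported as 0 here; read real values from your agent's trace or usage logs.
        return self.agent(task_input), 0, 0.0

    def _llm_judge(self, task_input: str, output: str, expected: str) -> bool:
        # Assumes judge_llm is a callable that maps a prompt string to a reply string.
        if self.judge is None:
            return output.strip() == expected.strip()  # fallback: exact match
        prompt = (
            f"Task: {task_input}\nExpected: {expected}\nActual: {output}\n"
            "Reply PASS if the actual answer matches the expected one, else FAIL."
        )
        return self.judge(prompt).strip().upper().startswith("PASS")

    def _count_failure_modes(self, results: list[EvalResult]) -> dict[str, int]:
        counts: dict[str, int] = {}
        for r in results:
            if not r.success:
                mode = r.failure_mode or "unclassified"
                counts[mode] = counts.get(mode, 0) + 1
        return counts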

Golden trajectories

A golden trajectory is a human-annotated sequence of (Thought, Action, Observation) turns that represents a correct way to solve a task. Golden trajectories are expensive to create but invaluable: they enable trajectory-level evaluation, supply few-shot examples for prompts, and support regression testing. Start with 20–50 golden trajectories for your most critical task types.
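A small sketch of how one golden trajectory might be stored and reused, assuming the (Thought, Action, Observation) structure described above. The `Turn` and `GoldenTrajectory` types, the rendering format, and the example task are hypothetical; adapt them to your agent's actual prompt format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    thought: str
    action: str       # e.g. 'search("capital of France")'
    observation: str


@dataclass(frozen=True)
class GoldenTrajectory:
    task: str
    turns: tuple[Turn, ...]
    final_answer: str


def render_few_shot(traj: GoldenTrajectory) -> str:
    """Format a golden trajectory as a few-shot example for a prompt."""
    lines = [f"Task: {traj.task}"]
    for t in traj.turns:
        lines += [
            f"Thought: {t.thought}",
            f"Action: {t.action}",
            f"Observation: {t.observation}",
        ]
    lines.append(f"Final Answer: {traj.final_answer}")
    return "\n".join(lines)


def regression_check(traj: GoldenTrajectory, agent_actions: list[str]) -> bool:
    """True if the agent's action sequence matches the golden one exactly."""
    return agent_actions == [t.action for t in traj.turns]


golden = GoldenTrajectory(
    task="What is the capital of France?",
    turns=(
        Turn(
            thought="I should look this up.",
            action='search("capital of France")',
            observation="Paris is the capital of France.",
        ),
    ),
    final_answer="Paris",
)
prompt_example = render_few_shot(golden)
```

Exact matching on actions is the strictest possible regression check and will flag harmless variations such as reworded search queries. A looser comparison, for example on tool names only, may be more practical once the suite grows.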

Chapter 15 Quiz

1. Why is trajectory evaluation more valuable than endpoint evaluation for improving an agent?

2. For which task should you use programmatic testing instead of LLM-as-judge?

3. What is the primary purpose of "golden trajectories" in agent evaluation?