Chapter 7: Agent Evaluation & Monitoring
Measuring Performance
Learning Objectives
- Understand agent evaluation & monitoring fundamentals
- Master the mathematical foundations
- Learn practical implementation
- Apply knowledge through examples
- Recognize real-world applications
Introduction
This chapter provides comprehensive coverage of agent evaluation and monitoring: the key metrics to track, their mathematical formulations, a reference implementation, and real-world examples.
📚 Why This Matters
Understanding agent evaluation & monitoring is crucial for mastering modern AI systems. This chapter breaks down complex concepts into digestible explanations with step-by-step examples.
Key Concepts
Agent Evaluation Metrics
Task success rate: Percentage of tasks completed successfully
Response quality:
- Accuracy: Correctness of agent outputs
- Relevance: How well the output addresses the task
- Completeness: Whether all requirements are met
Efficiency metrics:
- Latency: Time to complete task
- Token usage: Cost per task
- Tool calls: Number of tool invocations
Monitoring Agent Behavior
What to monitor:
- Decision patterns: Which actions the agent chooses
- Tool usage: Which tools are used most often
- Error rates: Frequency and types of errors
- Loop detection: Infinite loops or repetitive behavior
- Cost tracking: API calls and token usage
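The loop-detection item above can be illustrated with a minimal sketch that flags an agent repeating the same action too many times in a row. The action representation (one string per step) and the `max_repeats` threshold are assumptions for illustration, not a fixed convention.

```python
def detect_loop(actions, max_repeats=3):
    """Return True if any action repeats more than max_repeats times in a row."""
    run = 1
    for prev, cur in zip(actions, actions[1:]):
        run = run + 1 if cur == prev else 1  # extend or reset the current run
        if run > max_repeats:
            return True
    return False
```

A production monitor would typically also catch longer cycles (A → B → A → B), for example by hashing recent windows of actions.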
Evaluation Strategies
Automated evaluation: Use LLMs or rule-based systems to score agent outputs
Human evaluation: Human reviewers assess quality (gold standard but expensive)
Hybrid evaluation: Combine automated and human evaluation
A/B testing: Compare different agent configurations
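The automated-evaluation strategy can be as simple as a rule-based scorer. The sketch below is one illustrative approach; the keyword and length checks are assumptions for this example, not a standard, and many systems use an LLM judge instead.

```python
def rule_based_score(output, required_keywords, max_length=2000):
    """Score an agent output in [0, 1] using simple rule-based checks."""
    if not output or len(output) > max_length:
        return 0.0  # empty or excessively long outputs fail outright
    # Fraction of required keywords present (case-insensitive)
    hits = sum(1 for kw in required_keywords if kw.lower() in output.lower())
    return hits / len(required_keywords) if required_keywords else 1.0

score = rule_based_score("Paris is the capital of France.", ["Paris", "France"])
```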
Mathematical Formulations
Task Success Rate
\[ \text{Success Rate} = \frac{\text{Successful Tasks}}{\text{Total Tasks}} \times 100\% \]
What This Measures
This formula calculates the percentage of tasks that are completed successfully by the agent system. It measures overall agent reliability and system effectiveness. A high success rate indicates that agents consistently complete tasks correctly, while a low rate suggests problems that need investigation.
Breaking It Down
- Successful Tasks: The number of tasks completed successfully, i.e. tasks where the agent provided a correct answer, the task was completed as expected, the output met quality standards, and the user's goal was achieved. Failed tasks include incorrect answers, incomplete tasks, errors that prevented completion, and outputs that don't meet requirements.
- Total Tasks: Total number of tasks attempted - all tasks the system tried to complete, including both successful and failed ones. This is the denominator that provides context for the success rate.
- Success Rate: Percentage (0-100%) - the fraction of tasks that succeeded, expressed as a percentage. Higher values (closer to 100%) indicate better reliability. Typical targets: 95%+ for production systems, 99%+ for critical applications.
Where This Is Used
This metric is calculated periodically (daily, weekly) to monitor agent system health. It's used to: (1) track system reliability over time (is success rate improving or degrading?), (2) identify problem areas (which task types have low success rates?), (3) set quality targets (what success rate should we aim for?), (4) evaluate improvements (did changes increase success rate?), and (5) alert on degradation (success rate dropping indicates problems). This is a key performance indicator for production agent systems.
Why This Matters
Task success rate is a fundamental measure of agent system quality. Low success rates indicate: agents are making mistakes, tasks are too complex, system has bugs, or agents lack necessary capabilities. High success rates indicate: agents are reliable, system is well-designed, and users can trust the system. Monitoring success rate helps: identify issues early (catch problems before they impact many users), measure improvement (quantify impact of optimizations), set expectations (users know what to expect), and ensure quality (maintain standards). For production systems, maintaining high success rates (95%+) is essential for user trust and system adoption.
Example Calculation
Given: Agent system over 1 week
- Total Tasks = 1000 tasks attempted
- Successful Tasks = 950 tasks completed correctly
- Failed Tasks = 50 (errors, incorrect answers, incomplete)
Step 1: Calculate success rate = (950 / 1000) × 100% = 95%
Result: Success Rate = 95%
Analysis: 95% success rate is good for production - most tasks succeed, but 5% failure rate indicates room for improvement. Failed tasks should be analyzed to identify patterns (which task types fail? what errors occur?).
Target: Aim for 98%+ by: improving error handling, optimizing agent capabilities, refining task definitions, and learning from failures.
Interpretation: The system successfully completes 95% of tasks, indicating reliable operation. The 5% failure rate suggests some tasks are challenging or there are edge cases to handle. This metric helps track system health and guide improvements.
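The calculation above is trivial to automate. This sketch reproduces the worked example, guarding against division by zero for the case where no tasks have run yet.

```python
def success_rate(successful, total):
    """Success rate as a percentage; 0.0 when no tasks have run."""
    return 100.0 * successful / total if total else 0.0

rate = success_rate(successful=950, total=1000)  # 95.0
```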
Average Task Latency
\[ \text{Avg Latency} = \frac{1}{N} \sum_{i=1}^{N} T_i \]
What This Measures
This formula calculates the average time it takes for the agent system to complete tasks. It measures system responsiveness by averaging the completion times of all tasks. Lower average latency indicates faster system performance, which is crucial for user experience in production systems.
Breaking It Down
- N: Number of tasks - the total count of tasks included in the latency calculation (e.g., 1000 tasks over a day, all tasks in a time period). A larger N provides a more representative average, while a smaller N may be skewed by outliers.
- T_i: Time to complete task i - the latency for each individual task, measured from when the task starts (user submits request) until it completes (agent provides final response). T_i includes: agent reasoning time, tool execution time, LLM generation time, and any coordination overhead. Each task may have different complexity, leading to different completion times.
- \(\sum_{i=1}^{N} T_i\): Sum of all task completion times - the total time spent on all N tasks. This aggregates all individual latencies into a single value.
- \(\frac{1}{N} \sum_{i=1}^{N} T_i\): Average (mean) latency - dividing the sum by N gives the average time per task. This provides a single metric representing typical task completion time. The average helps understand typical user experience, though it may be affected by outliers (very slow tasks).
- Lower is better: Reduced average latency means faster responses, better user experience, and higher system throughput. Typical targets: <2 seconds for simple tasks, <10 seconds for complex tasks, <30 seconds for very complex multi-step tasks.
Where This Is Used
This metric is calculated continuously to monitor system performance. It's used to: (1) track system speed over time (is latency increasing or decreasing?), (2) identify performance issues (sudden latency spikes indicate problems), (3) evaluate optimizations (did changes reduce latency?), (4) set performance targets (what latency should we aim for?), and (5) compare system versions (is new version faster?). This is a critical metric for production systems where user experience depends on response speed.
Why This Matters
Average latency directly impacts user experience and system adoption. High latency leads to: poor user experience (users wait too long), reduced throughput (system handles fewer requests), user abandonment (users give up waiting), and higher costs (longer-running tasks consume more resources). Low latency ensures: responsive user experience, high system throughput, user satisfaction, and efficient resource usage. Monitoring average latency helps: identify performance regressions, optimize slow components, set realistic expectations, and ensure system meets user needs. For production systems, maintaining low average latency (<5 seconds for most tasks) is essential for user satisfaction.
Example Calculation
Given: Agent system processes 100 tasks
- N = 100 tasks
- Task latencies: T_1 = 2s, T_2 = 3s, T_3 = 1.5s, ..., T_100 = 2.5s
- Sum of all latencies = 250 seconds
Step 1: Calculate sum: \(\sum_{i=1}^{100} T_i = 250\) seconds
Step 2: Calculate average: (1/100) × 250 = 2.5 seconds
Result: Avg Latency = 2.5 seconds
Analysis: Average latency of 2.5 seconds is excellent for most use cases - users get fast responses. If some tasks are much slower (outliers), consider: p95 latency (95th percentile) to understand worst-case, or separate metrics for different task types.
Target: Maintain <3 seconds average for simple tasks, <10 seconds for complex tasks. If average exceeds targets, investigate: slow tool calls, inefficient agent reasoning, or coordination bottlenecks.
Interpretation: The system completes tasks in an average of 2.5 seconds, indicating good responsiveness. This metric helps track performance and identify when optimizations are needed. If average latency increases to 5+ seconds, it's a signal to investigate and optimize.
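The mean and the p95 mentioned above can be computed as follows. The p95 here uses a simple nearest-rank index without interpolation, which is an approximation but adequate for monitoring dashboards.

```python
def avg_latency(latencies):
    """Mean task latency in seconds."""
    return sum(latencies) / len(latencies)

def p95_latency(latencies):
    """Approximate 95th-percentile latency (nearest rank, no interpolation)."""
    ordered = sorted(latencies)
    idx = max(0, int(0.95 * len(ordered)) - 1)
    return ordered[idx]
```

Reporting p95 alongside the mean reveals the slow outliers that an average hides.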
Cost per Task
\[ \text{Cost} = \sum_{i=1}^{M} \text{tokens}_i \times \text{price\_per\_token}_i \]
What This Measures
This formula calculates the total cost of completing a task by summing the costs of all API calls made during task execution. It accounts for token usage across all LLM API calls (reasoning, generation, tool selection) and their respective pricing. This helps track and optimize operational costs for agent systems.
Breaking It Down
- M: Number of API calls - the total count of LLM API invocations made during task execution. Each call might be: agent reasoning (deciding what to do), tool selection (choosing which tool to use), response generation (creating final answer), or intermediate steps (multi-step reasoning). More complex tasks typically require more API calls.
- tokens_i: Tokens used in call i - the number of tokens consumed in API call i, including both input tokens (prompt, context) and output tokens (generated text). Token counts vary based on: prompt length, context size, response length, and model used. Different models have different tokenization, so token counts are model-specific.
- price_per_token_i: Price per token for call i - the cost per token for the specific API call, which may vary by: model used (GPT-4 is more expensive than GPT-3.5), input vs output (output tokens often cost more), or pricing tier (different rates for different usage levels). Prices are typically in dollars per 1000 tokens (e.g., $0.002 per 1K input tokens, $0.006 per 1K output tokens).
- tokens_i × price_per_token_i: Cost of call i - the cost for a single API call, calculated by multiplying tokens used by price per token. This gives the cost for that specific call.
- \(\sum_{i=1}^{M}\): Sum over all calls - adding up costs from all M API calls gives the total cost for the task. This accounts for all LLM usage during task execution.
- Cost: Total cost per task - the complete cost of executing one task, including all reasoning, tool selection, and generation steps. This helps understand cost per user interaction and optimize for cost efficiency.
Where This Is Used
This cost calculation is performed for each task to track operational expenses. It's used to: (1) monitor system costs (how much does each task cost?), (2) optimize agent behavior (reduce unnecessary API calls), (3) set pricing (if charging users, need to cover costs), (4) evaluate cost efficiency (compare costs across different agent designs), and (5) budget planning (estimate monthly/annual costs). This is essential for production systems where cost management is critical.
Why This Matters
Cost management is crucial for sustainable production agent systems. LLM API calls can be expensive, especially with high token usage. Understanding cost per task helps: identify expensive operations (which tasks cost most?), optimize agent efficiency (reduce unnecessary calls, use cheaper models when possible), set user pricing (ensure costs are covered), budget accurately (predict monthly costs), and make trade-offs (quality vs cost). Without cost tracking, systems can become uneconomical. Typical targets: <$0.01 per simple task, <$0.10 per complex task, optimize to reduce costs while maintaining quality.
Example Calculation
Given: Agent completes a research task
- M = 5 API calls (reasoning, tool selection, tool execution reasoning, result integration, final generation)
- Call 1: 500 tokens × $0.002/1K = $0.001
- Call 2: 300 tokens × $0.002/1K = $0.0006
- Call 3: 800 tokens × $0.002/1K = $0.0016
- Call 4: 600 tokens × $0.002/1K = $0.0012
- Call 5: 1200 tokens (800 input, 400 output) × ($0.002/1K input + $0.006/1K output) = $0.0016 + $0.0024 = $0.004
Step 1: Calculate cost for each call
Step 2: Sum all costs: $0.001 + $0.0006 + $0.0016 + $0.0012 + $0.004 = $0.0084
Result: Cost = $0.0084 per task
Analysis: Cost of $0.0084 (less than 1 cent) is reasonable for a research task. At 1000 tasks/day, daily cost = $8.40, monthly = ~$252. This is manageable for most applications.
Optimization: To reduce costs, could: use cheaper models for simple steps, reduce context size, cache common responses, or optimize prompts to generate shorter responses.
Interpretation: The task costs $0.0084, which is affordable. Tracking this metric helps ensure costs remain manageable as the system scales. If costs increase significantly, it's a signal to optimize agent behavior or consider cost-saving strategies.
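The worked example above can be checked in code. The per-1K-token prices mirror the illustrative rates used in the example, not any provider's actual pricing.

```python
def task_cost(calls):
    """Total task cost from (tokens, price_per_1k_tokens) pairs."""
    return sum(tokens * price_per_1k / 1000 for tokens, price_per_1k in calls)

# Call 5 is split into its input and output portions, priced separately
calls = [(500, 0.002), (300, 0.002), (800, 0.002), (600, 0.002),
         (800, 0.002), (400, 0.006)]
cost = task_cost(calls)  # matches the $0.0084 from the worked example
```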
Detailed Examples
Example: Evaluating Agent Performance
Test set: 100 tasks
Results:
- Successful: 85 tasks
- Failed: 10 tasks
- Timeout: 5 tasks
Metrics:
- Success rate: 85%
- Average latency: 3.2 seconds
- Average cost: $0.05 per task
Analysis: Agent performs well but has room for improvement in failure handling.
Example: Monitoring Agent Behavior
Observed patterns:
- Agent uses search tool 60% of the time
- Average 3 tool calls per task
- Most common error: Tool timeout (40% of failures)
- No infinite loops detected
Action items:
- Optimize search tool (reduce timeout rate)
- Cache frequent searches
- Add retry logic for timeouts
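The "add retry logic for timeouts" action item might look like the sketch below; `TimeoutError` and the flaky tool function are stand-ins for whatever the real tool layer raises and exposes.

```python
import time

def call_with_retry(tool_fn, *args, retries=3, base_delay=0.5):
    """Retry a tool call on timeout with exponential backoff."""
    for attempt in range(retries):
        try:
            return tool_fn(*args)
        except TimeoutError:
            if attempt == retries - 1:
                raise  # out of retries; surface the timeout
            time.sleep(base_delay * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...

# Hypothetical flaky tool: times out twice, then succeeds
attempts = {"n": 0}
def flaky_search(query):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("search timed out")
    return f"results for {query}"

result = call_with_retry(flaky_search, "agent metrics", base_delay=0)
```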
Implementation
Agent Evaluation System
```python
import time
from typing import Dict, List


class AgentEvaluator:
    """Evaluate agent performance."""

    def __init__(self):
        self.metrics = {
            'total_tasks': 0,
            'successful_tasks': 0,
            'failed_tasks': 0,
            'total_latency': 0.0,
            'total_cost': 0.0,
        }

    def evaluate_task(self, task, agent, expected_output=None) -> Dict:
        """Evaluate a single task and update aggregate metrics."""
        start_time = time.time()
        try:
            result = agent.execute(task)
            latency = time.time() - start_time
            # Assumes the agent records its API calls on `last_api_calls`
            cost = self.estimate_cost(agent.last_api_calls)

            # Check success
            success = self.check_success(result, expected_output)

            # Update metrics
            self.metrics['total_tasks'] += 1
            if success:
                self.metrics['successful_tasks'] += 1
            else:
                self.metrics['failed_tasks'] += 1
            self.metrics['total_latency'] += latency
            self.metrics['total_cost'] += cost

            return {
                'success': success,
                'latency': latency,
                'cost': cost,
                'result': result,
            }
        except Exception as e:
            # Count failures toward the totals too, so success rate and
            # average latency stay correct
            self.metrics['total_tasks'] += 1
            self.metrics['failed_tasks'] += 1
            self.metrics['total_latency'] += time.time() - start_time
            return {'success': False, 'error': str(e)}

    def get_metrics(self) -> Dict:
        """Get aggregated metrics."""
        total = self.metrics['total_tasks']
        if total == 0:
            return {}
        return {
            'success_rate': self.metrics['successful_tasks'] / total,
            'avg_latency': self.metrics['total_latency'] / total,
            'avg_cost': self.metrics['total_cost'] / total,
            'total_tasks': total,
        }

    def check_success(self, result, expected) -> bool:
        """Check if the result matches the expected output (exact match)."""
        if expected is None:
            return result is not None
        return result == expected

    def estimate_cost(self, api_calls: List[Dict]) -> float:
        """Estimate cost from API calls (simplified: $0.002 per 1K tokens)."""
        total_tokens = sum(call.get('tokens', 0) for call in api_calls)
        return (total_tokens / 1000) * 0.002


# Example usage
evaluator = AgentEvaluator()
# evaluator.evaluate_task("Task 1", agent, expected_output="...")
# metrics = evaluator.get_metrics()
```
Real-World Applications
Evaluation and Monitoring in Production
Continuous monitoring:
- Track agent performance in real-time
- Alert on quality degradation
- Detect anomalies in behavior
- Monitor costs and usage
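Alerting on quality degradation can start as a simple threshold check over rolling metrics. The thresholds below (95% success, 5 s average latency) echo the targets given earlier in the chapter and are illustrative.

```python
def check_alerts(success_rate, avg_latency_s, min_success=0.95, max_latency_s=5.0):
    """Return a list of alert messages for metrics outside their thresholds."""
    alerts = []
    if success_rate < min_success:
        alerts.append(f"success rate {success_rate:.1%} below {min_success:.0%}")
    if avg_latency_s > max_latency_s:
        alerts.append(f"avg latency {avg_latency_s:.1f}s above {max_latency_s:.0f}s")
    return alerts
```

A real system would feed these messages into a paging or dashboard tool rather than just returning them.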
A/B testing:
- Compare different agent configurations
- Test new prompts or tools
- Measure impact of changes
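A minimal A/B comparison on success rate might look like the sketch below; a real test should also check statistical significance (e.g. a two-proportion z-test) before declaring a winner.

```python
def compare_variants(a_success, a_total, b_success, b_total):
    """Compare two agent configurations by raw success rate."""
    rate_a = a_success / a_total
    rate_b = b_success / b_total
    winner = "A" if rate_a >= rate_b else "B"
    return {"A": rate_a, "B": rate_b, "winner": winner}

report = compare_variants(850, 1000, 910, 1000)
```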
Quality assurance:
- Validate agent outputs before deployment
- Regression testing for agent updates
- Compliance and safety checks