Chapter 19: CI/CD for Agents
CI/CD for Agents in Building Agentic AI Systems.
Learning Objectives
By the end of this chapter, you will be able to:
- Explain the agentic AI concept behind CI/CD for Agents.
- Apply CI/CD for Agents to design reliable, production-grade agent systems.
- Recognize operational trade-offs in tool use, orchestration, safety, and cost.
Testing pipelines, golden trajectories, canary deployments, and rollback
Why Agent CI/CD Is Harder Than Standard CI/CD
Standard CI/CD tests deterministic code: given input X, expect output Y. Agents are non-deterministic: the same input can produce different outputs on different runs because of LLM sampling temperature, timeouts and retries triggered by API latency, and variation in tool results. Traditional unit tests still cover the deterministic parts of the system, but they must be supplemented with probabilistic and trajectory-based tests.
Standard CI/CD
- Unit test: assert output == expected
- Integration test: end-to-end flow
- One correct answer per test case
- Deterministic pass/fail
- Fast (seconds)
Agent CI/CD
- Unit test: tool schemas, prompt parsing, dispatchers
- Trajectory test: N runs, measure success rate
- Many acceptable answers per task
- Probabilistic pass (e.g., 95% success rate)
- Slow (minutes per eval run)
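The trajectory-test idea in the list above (N runs, pass on success rate rather than on a single output) can be sketched as follows. This is a minimal illustration under stated assumptions, not the book's harness: `success_rate`, `make_stub_agent`, and the 0.95 threshold are hypothetical, and the stub stands in for a real agent run plus its success check.

```python
import random
from typing import Callable


def success_rate(run_once: Callable[[], bool], n_runs: int) -> float:
    """Run the same task n_runs times and return the fraction that succeeded."""
    if n_runs <= 0:
        raise ValueError("n_runs must be positive")
    successes = sum(1 for _ in range(n_runs) if run_once())
    return successes / n_runs


def make_stub_agent(p_success: float, seed: int) -> Callable[[], bool]:
    """Hypothetical stand-in for one agent run followed by a success check."""
    rng = random.Random(seed)
    return lambda: rng.random() < p_success


def test_trajectory_success_rate():
    # Assumption: 0.95 mirrors the example threshold in the list above.
    run_agent_once = make_stub_agent(p_success=1.0, seed=0)
    assert success_rate(run_agent_once, n_runs=20) >= 0.95
```

In a real pipeline, `run_once` would call the agent on a golden task and grade the result. The choice of N trades eval time against how tightly the measured rate estimates the true one.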
Testing Layers
import pytest
import json

from my_agent.tools import dispatch_tool_call


def test_dispatch_routes_correctly():
    # A registered tool name is routed to its callable with the parsed arguments.
    registry = {
        "search_web": lambda query, max_results=5: f"results for {query}",
    }
    result = dispatch_tool_call(registry, "search_web", json.dumps({"query": "test"}))
    assert "results for test" in result


def test_dispatch_unknown_tool_returns_error():
    # An unknown tool yields a structured, non-retryable error instead of raising.
    result = dispatch_tool_call({}, "nonexistent_tool", "{}")
    data = json.loads(result)
    assert data["retryable"] is False
    assert "Unknown tool" in data["error"]


def test_dispatch_malformed_args_returns_error():
    # Arguments that are not valid JSON yield a structured error.
    result = dispatch_tool_call({"fn": lambda: None}, "fn", "not valid json")
    data = json.loads(result)
    assert "Invalid tool arguments" in data["error"]
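The chapter does not show `my_agent.tools.dispatch_tool_call` itself. The sketch below is one hypothetical implementation that would satisfy the three tests above. The `_error` helper and the choice to mark malformed arguments as retryable are assumptions, and exceptions raised by the tool itself are deliberately left unhandled.

```python
import json
from typing import Any, Callable, Dict


def _error(message: str, retryable: bool) -> str:
    """Serialize an error in the shape the tests above read back."""
    return json.dumps({"error": message, "retryable": retryable})


def dispatch_tool_call(
    registry: Dict[str, Callable[..., Any]], name: str, raw_args: str
) -> str:
    """Look up a tool by name, parse its JSON arguments, and call it."""
    if name not in registry:
        return _error(f"Unknown tool: {name}", retryable=False)
    try:
        kwargs = json.loads(raw_args)
    except json.JSONDecodeError as exc:
        # Assumption: the model can retry with corrected arguments.
        return _error(f"Invalid tool arguments: {exc}", retryable=True)
    if not isinstance(kwargs, dict):
        return _error("Invalid tool arguments: expected a JSON object", retryable=True)
    return str(registry[name](**kwargs))
```

Returning errors as JSON strings rather than raising keeps the dispatcher easy to unit test and lets the agent loop feed the error back to the model.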
CI/CD Pipeline Design
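Pipeline stages (recovered from the original diagram; only these labels survive):
- Trigger: a change to prompt, tool, or agent logic
- First stage: <1 min
- Tests with mock tools: ~2 min
- Golden tasks: ~15 min
- Canary: 5% of live traffic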
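The ordering in this pipeline, cheapest checks first with slower stages gated behind them, can be sketched as a fail-fast runner. The `Stage` and `run_pipeline` names and the stage labels in the usage note are hypothetical; real checks would invoke your test runner and eval harness rather than return constants.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    check: Callable[[], bool]  # returns True when the stage passes


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure (fail fast)."""
    log: List[str] = []
    for stage in stages:
        passed = stage.check()
        log.append(f"{stage.name}: {'pass' if passed else 'fail'}")
        if not passed:
            break  # skip the slower, more expensive stages
    return log
```

For example, `run_pipeline([Stage("unit", lambda: True), Stage("mock-tools", lambda: False), Stage("golden-tasks", lambda: True)])` stops after the mock-tools stage, so the roughly 15-minute golden-task run is never started for a change that already failed a cheaper check.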
Prompt and Tool Version Control
Canary Deployments and Rollback
Why agent rollbacks are harder
Rolling back a web server restores old code. Rolling back an agent may also require: (1) restoring old system prompt templates, (2) reverting tool schema changes (if the old schema is incompatible with new tool function signatures), and (3) clearing or migrating checkpointed state that was written by the new version. Plan your versioning strategy before you ship.
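One way to make those three rollback concerns explicit is to version the prompt, tool schemas, and checkpoint format together as a single release. The sketch below is an illustration under stated assumptions: `AgentRelease`, `fingerprint`, and `can_roll_back` are hypothetical names, and the compatibility rule (state schema versions must match) is a deliberately simple placeholder for whatever migration policy your system uses.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class AgentRelease:
    system_prompt: str
    tool_schemas: List[Dict[str, Any]]
    state_schema_version: int  # bump when the checkpoint format changes

    def fingerprint(self) -> str:
        """Content hash over everything that must be restored together."""
        payload = json.dumps(
            {
                "prompt": self.system_prompt,
                "tools": self.tool_schemas,
                "state": self.state_schema_version,
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def can_roll_back(current: AgentRelease, target: AgentRelease) -> bool:
    """Assumption: checkpoints are readable only across equal schema versions."""
    return current.state_schema_version == target.state_schema_version
```

Logging the fingerprint with every run makes it possible to tell which prompt and tool set produced a given trajectory, and a failed `can_roll_back` check signals that checkpointed state must be cleared or migrated first.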
import hashlib

# AgentCore, client, SYSTEM_PROMPT_V1/V2, TOOLS_V1/V2, and track_metric are
# assumed to be defined elsewhere in the codebase.


def get_agent_version_for_user(user_id: str, canary_percentage: int = 5) -> str:
    """
    Route a fraction of users to the new agent version.
    Deterministic: same user always gets same version (no flickering).
    """
    # Deterministic hash-based assignment. MD5 is used only for bucketing,
    # not for security; the first 4 hex digits give a value in 0..65535.
    bucket = int(hashlib.md5(user_id.encode()).hexdigest()[:4], 16) % 100
    if bucket < canary_percentage:
        return "v2"  # new version
    return "v1"  # stable version


def create_agent(version: str) -> AgentCore:
    # Each version pins its own model, system prompt, and tool set.
    configs = {
        "v1": {"model": "gpt-4o", "system_prompt": SYSTEM_PROMPT_V1, "tools": TOOLS_V1},
        "v2": {"model": "gpt-4o", "system_prompt": SYSTEM_PROMPT_V2, "tools": TOOLS_V2},
    }
    cfg = configs[version]
    return AgentCore(
        llm_client=client,
        model=cfg["model"],
        system_prompt=cfg["system_prompt"],
        tools=cfg["tools"],
    )


# In your request handler:
def handle_request(user_id: str, goal: str) -> str:
    version = get_agent_version_for_user(user_id)
    agent = create_agent(version)
    track_metric("agent_version", version)  # monitor per-version metrics
    return agent.run(goal)
Rollback decision criteria
Automatically trigger a rollback if, within the first 30 minutes of the canary, any of the following holds relative to the stable version: task success rate drops by more than 5 percentage points, cost per task increases by more than 30%, or P99 latency increases by more than 50%. Calibrate these thresholds to your specific system during stable baseline periods.
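The three rollback criteria can be expressed as a single check. This is a minimal sketch: `VersionMetrics` and `should_roll_back` are hypothetical names, the default thresholds are the example values from the paragraph above, and the function assumes non-zero baseline cost and latency. Restricting evaluation to the first 30 minutes of the canary is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class VersionMetrics:
    success_rate: float   # fraction of tasks that succeeded, in [0, 1]
    cost_per_task: float  # any consistent currency unit
    p99_latency_s: float  # seconds


def should_roll_back(
    baseline: VersionMetrics,
    canary: VersionMetrics,
    max_success_drop_pp: float = 5.0,
    max_cost_increase: float = 0.30,
    max_latency_increase: float = 0.50,
) -> bool:
    """Return True if the canary breaches any threshold relative to baseline."""
    success_drop_pp = (baseline.success_rate - canary.success_rate) * 100
    cost_increase = canary.cost_per_task / baseline.cost_per_task - 1
    latency_increase = canary.p99_latency_s / baseline.p99_latency_s - 1
    return (
        success_drop_pp > max_success_drop_pp
        or cost_increase > max_cost_increase
        or latency_increase > max_latency_increase
    )
```

Note the different units: the success criterion is an absolute drop in percentage points, while cost and latency are relative increases. With only 5% of traffic on the canary, early samples can be small, so a production check would likely also require a minimum number of completed tasks before acting.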
Chapter 19 Quiz
1. Why can't you use a standard "assert output == expected" unit test for an agent's final answer?
2. In a canary deployment, why should user routing be deterministic (hash-based)?
3. Why is rolling back an agent harder than rolling back a standard web application?