Learning Objectives

By the end of this chapter, you will be able to:

Explain the agentic AI concept behind CI/CD for Agents.
Apply CI/CD for Agents to design reliable, production-grade agent systems.
Recognize operational trade-offs in tool use, orchestration, safety, and cost.

Section 4 — Production Engineering

Chapter 19: CI/CD for Agents

Testing pipelines, golden trajectories, canary deployments, and rollback

Why Agent CI/CD Is Harder Than Standard CI/CD

Standard CI/CD tests deterministic code: given input X, expect output Y. Agents are non-deterministic: the same input may produce different outputs on different runs because of sampling temperature, timeouts and retries triggered by variable API latency, and variation in tool results. This means traditional unit tests must be supplemented with probabilistic and trajectory-based approaches.

Standard CI/CD

Unit test: assert output == expected
Integration test: end-to-end flow
One correct answer per test case
Deterministic pass/fail
Fast (seconds)

Agent CI/CD

Unit test: tool schemas, prompt parsing, dispatchers
Trajectory test: N runs, measure success rate
Many acceptable answers per task
Probabilistic pass (e.g., 95% success rate)
Slow (minutes per eval run)
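
A probabilistic pass criterion can be sketched in a few lines. This is a minimal illustration rather than a prescribed harness: `run_agent_once` is a hypothetical stand-in for a real agent invocation (simulated here with a seeded random generator), and the 95% threshold and run count are example values.

```python
import random
from typing import Callable

def success_rate(run_once: Callable[[], bool], n_runs: int) -> float:
    """Run the task n_runs times and return the fraction of successful runs."""
    if n_runs <= 0:
        raise ValueError("n_runs must be positive")
    successes = sum(1 for _ in range(n_runs) if run_once())
    return successes / n_runs

def passes_threshold(rate: float, threshold: float = 0.95) -> bool:
    """Probabilistic pass: the observed success rate must meet the threshold."""
    return rate >= threshold

# Hypothetical stand-in for a real agent run: succeeds roughly 97% of the time.
_rng = random.Random(0)

def run_agent_once() -> bool:
    return _rng.random() < 0.97

rate = success_rate(run_agent_once, n_runs=200)
print(f"success rate: {rate:.1%}, pass: {passes_threshold(rate)}")
```

Keep in mind that the estimate is noisy when N is small: with 20 runs, a single extra failure moves the observed rate by 5 percentage points.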

Testing Layers

1. Unit Tests: deterministic components
Test: tool schema validity, tool dispatcher routing, prompt template rendering, context assembler output, state schema validation. These are fully deterministic and run in milliseconds.

2. Integration Tests: agent in sandbox
Run the full agent against a mock tool environment. Tools return fixture responses. Measure: does the agent call the right tools? Does it terminate correctly? Runs in seconds.

3. Golden Trajectory Regression
Run the agent on your golden task set. Compare trajectories against reference solutions. Flag degradations in task success rate or step efficiency. Runs in minutes.

4. A/B / Shadow Mode
Run the new agent version in parallel with production on live traffic; compare outcomes without exposing users to the new version. Most expensive but most realistic.
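
The golden trajectory layer can be sketched as a comparison of tool-call sequences against a reference solution. The trajectory representation (an ordered list of tool names), the `TrajectoryResult` type, and the `max_extra_steps` tolerance are assumptions made for illustration; real trajectories usually carry arguments and tool results as well.

```python
from dataclasses import dataclass

@dataclass
class TrajectoryResult:
    task_id: str
    succeeded: bool
    tool_calls: list[str]   # ordered tool names the agent invoked

def check_against_golden(result: TrajectoryResult,
                         golden_calls: list[str],
                         max_extra_steps: int = 2) -> list[str]:
    """Return a list of regression flags; an empty list means no regression."""
    flags = []
    if not result.succeeded:
        flags.append("task_failed")
    # Every tool in the reference solution should appear in the new trajectory.
    missing = [t for t in golden_calls if t not in result.tool_calls]
    if missing:
        flags.append(f"missing_tools:{','.join(missing)}")
    # Step efficiency: tolerate a small number of extra steps over the reference.
    if len(result.tool_calls) > len(golden_calls) + max_extra_steps:
        flags.append("too_many_steps")
    return flags

golden = ["search_web", "fetch_page", "summarize"]
ok = TrajectoryResult("t1", True, ["search_web", "fetch_page", "summarize"])
slow = TrajectoryResult("t1", True, ["search_web"] * 5 + ["fetch_page", "summarize"])
print(check_against_golden(ok, golden))    # []
print(check_against_golden(slow, golden))  # ['too_many_steps']
```

Checking for required tools rather than exact sequence equality is a deliberate choice in this sketch: since many trajectories are acceptable for the same task, an exact match would flag harmless variation.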
                            

python — unit testing a tool dispatcher

import pytest
import json
from my_agent.tools import dispatch_tool_call

def test_dispatch_routes_correctly():
    registry = {
        "search_web": lambda query, max_results=5: f"results for {query}",
    }
    result = dispatch_tool_call(registry, "search_web", json.dumps({"query": "test"}))
    assert "results for test" in result

def test_dispatch_unknown_tool_returns_error():
    result = dispatch_tool_call({}, "nonexistent_tool", "{}")
    data = json.loads(result)
    assert data["retryable"] is False
    assert "Unknown tool" in data["error"]

def test_dispatch_malformed_args_returns_error():
    result = dispatch_tool_call({"fn": lambda: None}, "fn", "not valid json")
    data = json.loads(result)
    assert "Invalid tool arguments" in data["error"]
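
One layer up, an integration test swaps the real tools for fixtures and asserts on which tools the agent called. The sketch below is an assumption-laden illustration: `MockToolEnvironment` and `toy_agent` are hypothetical names, and `toy_agent` only stands in for the real agent loop so the example is self-contained. In practice you would pass the mock registry to your actual agent.

```python
from typing import Callable

class MockToolEnvironment:
    """Serves canned fixture responses and records every tool call."""

    def __init__(self, fixtures: dict[str, str]):
        self.fixtures = fixtures
        self.calls: list[tuple[str, dict]] = []

    def registry(self) -> dict[str, Callable[..., str]]:
        def make_tool(name: str) -> Callable[..., str]:
            def tool(**kwargs) -> str:
                self.calls.append((name, kwargs))   # record for later assertions
                return self.fixtures[name]          # canned response, no network
            return tool
        return {name: make_tool(name) for name in self.fixtures}

def toy_agent(goal: str, tools: dict[str, Callable[..., str]]) -> str:
    """Hypothetical stand-in for the real agent: one search, then an answer."""
    results = tools["search_web"](query=goal)
    return f"Answer based on: {results}"

def test_agent_uses_search_and_terminates():
    env = MockToolEnvironment({"search_web": "fixture: 3 results"})
    answer = toy_agent("agent ci/cd", env.registry())
    assert [name for name, _ in env.calls] == ["search_web"]
    assert env.calls[0][1] == {"query": "agent ci/cd"}
    assert "fixture: 3 results" in answer
```

Because the environment records calls instead of patching individual functions, the same fixture set can be reused across many test cases.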

CI/CD Pipeline Design

Code Change (prompt, tool, or agent logic) → Unit Tests (<1 min) → Integration (mock tools, ~2 min) → Eval Suite (golden tasks, ~15 min) → Canary (5% live traffic) → Full Deploy
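
The gating logic of this pipeline is simple: stages run cheapest first, and a failure stops everything after it. A real setup would normally express this in your CI system's configuration; the Python sketch below only illustrates the ordering, and `run_pipeline` and the stage names are hypothetical.

```python
from typing import Callable

Stage = tuple[str, Callable[[], bool]]

def run_pipeline(stages: list[Stage]) -> tuple[bool, list[str]]:
    """Run stages in order; stop at the first failure so costlier stages never start."""
    completed: list[str] = []
    for name, check in stages:
        if not check():
            return False, completed + [f"{name}:FAILED"]
        completed.append(name)
    return True, completed

ok, log = run_pipeline([
    ("unit", lambda: True),
    ("integration", lambda: True),
    ("eval_suite", lambda: False),   # e.g. success rate fell below the threshold
    ("canary", lambda: True),        # never reached
])
print(ok, log)  # False ['unit', 'integration', 'eval_suite:FAILED']
```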

Prompt and Tool Version Control

1. Store prompts in version control (git). System prompts and few-shot examples are code, so treat them like code. Every change is a commit with a message explaining the intent.
2. Semantic versioning for prompt templates. MAJOR: breaking changes (different output format). MINOR: improvements to same behavior. PATCH: typo fixes, clarifications.
3. Feature flags for agent capabilities. Disable new tools/behaviors at runtime without deploying new code. A feature flag lets you roll out a new tool to 10% of users while monitoring for problems.
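
The versioning and feature-flag ideas above might look like this in code. Everything here is a sketch under stated assumptions: `is_compatible`, `tool_enabled`, and the `TOOL_FLAGS` table are hypothetical names, and the compatibility rule (same MAJOR, not older than the pinned version) is one reasonable reading of semantic versioning applied to prompts.

```python
import hashlib

def parse_version(v: str) -> tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of ints."""
    major, minor, patch = (int(p) for p in v.split("."))
    return major, minor, patch

def is_compatible(pinned: str, candidate: str) -> bool:
    """A candidate prompt version is a drop-in if MAJOR matches and it is not older."""
    p, c = parse_version(pinned), parse_version(candidate)
    return c[0] == p[0] and c >= p

# Hypothetical flag table: tool name -> percentage of users who get the tool.
TOOL_FLAGS = {"search_web": 100, "code_interpreter": 10}

def tool_enabled(tool: str, user_id: str, flags: dict[str, int]) -> bool:
    """Deterministic per-user rollout; unknown tools are treated as disabled."""
    pct = flags.get(tool, 0)
    digest = hashlib.sha256(f"{tool}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return bucket < pct

print(is_compatible("2.1.0", "2.3.1"))   # True: same MAJOR, newer MINOR
print(is_compatible("2.1.0", "3.0.0"))   # False: MAJOR bump means a breaking change
```

Hashing the tool name together with the user ID means different tools roll out to different subsets of users, so one unlucky cohort does not receive every experimental capability at once.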

Canary Deployments and Rollback

Why agent rollbacks are harder

Rolling back a web server restores old code. Rolling back an agent may also require: (1) restoring old system prompt templates, (2) reverting tool schema changes (if the old schema is incompatible with new tool function signatures), and (3) clearing or migrating checkpointed state that was written by the new version. Plan your versioning strategy before you ship.
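
One way to make point (3) manageable is to stamp every checkpoint with the state schema version that wrote it, so a rolled-back agent can detect state it cannot read. The sketch below assumes JSON checkpoints and uses hypothetical names (`save_checkpoint`, `load_checkpoint`, `state_version`); it is not tied to any particular checkpointing library.

```python
import json
from typing import Optional

CURRENT_STATE_VERSION = 2  # hypothetical schema version written by the new agent

def save_checkpoint(state: dict, version: int = CURRENT_STATE_VERSION) -> str:
    """Serialize state together with the schema version that produced it."""
    return json.dumps({"state_version": version, "state": state})

def load_checkpoint(raw: str, supported_version: int) -> Optional[dict]:
    """Return the state if this agent version can read it, else None.

    On None, the caller decides whether to migrate the state or restart the task.
    """
    payload = json.loads(raw)
    if payload.get("state_version") != supported_version:
        return None
    return payload["state"]

raw = save_checkpoint({"step": 3, "notes": ["found source"]})
print(load_checkpoint(raw, supported_version=2))  # {'step': 3, 'notes': ['found source']}
print(load_checkpoint(raw, supported_version=1))  # None: v1 agent must not read v2 state
```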

python — canary rollout with feature flags

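import hashlib

# AgentCore, client, SYSTEM_PROMPT_V1/V2, TOOLS_V1/V2, and track_metric are
# assumed to be defined elsewhere in your codebase.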

def get_agent_version_for_user(user_id: str, canary_percentage: int = 5) -> str:
    """
    Route a fraction of users to the new agent version.
    Deterministic: same user always gets same version (no flickering).
    """
    # Deterministic hash-based assignment (MD5 is used only for bucketing, not for security)
    bucket = int(hashlib.md5(user_id.encode()).hexdigest()[:4], 16) % 100

    if bucket < canary_percentage:
        return "v2"   # new version
    return "v1"       # stable version

def create_agent(version: str) -> AgentCore:
    configs = {
        "v1": {"model": "gpt-4o", "system_prompt": SYSTEM_PROMPT_V1, "tools": TOOLS_V1},
        "v2": {"model": "gpt-4o", "system_prompt": SYSTEM_PROMPT_V2, "tools": TOOLS_V2},
    }
    cfg = configs[version]
    return AgentCore(
        llm_client=client,
        model=cfg["model"],
        system_prompt=cfg["system_prompt"],
        tools=cfg["tools"],
    )

# In your request handler:
def handle_request(user_id: str, goal: str) -> str:
    version = get_agent_version_for_user(user_id)
    agent = create_agent(version)
    track_metric("agent_version", version)   # monitor per-version metrics
    return agent.run(goal)

Rollback decision criteria

Automatically trigger a rollback if, within the first 30 minutes of the canary, any of the following occurs: task success rate drops by more than 5 percentage points, cost per task increases by more than 30%, or P99 latency increases by more than 50%. Calibrate these thresholds against your own system's metrics during stable baseline periods.
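
These criteria translate directly into a comparison between baseline and canary metrics. The following is a minimal sketch: `VersionMetrics` and `rollback_reasons` are hypothetical names, the defaults mirror the example thresholds above, and collecting the metrics over the 30-minute window is assumed to happen elsewhere.

```python
from dataclasses import dataclass

@dataclass
class VersionMetrics:
    success_rate: float    # fraction in [0, 1]
    cost_per_task: float   # any consistent currency unit
    p99_latency_s: float   # seconds

def rollback_reasons(baseline: VersionMetrics,
                     canary: VersionMetrics,
                     max_success_drop_pts: float = 5.0,
                     max_cost_increase: float = 0.30,
                     max_latency_increase: float = 0.50) -> list[str]:
    """Return the criteria the canary violates; roll back if the list is non-empty."""
    reasons = []
    # Success rate is compared in percentage points, not relative percent.
    if (baseline.success_rate - canary.success_rate) * 100 > max_success_drop_pts:
        reasons.append("success_rate")
    if canary.cost_per_task > baseline.cost_per_task * (1 + max_cost_increase):
        reasons.append("cost")
    if canary.p99_latency_s > baseline.p99_latency_s * (1 + max_latency_increase):
        reasons.append("latency")
    return reasons

baseline = VersionMetrics(success_rate=0.95, cost_per_task=0.10, p99_latency_s=20.0)
degraded = VersionMetrics(success_rate=0.88, cost_per_task=0.10, p99_latency_s=31.0)
print(rollback_reasons(baseline, degraded))  # ['success_rate', 'latency']
```

Returning the list of violated criteria, rather than a bare boolean, makes the rollback alert self-explanatory.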