Course: Building Agentic AI Systems | Chapter: 19 | Difficulty: advanced | Estimated Time: 600 min

Chapter 19: CI/CD for Agents

CI/CD for Agents in Building Agentic AI Systems.

Learning Objectives

By the end of this chapter, you will be able to:

  • Explain the agentic AI concept behind CI/CD for Agents.
  • Apply CI/CD for Agents to design reliable, production-grade agent systems.
  • Recognize operational trade-offs in tool use, orchestration, safety, and cost.

Testing pipelines, golden trajectories, canary deployments, and rollback

Why Agent CI/CD is Harder Than Standard CI/CD

Standard CI/CD tests deterministic code: given input X, expect output Y. Agents are non-deterministic: the same input may produce different outputs on different runs because of LLM sampling temperature, timing effects such as API latency and timeouts, and variation in tool results. Traditional unit tests must therefore be supplemented with probabilistic and trajectory-based approaches.

Standard CI/CD

  • Unit test: assert output == expected
  • Integration test: end-to-end flow
  • One correct answer per test case
  • Deterministic pass/fail
  • Fast (seconds)

Agent CI/CD

  • Unit test: tool schemas, prompt parsing, dispatchers
  • Trajectory test: N runs, measure success rate
  • Many acceptable answers per task
  • Probabilistic pass (e.g., 95% success rate)
  • Slow (minutes per eval run)
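
The "N runs, measure success rate" idea can be sketched as a small gate function. This is a minimal illustration, not a prescribed implementation: `success_rate`, `passes_gate`, and `make_stub_task` are hypothetical names, and the stub stands in for a real agent run that returns True when the task succeeded.

```python
from itertools import count
from typing import Callable


def success_rate(run_task: Callable[[], bool], n_runs: int) -> float:
    """Run the task n_runs times and return the fraction of successful runs."""
    if n_runs <= 0:
        raise ValueError("n_runs must be positive")
    successes = sum(1 for _ in range(n_runs) if run_task())
    return successes / n_runs


def passes_gate(run_task: Callable[[], bool],
                n_runs: int = 20,
                threshold: float = 0.95) -> bool:
    """Probabilistic pass: the measured success rate must reach the threshold."""
    return success_rate(run_task, n_runs) >= threshold


def make_stub_task(fail_every: int) -> Callable[[], bool]:
    """Hypothetical stand-in for an agent run: fails on every fail_every-th call."""
    calls = count(1)
    return lambda: next(calls) % fail_every != 0


# A task failing 2 of 20 runs (90%) misses a 95% gate.
print(passes_gate(make_stub_task(fail_every=10)))   # False
# A task with no failures in 20 runs passes.
print(passes_gate(make_stub_task(fail_every=25)))   # True
```

With small N the measured rate is noisy, so in practice the number of runs and the threshold are chosen together.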

Testing Layers

1. Unit Tests (deterministic components). Test: tool schema validity, tool dispatcher routing, prompt template rendering, context assembler output, state schema validation. These are fully deterministic and run in milliseconds.
2. Integration Tests (agent in sandbox). Run the full agent against a mock tool environment. Tools return fixture responses. Measure: does the agent call the right tools? Does it terminate correctly? Runs in seconds.
3. Golden Trajectory Regression. Run the agent on your golden task set. Compare trajectories against reference solutions. Flag degradations in task success rate or step efficiency. Runs in minutes.
4. A/B / Shadow Mode. Run the new agent version in parallel with production on live traffic; compare outcomes without exposing users to the new version. Most expensive but most realistic.
python: unit testing a tool dispatcher
import pytest
import json
from my_agent.tools import dispatch_tool_call

def test_dispatch_routes_correctly():
    registry = {
        "search_web": lambda query, max_results=5: f"results for {query}",
    }
    result = dispatch_tool_call(registry, "search_web", json.dumps({"query": "test"}))
    assert "results for test" in result

def test_dispatch_unknown_tool_returns_error():
    result = dispatch_tool_call({}, "nonexistent_tool", "{}")
    data = json.loads(result)
    assert data["retryable"] is False
    assert "Unknown tool" in data["error"]

def test_dispatch_malformed_args_returns_error():
    result = dispatch_tool_call({"fn": lambda: None}, "fn", "not valid json")
    data = json.loads(result)
    assert "Invalid tool arguments" in data["error"]
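
The golden trajectory regression layer can be sketched in the same spirit. The example below is a simplified assumption about how such a check might look: it compares summary metrics (success rate and mean step count) between a baseline run and a candidate run rather than diffing full trajectories, and `TrajectoryResult`, `find_regressions`, and the default thresholds are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class TrajectoryResult:
    task_id: str
    succeeded: bool
    steps: int  # number of agent steps (tool calls) used on the task


def success_rate(results: list[TrajectoryResult]) -> float:
    return sum(r.succeeded for r in results) / len(results)


def find_regressions(baseline: list[TrajectoryResult],
                     candidate: list[TrajectoryResult],
                     max_success_drop: float = 0.05,
                     max_step_increase: float = 0.20) -> list[str]:
    """Return human-readable regressions; an empty list means the gate passes."""
    problems = []

    # Degradation in overall task success rate.
    drop = success_rate(baseline) - success_rate(candidate)
    if drop > max_success_drop:
        problems.append(f"success rate dropped by {drop:.0%}")

    # Degradation in step efficiency (relative increase in mean steps).
    base_steps = mean(r.steps for r in baseline)
    increase = (mean(r.steps for r in candidate) - base_steps) / base_steps
    if increase > max_step_increase:
        problems.append(f"mean steps increased by {increase:.0%}")

    # Tasks that succeeded in the baseline but fail in the candidate.
    passed_before = {r.task_id for r in baseline if r.succeeded}
    for r in candidate:
        if r.task_id in passed_before and not r.succeeded:
            problems.append(f"task {r.task_id} newly failing")
    return problems


baseline = [TrajectoryResult("t1", True, 4), TrajectoryResult("t2", True, 6)]
candidate = [TrajectoryResult("t1", True, 4), TrajectoryResult("t2", False, 9)]
for problem in find_regressions(baseline, candidate):
    print(problem)
```

Reporting newly failing tasks by ID, in addition to the aggregate numbers, makes it easier to see which golden task a prompt or tool change broke.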

CI/CD Pipeline Design

Code Change (prompt, tool, or agent logic) → Unit Tests (under 1 min) → Integration (mock tools, ~2 min) → Eval Suite (golden tasks, ~15 min) → Canary (5% live traffic) → Full Deploy
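
One way to read the pipeline is as an ordered list of gates, cheapest first, that stops at the first failure so that slow stages never run on a change that already failed a fast one. The sketch below assumes each stage can be represented as a callable returning True on success; `run_pipeline` and the stage callables are hypothetical placeholders, not part of any CI product.

```python
from typing import Callable

Stage = tuple[str, Callable[[], bool]]


def run_pipeline(stages: list[Stage]) -> tuple[bool, list[str]]:
    """Run stages in order and stop at the first failure.

    Returns (all_passed, names_of_stages_that_ran).
    """
    ran = []
    for name, check in stages:
        ran.append(name)
        if not check():
            return False, ran
    return True, ran


# Placeholder checks; a real pipeline would invoke pytest, the eval suite, etc.
stages = [
    ("unit", lambda: True),
    ("integration", lambda: False),   # simulate a failing integration stage
    ("eval_suite", lambda: True),
    ("canary", lambda: True),
]
ok, ran = run_pipeline(stages)
print(ok, ran)   # False ['unit', 'integration']
```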

Prompt and Tool Version Control

1. Store prompts in version control (git). System prompts and few-shot examples are code, so treat them like code. Every change is a commit with a message explaining the intent.
2. Semantic versioning for prompt templates. MAJOR: breaking changes (different output format). MINOR: improvements to same behavior. PATCH: typo fixes, clarifications.
3. Feature flags for agent capabilities. Disable new tools/behaviors at runtime without deploying new code. A feature flag lets you roll out a new tool to 10% of users while monitoring for problems.
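
The MAJOR/MINOR/PATCH convention for prompt templates can be made concrete with a small value type. This is a sketch under the assumption that versions are plain `MAJOR.MINOR.PATCH` strings; `PromptVersion` and its methods are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class PromptVersion:
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "PromptVersion":
        major, minor, patch = (int(part) for part in text.split("."))
        return cls(major, minor, patch)

    def bump(self, change: str) -> "PromptVersion":
        """Return the next version for a 'major', 'minor', or 'patch' change."""
        if change == "major":    # breaking change, e.g. different output format
            return PromptVersion(self.major + 1, 0, 0)
        if change == "minor":    # improvement to the same behavior
            return PromptVersion(self.major, self.minor + 1, 0)
        if change == "patch":    # typo fix or clarification
            return PromptVersion(self.major, self.minor, self.patch + 1)
        raise ValueError(f"unknown change type: {change}")

    def same_output_format_as(self, other: "PromptVersion") -> bool:
        """Under this convention, only a MAJOR bump changes the output format."""
        return self.major == other.major

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"


current = PromptVersion.parse("1.4.2")
print(current.bump("minor"))                                  # 1.5.0
print(current.same_output_format_as(current.bump("major")))   # False
```

A downstream parser could use a check like `same_output_format_as` to refuse prompts whose MAJOR version it was not written for.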

Canary Deployments and Rollback

Why agent rollbacks are harder

Rolling back a web server restores old code. Rolling back an agent may also require: (1) restoring old system prompt templates, (2) reverting tool schema changes (if the old schema is incompatible with new tool function signatures), and (3) clearing or migrating checkpointed state that was written by the new version. Plan your versioning strategy before you ship.
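
Point (3) is the easiest to overlook, so here is a minimal sketch of one possible approach: tag each checkpoint with the state schema version that wrote it, and on rollback either downgrade the state step by step or clear it. The function name, the integer schema versions, and the `downgraders` mapping are all assumptions made for illustration.

```python
from typing import Callable, Optional

Downgrader = Callable[[dict], dict]


def load_state_after_rollback(state: dict,
                              written_by: int,
                              supported: int,
                              downgraders: dict[int, Downgrader]) -> Optional[dict]:
    """Return state usable at schema version `supported`, or None if it must be cleared.

    downgraders[v] converts state from schema version v to version v - 1.
    State written by an older or equal schema version is returned unchanged
    (forward migration is out of scope for this sketch).
    """
    version = written_by
    current = dict(state)
    while version > supported:
        step = downgraders.get(version)
        if step is None:
            return None   # no migration path: clear the checkpoint, restart the task
        current = step(current)
        version -= 1
    return current


# Hypothetical example: schema v2 added a "plan_tree" field that v1 cannot read.
downgraders = {2: lambda s: {k: v for k, v in s.items() if k != "plan_tree"}}
checkpoint = {"goal": "book flight", "plan_tree": ["search", "compare"]}

print(load_state_after_rollback(checkpoint, written_by=2, supported=1,
                                downgraders=downgraders))
# {'goal': 'book flight'}
```

Returning None instead of raising keeps the decision to restart the task with the caller, which may want to notify the user first.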

python: canary rollout with feature flags
import hashlib

def get_agent_version_for_user(user_id: str, canary_percentage: int = 5) -> str:
    """
    Route a fraction of users to the new agent version.
    Deterministic: same user always gets same version (no flickering).
    """
    # Deterministic hash-based assignment
    bucket = int(hashlib.md5(user_id.encode()).hexdigest()[:4], 16) % 100

    if bucket < canary_percentage:
        return "v2"   # new version
    return "v1"       # stable version

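# Assumes AgentCore, client, SYSTEM_PROMPT_V1/V2, TOOLS_V1/V2, and track_metric
# are defined or imported elsewhere in your project.
def create_agent(version: str) -> AgentCore: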
    configs = {
        "v1": {"model": "gpt-4o", "system_prompt": SYSTEM_PROMPT_V1, "tools": TOOLS_V1},
        "v2": {"model": "gpt-4o", "system_prompt": SYSTEM_PROMPT_V2, "tools": TOOLS_V2},
    }
    cfg = configs[version]
    return AgentCore(
        llm_client=client,
        model=cfg["model"],
        system_prompt=cfg["system_prompt"],
        tools=cfg["tools"],
    )

# In your request handler:
def handle_request(user_id: str, goal: str) -> str:
    version = get_agent_version_for_user(user_id)
    agent = create_agent(version)
    track_metric("agent_version", version)   # monitor per-version metrics
    return agent.run(goal)

Rollback decision criteria

Automatically trigger a rollback if, within the first 30 minutes of the canary, any of the following occurs relative to the stable version: task success rate drops by more than 5 percentage points, cost per task increases by more than 30%, or P99 latency increases by more than 50%. Calibrate these thresholds to your specific system during stable baseline periods.
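
These criteria translate directly into a check that compares canary metrics with the stable baseline. The sketch below assumes the caller has already aggregated metrics per version over the canary window; `VersionMetrics` and `rollback_reasons` are hypothetical names, and the defaults simply mirror the thresholds quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionMetrics:
    success_rate: float    # fraction of tasks completed successfully, 0.0 to 1.0
    cost_per_task: float   # any currency unit, the same for both versions
    p99_latency_s: float   # 99th percentile latency in seconds


def rollback_reasons(baseline: VersionMetrics,
                     canary: VersionMetrics,
                     max_success_drop_pts: float = 5.0,
                     max_cost_increase: float = 0.30,
                     max_latency_increase: float = 0.50) -> list[str]:
    """Return the violated criteria; a non-empty list means roll back."""
    reasons = []
    drop_pts = (baseline.success_rate - canary.success_rate) * 100
    if drop_pts > max_success_drop_pts:
        reasons.append("success_rate")
    if canary.cost_per_task > baseline.cost_per_task * (1 + max_cost_increase):
        reasons.append("cost_per_task")
    if canary.p99_latency_s > baseline.p99_latency_s * (1 + max_latency_increase):
        reasons.append("p99_latency")
    return reasons


stable = VersionMetrics(success_rate=0.92, cost_per_task=0.10, p99_latency_s=8.0)
canary = VersionMetrics(success_rate=0.85, cost_per_task=0.11, p99_latency_s=9.0)
print(rollback_reasons(stable, canary))   # ['success_rate']
```

Returning the list of violated criteria rather than a bare boolean gives the rollback alert something concrete to report.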

Chapter 19 Quiz

1. Why can't you use a standard "assert output == expected" unit test for an agent's final answer?

2. In a canary deployment, why should user routing be deterministic (hash-based)?

3. Why is rolling back an agent harder than rolling back a standard web application?