Course: Building Agentic AI Systems | Chapter: 18 | Difficulty: advanced | Estimated time: 600 min

Chapter 18: Deployment & Scaling

Learning Objectives

By the end of this chapter, you will be able to:

  • Explain how deploying an agent differs from deploying a conventional stateless API.
  • Apply stateless and stateful designs, async task queues, and model routing to build reliable, production-grade agent systems.
  • Recognize operational trade-offs in tool use, orchestration, safety, and cost.

Stateless vs stateful, async queues, model routing, cost optimization

Deploying Agents at Scale

An agent that works in a notebook is not production-ready. Production deployment requires decisions about state management, async execution, load balancing, caching, and cost. This chapter covers each.

Key difference from regular API deployment

A standard API request typically completes in milliseconds and is stateless. An agent task may take minutes, maintain state across many LLM/tool calls, and consume resources unpredictably (a task requiring 3 tool calls is very different from one requiring 25). Standard web server deployment patterns must therefore be augmented with async task queues, checkpointing, and fine-grained cost tracking.
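The "fine-grained cost tracking" mentioned above can be sketched as a per-task ledger that records token usage for every step. This is a minimal sketch, not a definitive implementation: the model names, the prices in `PRICES_PER_MTOK`, and the `TaskLedger` / `StepUsage` classes are all hypothetical placeholders, and real token counts would come from your provider's usage metadata.

```python
from dataclasses import dataclass, field
from typing import List

# ASSUMPTION: placeholder prices in USD per 1M tokens, not real provider rates.
PRICES_PER_MTOK = {
    "small-model": {"input": 0.20, "output": 0.80},
    "large-model": {"input": 3.00, "output": 12.00},
}


@dataclass
class StepUsage:
    """Token usage for one LLM call inside an agent task."""
    step: str
    model: str
    input_tokens: int
    output_tokens: int

    @property
    def cost_usd(self) -> float:
        price = PRICES_PER_MTOK[self.model]
        return (self.input_tokens * price["input"]
                + self.output_tokens * price["output"]) / 1_000_000


@dataclass
class TaskLedger:
    """Accumulates per-step usage so cost can be attributed to a task."""
    task_id: str
    steps: List[StepUsage] = field(default_factory=list)

    def record(self, step: str, model: str,
               input_tokens: int, output_tokens: int) -> None:
        self.steps.append(StepUsage(step, model, input_tokens, output_tokens))

    @property
    def total_usd(self) -> float:
        return sum(s.cost_usd for s in self.steps)


# Usage: two steps of one task, routed to different (hypothetical) models.
ledger = TaskLedger("demo-task")
ledger.record("tool_dispatch", "small-model", 4_000, 200)
ledger.record("final_synthesis", "large-model", 10_000, 1_000)
```

Recording cost per step rather than per request makes it possible to see which steps of a 25-call task dominate spending.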

Stateless vs Stateful Deployment

Stateless Agent

  • No state persisted between requests
  • Each request is a fresh start
  • Horizontally scalable (any server handles any request)
  • Simple; no session affinity needed
  • Use for: single-turn, short-task agents

Stateful Agent

  • State persisted in external store (Redis, DB)
  • Requests in the same session share state
  • Requires sticky routing OR session key lookup
  • Supports resumable tasks, HITL, multi-session memory
  • Use for: multi-turn assistants, long-running workflows

Architecture (request flow, top to bottom):

Request
  • API Gateway / Load Balancer: routes by session_id for stateful agents; round-robin (or random) for stateless agents

Worker
  • Agent Worker A: runs agent loop; reads/writes state store
  • Agent Worker B: runs agent loop; reads/writes state store
  • Agent Worker N…: auto-scaled based on queue depth

State
  • Redis / DynamoDB: session state, checkpoints, working memory
  • Vector DB: long-term semantic memory
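
The "session key lookup" option for stateful agents can be illustrated with a small sketch: every worker loads state by session ID from an external store, runs one step, and writes the state back, so no sticky routing is needed. The `InMemoryStore` class below is a hypothetical stand-in for Redis or DynamoDB so the example runs without a server, and `handle_turn` is an illustrative name rather than part of any library.

```python
import json
from typing import Dict, Optional


class InMemoryStore:
    """Stand-in for an external store such as Redis (ASSUMPTION: demo only)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


def handle_turn(store: InMemoryStore, session_id: str, user_message: str) -> dict:
    """Load session state, apply one turn, and persist the updated state."""
    key = f"session:{session_id}"
    raw = store.get(key)
    state = json.loads(raw) if raw else {"turns": []}
    state["turns"].append(user_message)  # a real agent step would run here
    store.set(key, json.dumps(state))    # checkpoint after every turn
    return state


# Any worker can serve any turn because the state lives in the shared store.
store = InMemoryStore()
handle_turn(store, "s1", "hello")          # imagine this runs on worker A
state = handle_turn(store, "s1", "again")  # and this one on worker B
```

Because the state is checkpointed after every turn, a task interrupted by a worker crash can resume from the last saved state instead of starting over.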

Async Processing with Task Queues

A user should not wait synchronously for a 3-minute agent task to complete. Use an async task queue: the API receives the request, creates a task, returns a task_id immediately, and the result is fetched later via polling or webhook.

python — async agent task with Celery
import json
import uuid

import redis
from celery import Celery
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

import agent_core  # your agent from earlier chapters (assumed importable under this name)

celery_app = Celery("agent_tasks", broker="redis://localhost:6379/0", backend="redis://localhost:6379/1")
api = FastAPI()
# decode_responses=True makes Redis return str instead of bytes
r = redis.Redis(host="localhost", port=6379, db=2, decode_responses=True)

class AgentRequest(BaseModel):
    goal: str
    user_id: str

# ── Celery task: runs the agent loop asynchronously ──────────────────────────

@celery_app.task(bind=True, max_retries=3)
def run_agent_task(self, task_id: str, goal: str, user_id: str) -> dict:
    key = f"task:{task_id}"
    try:
        r.hset(key, mapping={"status": "running", "progress": "0"})
        result = agent_core.run(goal)
        # Redis hash values must be strings or numbers, so store the result as JSON
        r.hset(key, mapping={"status": "done", "result": json.dumps(result)})
        return {"status": "done", "result": result}
    except Exception as exc:
        # Mark the task failed only once retries are exhausted
        if self.request.retries >= self.max_retries:
            r.hset(key, mapping={"status": "failed", "error": str(exc)})
            raise
        r.hset(key, mapping={"status": "retrying", "error": str(exc)})
        raise self.retry(exc=exc, countdown=60)

# ── API endpoints ─────────────────────────────────────────────────────────────

@api.post("/tasks")
def create_task(req: AgentRequest) -> dict:
    task_id = str(uuid.uuid4())
    # Record the task before enqueueing so an immediate GET does not return 404
    r.hset(f"task:{task_id}", mapping={"status": "queued", "user_id": req.user_id})
    run_agent_task.delay(task_id, req.goal, req.user_id)
    return {"task_id": task_id, "status": "queued"}

@api.get("/tasks/{task_id}")
def get_task_status(task_id: str) -> dict:
    data = r.hgetall(f"task:{task_id}")
    if not data:
        raise HTTPException(status_code=404, detail="Task not found")
    return data
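
On the client side, "fetched later via polling" usually means polling with a growing delay rather than in a tight loop. The following is a minimal sketch under stated assumptions: `poll_task` and its parameters are hypothetical names, `fetch_status` stands in for an HTTP GET against the task status endpoint, and the statuses `done` and `failed` are treated as terminal.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"done", "failed"}


def poll_task(
    fetch_status: Callable[[], dict],
    *,
    initial_delay: float = 0.5,
    max_delay: float = 8.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll until the task reaches a terminal status, doubling the delay each time."""
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        status = fetch_status()
        if status.get("status") in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError("task did not finish before the timeout")
        sleep(delay)
        delay = min(delay * 2, max_delay)  # exponential backoff, capped


# Demo with canned responses instead of real HTTP calls.
_fake_responses = iter([
    {"status": "queued"},
    {"status": "running"},
    {"status": "done", "result": "42"},
])
delays = []  # records the requested sleeps instead of actually sleeping
final = poll_task(lambda: next(_fake_responses), sleep=delays.append)
```

Injecting `sleep` and `clock` keeps the function testable without real waiting. For long tasks a webhook avoids polling entirely, at the cost of requiring the client to expose an endpoint.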

Cost Optimization

LLM costs dominate agent infrastructure costs. Understanding where money goes is the first step to cutting it.

1. Input tokens (usually the largest cost): system prompt + tool schemas + conversation history + memory retrievals. Reduce by: shorter system prompt, fewer tools per turn, aggressive context eviction.
2. Output tokens: the model's reasoning + action + answer. Reduce by: structured output (JSON vs prose), instructing the model to be concise.
3. Model tier: o3 costs roughly 20× more than gpt-4o-mini (the exact ratio shifts with provider pricing, so check current rates). Most agent steps are simple (dispatch a tool, format a result), so use the cheapest model that can handle each step.
4. Prompt caching: Anthropic and OpenAI offer prompt prefix caching, in which the system prompt and static context are cached at the API level. Cached prefix tokens are billed at a steep discount (up to ~90%, depending on provider and model).
5. Semantic caching: cache agent responses by query similarity. If the same question was answered recently with high confidence, return the cached result. Requires a TTL and a freshness heuristic.
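
The semantic caching idea (item 5) can be sketched as a similarity lookup with a TTL. This is a toy illustration, not a production design: `toy_embed` is a bag-of-words stand-in for a real embedding model, the linear scan would be replaced by a vector index at scale, and the 0.9 threshold and the `SemanticCache` class itself are assumptions made for the example.

```python
import math
import time
from collections import Counter
from typing import Callable, Mapping, Optional


def toy_embed(text: str) -> Mapping[str, float]:
    """ASSUMPTION: toy bag-of-words vector standing in for a real embedding model."""
    words = (w.strip("?.,!").lower() for w in text.split())
    return Counter(w for w in words if w)


def cosine(a: Mapping[str, float], b: Mapping[str, float]) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed=toy_embed, threshold: float = 0.9,
                 ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.time) -> None:
        self.embed = embed
        self.threshold = threshold
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = []  # list of (vector, answer, stored_at)

    def put(self, query: str, answer: str) -> None:
        self._entries.append((self.embed(query), answer, self.clock()))

    def get(self, query: str) -> Optional[str]:
        now = self.clock()
        # Freshness heuristic: drop anything older than the TTL.
        self._entries = [e for e in self._entries if now - e[2] <= self.ttl]
        qvec = self.embed(query)
        best = max(self._entries, key=lambda e: cosine(qvec, e[0]), default=None)
        if best is not None and cosine(qvec, best[0]) >= self.threshold:
            return best[1]
        return None


# Demo with a controllable clock so TTL expiry can be shown without waiting.
now = [0.0]
cache = SemanticCache(ttl_seconds=60.0, clock=lambda: now[0])
cache.put("What is the refund policy?", "Refunds within 30 days.")
hit = cache.get("what is the refund policy")
miss = cache.get("How do I reset my password?")
now[0] = 61.0
expired = cache.get("what is the refund policy")
```

A high similarity threshold trades hit rate for safety: returning a cached answer to a question that only looks similar is usually worse than paying for a fresh call.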

Model Routing Pattern

python — route steps to cheapest viable model
def select_model_for_step(step_type: str, task_complexity: str) -> str:
    """
    Route to the cheapest model that can handle the step reliably.

    Lookup order: exact (step_type, complexity), then (step_type, "any"),
    then the default model.
    """
    routing_table = {
        # (step_type, complexity) → model
        ("tool_dispatch",    "simple"):   "gpt-4o-mini",
        ("tool_dispatch",    "complex"):  "gpt-4o",
        ("reasoning",        "simple"):   "gpt-4o-mini",
        ("reasoning",        "complex"):  "gpt-4o",
        ("planning",         "any"):      "o3-mini",     # planning benefits from reasoning model
        ("final_synthesis",  "simple"):   "gpt-4o",
        ("final_synthesis",  "complex"):  "o3",          # highest quality for final output
    }
    # Without the "any" fallback, ("planning", "simple") would never match
    # the ("planning", "any") entry and would silently use the default.
    for key in ((step_type, task_complexity), (step_type, "any")):
        if key in routing_table:
            return routing_table[key]
    return "gpt-4o"

Auto-scaling based on queue depth

Scale your agent workers horizontally based on Celery queue depth (or equivalent). For example: when queue depth is 0, scale down to the minimum worker count; when it rises above a threshold such as 10, scale up. This prevents both underprovisioning (slow tasks) and overprovisioning (wasted cost when idle).
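
The scaling rule above can be written as a small pure function that an autoscaler loop would call periodically. This is a sketch under stated assumptions: `desired_workers`, its bounds, and the "one extra worker per full multiple of the threshold" policy are illustrative choices, and reading the actual queue depth is broker-specific and left out.

```python
def desired_workers(
    queue_depth: int,
    current_workers: int,
    *,
    min_workers: int = 1,
    max_workers: int = 20,
    scale_up_depth: int = 10,
) -> int:
    """Return the target worker count for the observed queue depth."""
    if queue_depth < 0:
        raise ValueError("queue_depth must be non-negative")
    if queue_depth == 0:
        # Idle queue: fall back to the minimum to avoid paying for idle workers.
        return min_workers
    current = max(min_workers, min(current_workers, max_workers))
    if queue_depth > scale_up_depth:
        # ASSUMPTION: add one worker per full multiple of the threshold.
        extra = queue_depth // scale_up_depth
        return min(max_workers, current + extra)
    # Some work queued but below the threshold: hold steady.
    return current


# Usage: evaluate the rule for a few observed depths.
targets = [desired_workers(depth, 3) for depth in (0, 5, 25, 500)]
```

In practice, add a cooldown between scaling actions so that a briefly empty queue does not cause workers to be torn down and restarted repeatedly.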

Chapter 18 Quiz

1. Why is synchronous HTTP request handling usually insufficient for production agent tasks?

2. Prompt prefix caching reduces costs by:

3. When should you use an expensive reasoning model (o3) in your model routing strategy?