Learning Objectives

By the end of this chapter, you will be able to:

Explain the agentic AI concept behind Deployment & Scaling.
Apply Deployment & Scaling to design reliable, production-grade agent systems.
Recognize operational trade-offs in tool use, orchestration, safety, and cost.

Section 4 — Production Engineering

Chapter 18: Deployment & Scaling

Stateless vs stateful, async queues, model routing, cost optimization

Deploying Agents at Scale

An agent that works in a notebook is not production-ready. Production deployment requires decisions about state management, async execution, load balancing, caching, and cost. This chapter covers each.

Key difference from regular API deployment

A typical API request completes in milliseconds and is stateless. An agent task may take minutes, maintain state across many LLM and tool calls, and consume resources unpredictably (a task requiring 3 tool calls costs far less than one requiring 25). Standard web-server deployment patterns must therefore be augmented with async task queues, checkpointing, and fine-grained cost tracking.
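
Fine-grained cost tracking can start as a per-task ledger of token usage. The sketch below is a minimal illustration: `TaskCostTracker`, the model names, and the prices are all hypothetical placeholders, not values from any provider's price list.

```python
from dataclasses import dataclass, field

# Placeholder prices in USD per 1M tokens (assumptions for illustration only).
PRICES_PER_MILLION = {
    "small-model": {"input": 0.50, "output": 2.00},
    "large-model": {"input": 5.00, "output": 20.00},
}

@dataclass
class TaskCostTracker:
    """Accumulates the cost of every LLM call made by one agent task."""
    task_id: str
    calls: list = field(default_factory=list)

    def record(self, step: str, model: str, input_tokens: int, output_tokens: int) -> float:
        """Record one LLM call and return its cost in USD."""
        price = PRICES_PER_MILLION[model]
        cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
        self.calls.append({"step": step, "model": model, "cost": cost})
        return cost

    def total(self) -> float:
        return sum(call["cost"] for call in self.calls)

    def by_step(self) -> dict:
        """Sum cost per step type, to show where the money goes."""
        totals: dict = {}
        for call in self.calls:
            totals[call["step"]] = totals.get(call["step"], 0.0) + call["cost"]
        return totals

tracker = TaskCostTracker(task_id="demo")
tracker.record("planning", "large-model", input_tokens=2_000, output_tokens=500)
tracker.record("tool_dispatch", "small-model", input_tokens=4_000, output_tokens=100)
print(tracker.by_step())
```

Recording cost per step rather than per request is what makes the 3-call task and the 25-call task distinguishable in your billing data.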

Stateless vs Stateful Deployment

Stateless Agent

No state persisted between requests
Each request is a fresh start
Horizontally scalable (any server handles any request)
Simple; no session affinity needed
Use for: single-turn, short-task agents

Stateful Agent

State persisted in external store (Redis, DB)
Requests in the same session share state
Requires sticky routing OR session key lookup
Supports resumable tasks, HITL, multi-session memory
Use for: multi-turn assistants, long-running workflows
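
The session key lookup used by a stateful agent can be sketched as follows. `InMemoryStateStore` and `handle_turn` are hypothetical names, and the in-memory dictionary only stands in for an external store such as Redis or a database so that the example runs on its own.

```python
import json
from typing import Optional

class InMemoryStateStore:
    """Stand-in for Redis or a database: maps a session key to serialized state."""

    def __init__(self) -> None:
        self._data: dict = {}

    def load(self, session_id: str) -> Optional[dict]:
        raw = self._data.get(session_id)
        return json.loads(raw) if raw is not None else None

    def save(self, session_id: str, state: dict) -> None:
        # Serialize on write, as you would before storing in an external system.
        self._data[session_id] = json.dumps(state)

def handle_turn(store: InMemoryStateStore, session_id: str, user_message: str) -> dict:
    """Stateful handler: load state by session key, update it, write it back."""
    state = store.load(session_id) or {"history": [], "step": 0}
    state["history"].append(user_message)
    state["step"] += 1
    store.save(session_id, state)
    return state

store = InMemoryStateStore()
handle_turn(store, "session-1", "Find flights to Oslo")
# With a shared external store, a different worker could serve this second
# request, because the state is not held in any one worker's memory.
state = handle_turn(store, "session-1", "Only direct flights")
print(state["step"], state["history"])
```

Because the worker keeps nothing between calls, the workers themselves stay horizontally scalable even though the agent is stateful.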

Request

API Gateway / Load Balancer: routes by session_id for stateful agents; round-robin for stateless agents

Workers

Agent Worker A: runs the agent loop; reads/writes the state store
Agent Worker B: runs the agent loop; reads/writes the state store
Agent Worker N…: auto-scaled based on queue depth

State

Redis / DynamoDB: session state, checkpoints, working memory
Vector DB: long-term semantic memory
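
The gateway's two routing modes can be sketched in a few lines. This is an illustration, not a production load balancer: the worker names are hypothetical, and hash-modulo routing is only one simple way to make a session stick to a worker.

```python
import hashlib
from itertools import cycle

WORKERS = ["worker-a", "worker-b", "worker-c"]  # hypothetical worker names

def route_stateful(session_id: str, workers: list) -> str:
    """Sticky routing: the same session_id always maps to the same worker."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(workers)
    return workers[index]

_round_robin = cycle(WORKERS)

def route_stateless() -> str:
    """Round-robin routing: any worker can serve any request, so take turns."""
    return next(_round_robin)
```

Note that hash-modulo routing remaps most sessions whenever the worker count changes. When all session state lives in the external store, the gateway can skip stickiness and let any worker look the session up by key.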

Async Processing with Task Queues

A user should not wait synchronously for a 3-minute agent task to complete. Use an async task queue: the API receives the request, creates a task, returns a task_id immediately, and the result is fetched later via polling or webhook.

python — async agent task with Celery

import json
import uuid

import redis
from celery import Celery
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

import agent_core  # placeholder module: the agent you built in earlier chapters

celery_app = Celery("agent_tasks", broker="redis://localhost:6379/0", backend="redis://localhost:6379/1")
api = FastAPI()
# decode_responses=True makes redis-py return str instead of bytes
r = redis.Redis(host="localhost", port=6379, db=2, decode_responses=True)

class AgentRequest(BaseModel):
    goal: str
    user_id: str

# ── Celery task: runs the agent loop asynchronously ──────────────────────────

@celery_app.task(bind=True, max_retries=3)
def run_agent_task(self, task_id: str, goal: str, user_id: str) -> dict:
    key = f"task:{task_id}"
    try:
        r.hset(key, mapping={"status": "running", "progress": "0"})
        result = agent_core.run(goal)
        # Redis hash values must be strings or numbers, so serialize the result
        r.hset(key, mapping={"status": "done", "result": json.dumps(result, default=str)})
        return {"status": "done", "result": result}
    except Exception as exc:
        if self.request.retries >= self.max_retries:
            # Out of retries: only now is the task really failed
            r.hset(key, mapping={"status": "failed", "error": str(exc)})
            raise
        r.hset(key, mapping={"status": "retrying", "error": str(exc)})
        raise self.retry(exc=exc, countdown=60)  # try again in 60 seconds

# ── API endpoints ─────────────────────────────────────────────────────────────

@api.post("/tasks")
def create_task(req: AgentRequest) -> dict:
    task_id = str(uuid.uuid4())
    # Record the task before enqueueing so an immediate status poll finds it
    r.hset(f"task:{task_id}", mapping={"status": "queued", "user_id": req.user_id})
    run_agent_task.delay(task_id, req.goal, req.user_id)
    return {"task_id": task_id, "status": "queued"}

@api.get("/tasks/{task_id}")
def get_task_status(task_id: str) -> dict:
    data = r.hgetall(f"task:{task_id}")
    if not data:
        raise HTTPException(status_code=404, detail="Task not found")
    return data
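
On the client side, fetching the result by polling might look like the sketch below. `wait_for_task` and its parameters are hypothetical; `fetch_status` stands for whatever function calls GET /tasks/{task_id} and returns the parsed JSON. The demo swaps in a fake status source so the example runs without a server.

```python
import time
from typing import Callable

def wait_for_task(
    fetch_status: Callable[[str], dict],
    task_id: str,
    timeout_s: float = 300.0,
    initial_delay_s: float = 1.0,
    max_delay_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the task reaches a terminal status or the time budget runs out."""
    waited = 0.0  # counts time spent sleeping, not time spent in requests
    delay = initial_delay_s
    while True:
        status = fetch_status(task_id)
        if status.get("status") in ("done", "failed"):
            return status
        if waited + delay > timeout_s:
            raise TimeoutError(f"Task {task_id} did not finish within {timeout_s}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay_s)  # exponential backoff, capped

# Demo with a fake status source and a no-op sleep instead of real HTTP calls.
_responses = iter([
    {"status": "queued"},
    {"status": "running"},
    {"status": "done", "result": "42"},
])
final = wait_for_task(lambda _task_id: next(_responses), "demo-task", sleep=lambda _s: None)
print(final)
```

Backing off between polls keeps a long-running task from generating hundreds of status requests. A webhook removes polling entirely, at the cost of requiring the client to expose an endpoint.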

Cost Optimization

LLM costs dominate agent infrastructure costs. Understanding where money goes is the first step to cutting it.

1. Input tokens (usually the largest cost): system prompt + tool schemas + conversation history + memory retrievals. Reduce by: shorter system prompt, fewer tools per turn, aggressive context eviction.
2. Output tokens: the model's reasoning + action + answer. Reduce by: structured output (JSON rather than prose) and instructing the model to be concise.
3. Model tier: o3 costs on the order of 20× more per token than gpt-4o-mini; the exact ratio depends on current pricing. Most agent steps are simple (dispatch a tool, format a result), so use the cheapest model that can handle each step.
4. Prompt caching: Anthropic and OpenAI offer prompt prefix caching, where the system prompt and static context are cached at the API level. Cached prefix tokens are billed at a steep discount, up to ~90% depending on provider and model.
5. Semantic caching: cache agent responses by query similarity. If the same question was answered recently with high confidence, return the cached result. Requires a TTL and a freshness heuristic.
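
Semantic caching with a TTL can be sketched as follows. Everything here is illustrative: `SemanticCache` is a hypothetical class, and `toy_embed` is a bag-of-words stand-in for a real embedding model, used only so the example runs without external services.

```python
import math
import time
from collections import Counter
from typing import Optional

def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.9, ttl_s: float = 3600.0, clock=time.time):
        self.threshold = threshold  # minimum similarity to count as a hit
        self.ttl_s = ttl_s          # entries older than this are discarded
        self.clock = clock          # injectable so expiry can be tested
        self.entries = []           # list of (embedding, answer, stored_at)

    def put(self, query: str, answer: str) -> None:
        self.entries.append((toy_embed(query), answer, self.clock()))

    def get(self, query: str) -> Optional[str]:
        now = self.clock()
        # Drop expired entries first: the TTL part of the freshness check.
        self.entries = [e for e in self.entries if now - e[2] <= self.ttl_s]
        q = toy_embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

cache = SemanticCache(threshold=0.8, ttl_s=600.0)
cache.put("what is the refund policy", "Refunds are accepted within 30 days.")
print(cache.get("What is the REFUND policy"))
print(cache.get("how do I reset my password"))
```

The linear scan is fine for a sketch; at scale the lookup would typically go through a vector index. The threshold is the main tuning knob: too low and users receive answers to questions they did not ask.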

Model Routing Pattern

python — route steps to cheapest viable model

def select_model_for_step(step_type: str, task_complexity: str) -> str:
    """
    Route to the cheapest model that can handle the step reliably.
    """
    routing_table = {
        # (step_type, complexity) → model
        ("tool_dispatch",    "simple"):   "gpt-4o-mini",
        ("tool_dispatch",    "complex"):  "gpt-4o",
        ("reasoning",        "simple"):   "gpt-4o-mini",
        ("reasoning",        "complex"):  "gpt-4o",
        ("planning",         "any"):      "o3-mini",     # planning benefits from reasoning model
        ("final_synthesis",  "simple"):   "gpt-4o",
        ("final_synthesis",  "complex"):  "o3",          # highest quality for final output
    }
    # Try the exact pair first, then the step's "any" wildcard (otherwise
    # ("planning", "simple") would never match), then a general-purpose default.
    for key in ((step_type, task_complexity), (step_type, "any")):
        if key in routing_table:
            return routing_table[key]
    return "gpt-4o"

Auto-scaling based on queue depth

Scale your agent workers horizontally based on Celery queue depth (or an equivalent metric). When the queue is empty, scale down to the minimum worker count; when depth rises above a threshold (for example, 10 pending tasks), scale up. This avoids both underprovisioning (tasks wait in the queue) and overprovisioning (idle workers that still cost money).
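
The scaling rule can be written as a small decision function. This is a sketch under assumptions: `desired_workers`, the worker limits, and the tasks-per-worker ratio are hypothetical, and a real autoscaler would read the depth from the broker and apply the result through the platform's scaling API.

```python
import math

def desired_workers(
    queue_depth: int,
    current: int,
    min_workers: int = 1,
    max_workers: int = 20,
    scale_up_depth: int = 10,
    tasks_per_worker: int = 5,
) -> int:
    """Return the target worker count for the observed queue depth."""
    if queue_depth == 0:
        return min_workers  # idle: shrink to the floor
    if queue_depth > scale_up_depth:
        # Enough workers to drain the backlog at tasks_per_worker each,
        # never fewer than are already running, never above the ceiling.
        target = math.ceil(queue_depth / tasks_per_worker)
        return max(min_workers, min(max_workers, max(current, target)))
    # Between empty and the threshold: hold steady within the limits.
    return max(min_workers, min(max_workers, current))

for depth in (0, 5, 50, 500):
    print(depth, desired_workers(depth, current=3))
```

In practice you would also add a cooldown between scaling actions, so that a queue hovering around the threshold does not cause workers to start and stop repeatedly.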