Chapter 18: Deployment & Scaling
Deployment & Scaling in Building Agentic AI Systems.
Learning Objectives
By the end of this chapter, you will be able to:
- Explain how deploying an agent differs from deploying a conventional request/response API.
- Apply stateless or stateful deployment, async task queues, and model routing to design reliable, production-grade agent systems.
- Recognize operational trade-offs in tool use, orchestration, safety, and cost.
Stateless vs stateful, async queues, model routing, cost optimization
Deploying Agents at Scale
An agent that works in a notebook is not production-ready. Production deployment requires decisions about state management, async execution, load balancing, caching, and cost. This chapter covers each.
Key difference from regular API deployment
A standard API request typically completes in milliseconds and is stateless. An agent task may take minutes, maintains state across many LLM and tool calls, and has unpredictable resource usage: a task requiring 3 tool calls consumes far less time and money than one requiring 25. Standard web server deployment patterns must therefore be augmented with async task queues, checkpointing, and fine-grained cost tracking.
Stateless vs Stateful Deployment
Stateless Agent
- No state persisted between requests
- Each request is a fresh start
- Horizontally scalable (any server handles any request)
- Simple; no session affinity needed
- Use for: single-turn, short-task agents
Stateful Agent
- State persisted in external store (Redis, DB)
- Requests in the same session share state
- Requires sticky routing OR session key lookup
- Supports resumable tasks, HITL, multi-session memory
- Use for: multi-turn assistants, long-running workflows
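The session-key lookup used by a stateful agent can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: `StateStore`, `InMemoryStore`, and `handle_turn` are hypothetical names, and the in-memory dictionary stands in for Redis or a database so that the sketch runs on its own.

```python
import json
from typing import Optional, Protocol


class StateStore(Protocol):
    """Minimal interface an external store (Redis, a DB) would need to offer."""

    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...


class InMemoryStore:
    """Stand-in for Redis so the sketch is self-contained; not for production."""

    def __init__(self) -> None:
        self._data: dict = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


def handle_turn(store: StateStore, session_id: str, user_message: str) -> dict:
    """Load session state by key, append the new turn, and persist it again.

    State lives in the store rather than in process memory, so any worker
    can serve any request for the session.
    """
    key = f"session:{session_id}"
    raw = store.get(key)
    state = json.loads(raw) if raw else {"messages": [], "step": 0}
    state["messages"].append(user_message)
    state["step"] += 1  # checkpoint counter: a resumed task continues from here
    store.set(key, json.dumps(state))
    return state
```

Because every request reloads state by session key, the workers themselves stay interchangeable, which is what lets a stateful agent scale horizontally without sticky routing.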
Async Processing with Task Queues
A user should not wait synchronously for a 3-minute agent task to complete. Use an async task queue: the API receives the request, creates a task, returns a task_id immediately, and the result is fetched later via polling or webhook.
import uuid

import redis
from celery import Celery
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

# agent_core is the agent built in earlier chapters; import it from wherever
# your project defines it.

celery_app = Celery("agent_tasks", broker="redis://localhost:6379/0", backend="redis://localhost:6379/1")
api = FastAPI()
# decode_responses=True makes Redis return str instead of bytes
r = redis.Redis(host="localhost", port=6379, db=2, decode_responses=True)

class AgentRequest(BaseModel):
    goal: str
    user_id: str

# ── Celery task: runs the agent loop asynchronously ──────────────────────────
@celery_app.task(bind=True, max_retries=3)
def run_agent_task(self, task_id: str, goal: str, user_id: str) -> dict:
    key = f"task:{task_id}"
    try:
        r.hset(key, mapping={"status": "running", "progress": "0"})
        result = str(agent_core.run(goal))  # your agent from earlier chapters
        # Redis hash values must be strings, bytes, or numbers, hence str() above
        r.hset(key, mapping={"status": "done", "result": result})
        return {"status": "done", "result": result}
    except Exception as exc:
        if self.request.retries >= self.max_retries:
            # Out of retries: only now is the task permanently failed
            r.hset(key, mapping={"status": "failed", "error": str(exc)})
            raise
        r.hset(key, mapping={"status": "retrying", "error": str(exc)})
        raise self.retry(exc=exc, countdown=60)

# ── API endpoints ─────────────────────────────────────────────────────────────
@api.post("/tasks")
def create_task(req: AgentRequest) -> dict:
    task_id = str(uuid.uuid4())
    # Record the task before enqueueing so an immediate status poll finds it
    r.hset(f"task:{task_id}", mapping={"status": "queued", "user_id": req.user_id})
    run_agent_task.delay(task_id, req.goal, req.user_id)
    return {"task_id": task_id, "status": "queued"}

@api.get("/tasks/{task_id}")
def get_task_status(task_id: str) -> dict:
    data = r.hgetall(f"task:{task_id}")
    if not data:
        raise HTTPException(status_code=404, detail="Task not found")
    return data
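On the client side, fetching the result by polling might look like the sketch below. It assumes a caller-supplied `fetch_status` function (hypothetical; in practice it would issue `GET /tasks/{task_id}` and return the parsed JSON) and backs off exponentially so that long tasks do not flood the API with status checks. The name `poll_until_done` and all default values are illustrative.

```python
import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], dict],
    initial_delay: float = 1.0,
    max_delay: float = 30.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll a task-status function with exponential backoff until it finishes."""
    delay = initial_delay
    waited = 0.0
    while True:
        status = fetch_status()
        # "done" and "failed" are the terminal states; anything else is pending
        if status.get("status") in ("done", "failed"):
            return status
        if waited >= timeout:
            raise TimeoutError(f"task still pending after {waited:.0f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)  # double the wait, up to a cap
```

The `sleep` parameter is injectable only so the loop can be exercised in tests without real waiting. A webhook avoids polling altogether, at the cost of requiring the client to expose a reachable endpoint.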
Cost Optimization
LLM API calls typically dominate an agent's infrastructure costs. Understanding where the money goes, per task and per step, is the first step to cutting it.
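One way to see where the money goes is to record token usage for every LLM call and aggregate it by task and step type. The sketch below is illustrative only: `CostTracker`, the model names, and the per-million-token prices are placeholder assumptions, not real vendor pricing.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Placeholder prices in USD per 1M tokens as (input, output).
# These are assumptions for illustration; use your provider's price list.
PRICES_PER_MILLION = {
    "small-model": (1.00, 2.00),
    "large-model": (10.00, 20.00),
}


@dataclass
class CostTracker:
    # task_id -> step_type -> accumulated cost in USD
    costs: dict = field(
        default_factory=lambda: defaultdict(lambda: defaultdict(float))
    )

    def record(self, task_id: str, step_type: str, model: str,
               input_tokens: int, output_tokens: int) -> float:
        """Add the cost of one LLM call to the task's running total."""
        in_price, out_price = PRICES_PER_MILLION[model]
        cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
        self.costs[task_id][step_type] += cost
        return cost

    def task_total(self, task_id: str) -> float:
        """Total spend for one task across all step types."""
        return sum(self.costs[task_id].values())

    def breakdown(self, task_id: str) -> dict:
        """Spend per step type, showing which steps are expensive."""
        return dict(self.costs[task_id])
```

A per-step breakdown of this kind shows which step types account for most of the spend, and therefore which ones are worth moving to a cheaper model.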
Model Routing Pattern
def select_model_for_step(step_type: str, task_complexity: str) -> str:
    """
    Route to the cheapest model that can handle the step reliably.
    """
    routing_table = {
        # (step_type, complexity) → model
        ("tool_dispatch", "simple"): "gpt-4o-mini",
        ("tool_dispatch", "complex"): "gpt-4o",
        ("reasoning", "simple"): "gpt-4o-mini",
        ("reasoning", "complex"): "gpt-4o",
        ("planning", "any"): "o3-mini",  # planning benefits from reasoning model
        ("final_synthesis", "simple"): "gpt-4o",
        ("final_synthesis", "complex"): "o3",  # highest quality for final output
    }
    # Try the exact (step_type, complexity) pair first, then the step's "any"
    # wildcard entry, and finally fall back to a capable default. Without the
    # wildcard lookup, ("planning", "simple") would never reach o3-mini.
    for key in ((step_type, task_complexity), (step_type, "any")):
        if key in routing_table:
            return routing_table[key]
    return "gpt-4o"
Auto-scaling based on queue depth
Scale your agent workers horizontally based on Celery queue depth (or an equivalent metric). For example, when the queue is empty, scale down to the minimum number of workers; when queue depth rises above a threshold such as 10, scale up. The exact thresholds depend on your workload. This prevents both underprovisioning (slow tasks) and overprovisioning (wasted cost when idle).
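The scaling rule can be sketched as a pure function that maps queue depth to a target worker count. The function name `desired_workers`, the worker bounds, and the `tasks_per_worker` ratio are assumptions chosen for illustration; only the thresholds of 0 and 10 come from the rule described above.

```python
import math


def desired_workers(
    queue_depth: int,
    current_workers: int,
    min_workers: int = 1,
    max_workers: int = 20,
    scale_up_threshold: int = 10,
    tasks_per_worker: int = 5,
) -> int:
    """Return the target worker count for the observed queue depth."""
    if queue_depth == 0:
        # Empty queue: fall back to the minimum to avoid paying for idle workers
        return min_workers
    if queue_depth > scale_up_threshold:
        # Backlog building: add enough workers to drain it, never scaling down here
        needed = math.ceil(queue_depth / tasks_per_worker)
        target = max(current_workers, needed)
    else:
        # Moderate load: hold steady
        target = current_workers
    return max(min_workers, min(max_workers, target))
```

With a Redis broker, the number of pending tasks in Celery's default queue can typically be read with `LLEN celery`. Note that an empty queue does not mean workers are idle: a production autoscaler would also check in-flight tasks before removing workers, so that long-running agent tasks are not interrupted.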
Chapter 18 Quiz
1. Why is synchronous HTTP request handling usually insufficient for production agent tasks?
2. Prompt prefix caching reduces costs by:
3. When should you use an expensive reasoning model (o3) in your model routing strategy?