Chapter 17: Observability & Debugging
Observability & Debugging in Building Agentic AI Systems.
Learning Objectives
By the end of this chapter, you will be able to:
- Explain why agent systems need an observability layer beyond traditional application monitoring.
- Apply tracing, cost attribution, and trajectory debugging to design reliable, production-grade agent systems.
- Recognize operational trade-offs in tool use, orchestration, safety, and cost.
Distributed tracing, spans, cost attribution, and debugging failed trajectories
Why Traditional App Monitoring is Insufficient
Standard APM tools (Datadog, New Relic) track HTTP latency, error rates, and CPU usage. These metrics tell you that your agent is slow, but not why. Was it a slow LLM response? A tool API timeout? An infinite reasoning loop? Thirty redundant tool calls? Answering those questions requires a domain-specific observability layer that understands the structure of agent execution.
Traditional APM
- HTTP response times
- Error counts and stack traces
- CPU / memory usage
- Database query latencies
- No concept of "reasoning steps"
Agent Observability
- Per-step LLM latency and token counts
- Tool call success/failure rates
- Reasoning quality (LLM-graded)
- Cost per step, per task, per user
- Full trajectory replay for debugging
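Cost per step, per task, and per user can all be derived from the same per-step token counts. The following is a minimal sketch, not a reference implementation: the `StepRecord` fields, the `attribute_costs` helper, and the prices in `PRICE_PER_1K` are all hypothetical and should be replaced with your own schema and your provider's actual rates.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical per-1K-token prices in USD; substitute your provider's real rates.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}


@dataclass
class StepRecord:
    """One LLM call within a task (illustrative schema)."""
    user_id: str
    task_id: str
    step: int
    input_tokens: int
    output_tokens: int


def step_cost(rec: StepRecord) -> float:
    """Cost of a single step in USD under the assumed prices."""
    return (
        rec.input_tokens / 1000 * PRICE_PER_1K["input"]
        + rec.output_tokens / 1000 * PRICE_PER_1K["output"]
    )


def attribute_costs(records):
    """Roll step costs up into per-task and per-user totals."""
    per_task = defaultdict(float)
    per_user = defaultdict(float)
    for rec in records:
        cost = step_cost(rec)
        per_task[rec.task_id] += cost
        per_user[rec.user_id] += cost
    return dict(per_task), dict(per_user)


records = [
    StepRecord("alice", "t1", 1, 1000, 200),
    StepRecord("alice", "t1", 2, 2000, 100),
    StepRecord("bob", "t2", 1, 500, 500),
]
per_task, per_user = attribute_costs(records)
```

Keeping the raw step records and aggregating on read means the same data can later be grouped by any other dimension, such as model or tool, without re-instrumenting the agent.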
Traces, Spans, and Events
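Agent observability follows the same distributed tracing model as microservices, but the "services" are LLM calls, tool calls, and reasoning steps instead of HTTP requests. A trace records one task end to end. Each span is a timed unit of work within that trace (an LLM call, a tool call, a reasoning step) and can nest under a parent span. Events are timestamped annotations attached to a span, such as a retry or a token count.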
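The trace, span, and event model can be sketched with the standard library alone. This is an illustrative toy, not the API of any tracing platform: the `Span` and `Tracer` classes, the `kind` values, and the event payloads are all assumptions made for the example.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One timed unit of work inside a trace (illustrative schema)."""
    name: str
    kind: str                      # e.g. "llm", "tool", or "reasoning"
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    start: float = 0.0
    end: float = 0.0
    events: list = field(default_factory=list)


class Tracer:
    """Collects every span belonging to one agent task, i.e. one trace."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []
        self._stack = []           # currently open spans, innermost last

    @contextmanager
    def span(self, name, kind):
        parent = self._stack[-1].span_id if self._stack else None
        s = Span(name=name, kind=kind, trace_id=self.trace_id, parent_id=parent)
        s.start = time.perf_counter()
        self.spans.append(s)
        self._stack.append(s)
        try:
            yield s
        finally:
            # Close the span even if the wrapped work raised.
            s.end = time.perf_counter()
            self._stack.pop()


tracer = Tracer()
with tracer.span("answer_question", "reasoning"):
    with tracer.span("plan", "llm") as llm:
        llm.events.append({"name": "tokens", "input": 812, "output": 64})
    with tracer.span("web_search", "tool") as tool:
        tool.events.append({"name": "retry", "attempt": 2})
```

After this runs, `tracer.spans` holds three spans sharing one `trace_id`, with the LLM and tool spans pointing at the root span through `parent_id`. That parent/child structure is what a trajectory viewer renders as a tree.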
Agent-Specific Signals to Monitor
| Signal | What it indicates when high | Action |
|---|---|---|
| Steps per task (P95) | Agent is looping or inefficient | Check for state change failures; add loop detection |
| Tool retry rate | API instability or rate limits | Add back-off; check upstream SLAs |
| Cost per task (P95) | Context growing unbounded | Add context eviction; check for large tool results |
| LLM call latency (P99) | Model overload or large context | Switch to faster model for cheap steps; reduce context |
| Task failure rate | Model degradation or prompt regression | Run regression eval; check if prompt changed |
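Several of the signals in the table reduce to simple aggregates over per-task summaries. The sketch below assumes a hypothetical summary format (`steps`, `tool_calls`, `tool_retries`, `failed`) and uses the nearest-rank definition of a percentile; a monitoring backend may use a different interpolation method.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical per-task summaries, e.g. exported from a tracing backend.
tasks = [
    {"steps": s, "tool_calls": c, "tool_retries": r, "failed": f}
    for s, c, r, f in [
        (3, 2, 0, False), (4, 3, 0, False), (4, 3, 1, False), (5, 4, 0, False),
        (5, 4, 0, False), (6, 5, 1, False), (6, 4, 0, False), (7, 6, 0, False),
        (8, 6, 2, True), (42, 40, 12, True),
    ]
]

steps_p50 = percentile([t["steps"] for t in tasks], 50)
steps_p95 = percentile([t["steps"] for t in tasks], 95)
retry_rate = sum(t["tool_retries"] for t in tasks) / sum(t["tool_calls"] for t in tasks)
failure_rate = sum(t["failed"] for t in tasks) / len(tasks)
```

In this sample the median task takes 5 steps while the P95 is 42. A single looping task is invisible in the median but dominates the tail, which is why the table tracks tail percentiles rather than averages.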
Observability Platforms
| Platform | Strength | Best For |
|---|---|---|
| LangSmith | Deep LangChain/LangGraph integration; trajectory viewer with step-through replay | Teams using LangGraph in production |
| Langfuse | Open-source, self-hostable; cost tracking, prompt versioning | Cost-sensitive teams; privacy requirements |
| Arize Phoenix | LLM-native evaluation + monitoring; UMAP visualization of embedding clusters | Teams doing systematic evals alongside monitoring |
| Helicone | Simple drop-in proxy; caching, rate limiting, cost tracking with zero code changes | Quick observability without refactoring |
| Braintrust | Eval + tracing in one platform; collaborative prompt debugging | Teams iterating rapidly on prompts |
```python
import os

from langsmith import traceable

# Set env vars before the app runs; all LangChain/LangGraph calls are then traced automatically
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "my-agent-production"
os.environ["LANGCHAIN_API_KEY"] = "ls__..."  # from LangSmith dashboard

# That's it: your existing LangGraph app now sends traces to LangSmith.
# `app` is your compiled graph and `config` its run config, both defined elsewhere.
result = app.invoke({"messages": [...]}, config=config)


# For non-LangChain code, use the @traceable decorator
@traceable(name="web_search_tool", run_type="tool")
def search_web(query: str) -> str:
    # Placeholder body; replace with the real tool implementation
    result = f"results for {query}"
    return result
```
Debugging Failed Trajectories
A failed agent trajectory is a sequence of (Thought, Action, Observation) steps that ends in a wrong answer, an error, or a timeout. The debugging process mirrors debugging regular code — but the "call stack" is a reasoning chain.
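One way to treat a trajectory like a call stack is to scan it for the earliest step that went wrong, rather than starting from the final answer. The sketch below is an assumption-laden illustration: the `Step` fields mirror the (Thought, Action, Observation) structure, and it assumes a hypothetical convention in which failed tool calls return an observation beginning with "Error".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    thought: str
    action: str          # tool name
    action_input: str
    observation: str


def first_error_step(trajectory) -> Optional[int]:
    """Index of the first step whose observation looks like an error, else None."""
    for i, step in enumerate(trajectory):
        if step.observation.lower().startswith("error"):
            return i
    return None


def first_repeated_action(trajectory) -> Optional[int]:
    """Index of the first step that repeats an earlier (action, input) pair, else None."""
    seen = set()
    for i, step in enumerate(trajectory):
        key = (step.action, step.action_input)
        if key in seen:
            return i
        seen.add(key)
    return None


trajectory = [
    Step("Find the city", "geocode", "Paris", "48.85, 2.35"),
    Step("Get the forecast", "get_forecast", "48.85, 2.35", "Error: upstream timeout"),
    Step("Try again", "get_forecast", "48.85, 2.35", "Error: upstream timeout"),
    Step("Give up", "final_answer", "I could not find the forecast.", ""),
]

error_at = first_error_step(trajectory)
loop_at = first_repeated_action(trajectory)
```

Here the first error appears at step index 1 and the first exact repeat at index 2, which suggests the agent retried an identical call instead of changing its approach. Exact-match repetition is a crude loop signal; agents that rephrase their inputs slightly will slip past it.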
Detecting runaway costs
Set a per-session cost alert at 2–3× your expected average task cost. If a session exceeds the threshold, automatically terminate it and flag it for manual review. Most runaway costs trace to one of three causes: (1) infinite loops that accumulate tool-result tokens, (2) very large document ingestion without chunk limits, or (3) recursive agent spawning without depth limits.
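The threshold rule above can be enforced with a small guard that every step charges against. This is a minimal sketch: `SessionCostGuard`, `BudgetExceeded`, and the dollar amounts are hypothetical, and real termination and review-queue logic would replace the `pass` in the handler.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a session spends more than its allowed budget."""


class SessionCostGuard:
    def __init__(self, expected_cost_usd: float, multiplier: float = 3.0):
        # Limit is a multiple of the expected average task cost (2-3x per the text).
        self.limit = expected_cost_usd * multiplier
        self.spent = 0.0
        self.flagged = False

    def charge(self, cost_usd: float) -> None:
        """Record a step's cost; raise once the running total exceeds the limit."""
        self.spent += cost_usd
        if self.spent > self.limit:
            self.flagged = True
            raise BudgetExceeded(
                f"session spent ${self.spent:.2f}, limit is ${self.limit:.2f}"
            )


guard = SessionCostGuard(expected_cost_usd=0.10, multiplier=3.0)
steps_completed = 0
try:
    for cost in [0.12, 0.12, 0.12, 0.12]:   # hypothetical per-step costs
        guard.charge(cost)
        steps_completed += 1
except BudgetExceeded:
    pass  # terminate the session here and queue it for manual review
```

With an expected cost of $0.10 and a 3× multiplier, the third $0.12 step pushes the total past the limit, so the loop stops after two completed steps and the session is marked as flagged. Charging after each step means the session can overshoot the limit by at most one step's cost.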
Chapter 17 Quiz
1. What is a "trace" in the context of agent observability?
2. A high "steps per task (P95)" metric indicates which problem?
3. What is the correct first step when debugging a failed agent trajectory?