Chapter 17: Observability & Debugging

Why Traditional App Monitoring is Insufficient

Standard APM tools (Datadog, New Relic) track HTTP latency, error rates, and CPU usage. These metrics tell you that your agent is slow, but not why. Was it a slow LLM response? A tool API timeout? An infinite reasoning loop? Thirty redundant tool calls? Agent observability requires a domain-specific layer that understands the structure of agent execution.

Traditional APM

HTTP response times
Error counts and stack traces
CPU / memory usage
Database query latencies
No concept of "reasoning steps"

Agent Observability

Per-step LLM latency and token counts
Tool call success/failure rates
Reasoning quality (LLM-graded)
Cost per step, per task, per user
Full trajectory replay for debugging

Traces, Spans, and Events

Agent observability follows the same distributed tracing model as microservices — but the "services" are LLM calls, tool calls, and reasoning steps instead of HTTP requests.

Trace

Full Task Execution (trace_id: abc123)
Root span: user goal → final answer · Duration: 12.4s · Cost: $0.023

Spans

LLM Call (step 1)
Model: gpt-4o · Tokens: 1,200 in + 350 out · Latency: 1.8s

Tool: search_web
Query: "latest news on X" · Status: success · Latency: 0.9s

LLM Call (step 2)
Model: gpt-4o · Tokens: 2,100 in + 480 out · Latency: 2.1s

Events

Tool retry
Event logged on retry · Reason: RateLimitError · Retry #1 succeeded
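
The trace, span, and event hierarchy in the example above can be sketched with plain dataclasses. This is only an illustration under stated assumptions, not the schema of any tracing platform: the class names (Trace, Span, Event), the "llm"/"tool" kind labels, and the attribute keys are all hypothetical, and the values are copied from the example spans.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """A point-in-time occurrence inside a span, such as a retry."""
    name: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Span:
    """One unit of work inside a trace: an LLM call or a tool call."""
    name: str
    kind: str  # "llm" or "tool" (hypothetical labels)
    latency_s: float
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)


@dataclass
class Trace:
    """A full task execution, from user goal to final answer."""
    trace_id: str
    spans: list = field(default_factory=list)

    def latency_by_kind(self) -> dict:
        """Sum span latency per kind, to see where the time went."""
        totals = {}
        for span in self.spans:
            totals[span.kind] = totals.get(span.kind, 0.0) + span.latency_s
        return totals

    def total_tokens(self) -> int:
        """Sum input and output tokens over all spans that report them."""
        total = 0
        for span in self.spans:
            total += span.attributes.get("tokens_in", 0)
            total += span.attributes.get("tokens_out", 0)
        return total


# Rebuild the example trace from the text.
trace = Trace(trace_id="abc123")
trace.spans.append(Span("LLM Call (step 1)", "llm", 1.8,
                        {"model": "gpt-4o", "tokens_in": 1200, "tokens_out": 350}))

search = Span("search_web", "tool", 0.9,
              {"query": "latest news on X", "status": "success"})
search.events.append(Event("tool_retry", {"reason": "RateLimitError", "retry": 1}))
trace.spans.append(search)

trace.spans.append(Span("LLM Call (step 2)", "llm", 2.1,
                        {"model": "gpt-4o", "tokens_in": 2100, "tokens_out": 480}))

print(trace.latency_by_kind())
print(trace.total_tokens())
```

Grouping latency by span kind is what lets you answer the "slow LLM or slow tool?" question that plain HTTP metrics cannot.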

Agent-Specific Signals to Monitor

Signal	What it indicates when high	Action
Steps per task (P95)	Agent is looping or inefficient	Check for state change failures; add loop detection
Tool retry rate	API instability or rate limits	Add back-off; check upstream SLAs
Cost per task (P95)	Context growing unbounded	Add context eviction; check for large tool results
LLM call latency (P99)	Model overload or large context	Switch to faster model for cheap steps; reduce context
Task failure rate	Model degradation or prompt regression	Run regression eval; check if prompt changed
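
As a rough sketch of how some of the signals in the table could be computed from per-task records, the snippet below uses a nearest-rank percentile. The record fields (steps, cost_usd, tool_calls, tool_retries, succeeded) and the sample data are invented for illustration; a real system would read these from its trace store.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(tasks):
    """Aggregate per-task records into a few of the monitoring signals."""
    tool_calls = sum(t["tool_calls"] for t in tasks)
    retries = sum(t["tool_retries"] for t in tasks)
    failures = sum(1 for t in tasks if not t["succeeded"])
    return {
        "steps_p95": percentile([t["steps"] for t in tasks], 95),
        "cost_p95": percentile([t["cost_usd"] for t in tasks], 95),
        "tool_retry_rate": retries / tool_calls if tool_calls else 0.0,
        "task_failure_rate": failures / len(tasks),
    }


# Hypothetical records: nine ordinary tasks and one that looped for 30 steps.
step_counts = [3, 4, 4, 5, 5, 5, 6, 6, 7, 30]
tasks = [
    {
        "steps": s,
        "cost_usd": round(s * 0.004, 3),
        "tool_calls": s - 1,
        "tool_retries": 1 if s > 6 else 0,
        "succeeded": s < 30,
    }
    for s in step_counts
]

signals = summarize(tasks)
print(signals)
```

With only ten tasks the P95 lands on the looping outlier, which is the point of tracking tail percentiles rather than averages: the mean of 7.5 steps would hide it.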

Observability Platforms

Platform	Strength	Best For
LangSmith	Deep LangChain/LangGraph integration; trajectory viewer with step-through replay	Teams using LangGraph in production
LangFuse	Open-source, self-hostable; cost tracking, prompt versioning	Cost-sensitive teams; privacy requirements
Arize Phoenix	LLM-native evaluation + monitoring; UMAP visualization of embedding clusters	Teams doing systematic evals alongside monitoring
Helicone	Simple drop-in proxy; caching, rate limiting, cost tracking with zero code changes	Quick observability without refactoring
Braintrust	Eval + tracing in one platform; collaborative prompt debugging	Teams iterating rapidly on prompts

python — LangSmith tracing (zero-code integration)

import os

# Set env vars — all LangChain/LangGraph calls automatically traced
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "my-agent-production"
os.environ["LANGCHAIN_API_KEY"] = "ls__..."   # from LangSmith dashboard

# That's it: your existing LangGraph app now sends traces to LangSmith.
# `app` (the compiled graph) and `config` are assumed to be defined elsewhere.
result = app.invoke({"messages": [...]}, config=config)

# For non-LangChain code, use the @traceable decorator
from langsmith import traceable

@traceable(name="web_search_tool", run_type="tool")
def search_web(query: str) -> str:
    # ... tool implementation (elided); it must assign `result` before returning
    return result

Debugging Failed Trajectories

A failed agent trajectory is a sequence of (Thought, Action, Observation) steps that ends in a wrong answer, an error, or a timeout. The debugging process mirrors debugging regular code — but the "call stack" is a reasoning chain.

1. Identify the failing trajectory. From your eval harness or production monitoring: task_id, session_id, timestamp.
2. Retrieve the full trace. Open the trajectory in your tracing tool; you need every (Thought, Action, Observation) turn.
3. Find the first wrong step. Work backward from the failure: what was the last correct state? What action caused the deviation?
4. Classify the failure mode. Wrong tool selected? Malformed arguments? Misread tool result? Context overflow? Classification determines the fix.
5. Fix and validate. Adjust system prompt, tool description, context management, or eviction policy; run regression suite to verify the fix doesn't break other tasks.
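
Steps 3 and 4 can be partly automated with simple rules. The sketch below scans a trajectory in order and returns the first step that trips a rule, which is the same step you would reach by working backward to the last correct state. Everything here is an assumption for illustration: the step dictionary layout, the TOOL_SCHEMAS registry, the CONTEXT_LIMIT value, and the failure labels. Rules like these cannot detect a misread tool result; that still needs a human or an LLM grader.

```python
# Hypothetical registry: tool name -> set of required argument names.
TOOL_SCHEMAS = {"search_web": {"query"}, "read_file": {"path"}}
CONTEXT_LIMIT = 8000  # hypothetical token budget for one LLM call


def classify_step(step):
    """Return a failure label for one step, or None if no rule fires."""
    tool = step["action"]["tool"]
    args = step["action"]["args"]
    if tool not in TOOL_SCHEMAS:
        return "unknown_tool"
    if not TOOL_SCHEMAS[tool] <= set(args):  # a required argument is missing
        return "malformed_arguments"
    if step.get("context_tokens", 0) > CONTEXT_LIMIT:
        return "context_overflow"
    if step["observation"].get("error"):
        return "tool_error"
    return None


def first_failing_step(trajectory):
    """Return (index, label) for the earliest step that trips a rule, else None."""
    for index, step in enumerate(trajectory):
        label = classify_step(step)
        if label is not None:
            return index, label
    return None


# A made-up three-step trajectory: step 1 passes the wrong argument name.
trajectory = [
    {"thought": "Search for the topic.",
     "action": {"tool": "search_web", "args": {"query": "latest news on X"}},
     "observation": {"error": None}, "context_tokens": 1200},
    {"thought": "Search again with a narrower query.",
     "action": {"tool": "search_web", "args": {"q": "X this week"}},
     "observation": {"error": "missing argument: query"}, "context_tokens": 2100},
    {"thought": "Read the cached file instead.",
     "action": {"tool": "read_file", "args": {"path": "notes.txt"}},
     "observation": {"error": "file not found"}, "context_tokens": 2600},
]

print(first_failing_step(trajectory))
```

The later tool error at step 2 is a downstream symptom; reporting only the earliest failing step keeps attention on the action that caused the deviation.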

Detecting runaway costs

Set a per-session cost alert at 2–3× your expected average task cost. If a session exceeds the threshold, automatically terminate it and flag it for manual review. Most runaway costs trace to: (1) infinite loops accumulating tool result tokens, (2) very large document ingestion without chunk limits, or (3) recursive agent spawning without depth limits.
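
One way to enforce such a threshold is a small guard object that every step reports its cost to. This is a minimal sketch, assuming costs are reported in USD after each step; SessionCostGuard, BudgetExceeded, and the depth parameter are hypothetical names, not part of any library.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a session should be terminated and reviewed."""


class SessionCostGuard:
    def __init__(self, expected_cost_usd, multiplier=3.0, max_depth=3):
        # Threshold is a multiple of the expected average task cost.
        self.limit_usd = expected_cost_usd * multiplier
        self.max_depth = max_depth
        self.spent_usd = 0.0
        self.flagged = False  # set when the session needs manual review

    def record(self, cost_usd, depth=0):
        """Add one step's cost; raise if the budget or spawn depth is exceeded."""
        self.spent_usd += cost_usd
        if depth > self.max_depth:
            self.flagged = True
            raise BudgetExceeded(f"agent depth {depth} exceeds limit {self.max_depth}")
        if self.spent_usd > self.limit_usd:
            self.flagged = True
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.3f}, limit ${self.limit_usd:.3f}"
            )


# Simulated session: expected cost $0.02, alert at 2.5x, each step costs $0.02.
guard = SessionCostGuard(expected_cost_usd=0.02, multiplier=2.5)
completed_steps = 0
termination_reason = None
try:
    for _ in range(10):
        guard.record(0.02)
        completed_steps += 1
except BudgetExceeded as exc:
    termination_reason = str(exc)

print(completed_steps, guard.flagged, termination_reason)
```

Raising an exception, rather than returning a flag, makes it hard for a looping agent to ignore the limit; the caller that catches it is the natural place to log the session for review.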
