Course: Building Agentic AI Systems · Chapter 17 · Difficulty: advanced · Estimated time: 600 min

Chapter 17: Observability & Debugging

Learning Objectives

By the end of this chapter, you will be able to:

  • Explain why traditional application monitoring falls short for agents, and how traces, spans, and events capture agent execution.
  • Apply observability and trajectory debugging to design reliable, production-grade agent systems.
  • Recognize operational trade-offs in tool use, orchestration, safety, and cost.

Distributed tracing, spans, cost attribution, and debugging failed trajectories

Why Traditional App Monitoring is Insufficient

Standard APM tools (Datadog, New Relic) track HTTP latency, error rates, and CPU usage. These metrics tell you that your agent is slow, but not why. Was it a slow LLM response? A tool API timeout? An infinite reasoning loop? Thirty redundant tool calls? Agent observability requires a domain-specific layer that understands the structure of agent execution.

Traditional APM

  • HTTP response times
  • Error counts and stack traces
  • CPU / memory usage
  • Database query latencies
  • No concept of "reasoning steps"

Agent Observability

  • Per-step LLM latency and token counts
  • Tool call success/failure rates
  • Reasoning quality (LLM-graded)
  • Cost per step, per task, per user
  • Full trajectory replay for debugging
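
Cost per step, per task, and per user can be derived from the token counts recorded on each LLM call. The following is a minimal sketch, not a definitive implementation: the `StepRecord` shape, the `rollup` helper, the model name, and the prices are all hypothetical, so substitute your provider's actual rates.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per 1M tokens; replace with your provider's real rates.
PRICES_PER_MTOK = {"example-model": {"input": 2.00, "output": 8.00}}


@dataclass
class StepRecord:
    """One LLM call, tagged with the user and task it belongs to (assumed shape)."""
    user_id: str
    task_id: str
    model: str
    input_tokens: int
    output_tokens: int


def step_cost(step: StepRecord) -> float:
    """Cost of a single LLM call in USD."""
    price = PRICES_PER_MTOK[step.model]
    return (step.input_tokens * price["input"]
            + step.output_tokens * price["output"]) / 1_000_000


def rollup(steps, key: str) -> dict:
    """Sum step costs grouped by an attribute such as 'task_id' or 'user_id'."""
    totals = defaultdict(float)
    for step in steps:
        totals[getattr(step, key)] += step_cost(step)
    return dict(totals)


steps = [
    StepRecord("u1", "t1", "example-model", 1200, 350),
    StepRecord("u1", "t1", "example-model", 2100, 480),
    StepRecord("u2", "t2", "example-model", 500, 100),
]
per_task = rollup(steps, "task_id")
per_user = rollup(steps, "user_id")
```

Recording tokens at the step level and aggregating afterwards keeps the same data usable for all three views (step, task, user).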

Traces, Spans, and Events

Agent observability follows the same distributed tracing model as microservices, but the spans represent LLM calls, tool calls, and reasoning steps rather than HTTP requests between services.

Trace

  • Full Task Execution (trace_id: abc123): Root span: user goal → final answer · Duration: 12.4s · Cost: $0.023

Spans

  • LLM Call (step 1): Model: gpt-4o · Tokens: 1,200 in + 350 out · Latency: 1.8s
  • Tool: search_web: Query: "latest news on X" · Status: success · Latency: 0.9s
  • LLM Call (step 2): Model: gpt-4o · Tokens: 2,100 in + 480 out · Latency: 2.1s

Events

  • Tool retry: Event logged on retry · Reason: RateLimitError · Retry #1 succeeded
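
The trace, span, and event hierarchy above can be sketched as plain data structures. This is an illustrative model only: the `Span` and `Event` classes, their field names, and the `kind` labels are assumptions for this example, not the schema of any particular tracing platform.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Event:
    """A point-in-time occurrence inside a span, such as a retry."""
    name: str
    attributes: dict
    timestamp: float = field(default_factory=time.time)


@dataclass
class Span:
    """One unit of work: the root task, an LLM call, or a tool call."""
    name: str
    kind: str                        # "root", "llm", or "tool" (hypothetical labels)
    trace_id: str                    # shared by every span in the same task
    parent_id: Optional[str] = None  # None marks the root span
    latency_s: float = 0.0
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])

    def add_event(self, name: str, **attributes) -> None:
        self.events.append(Event(name, attributes))


# Rebuild the example trace shown above.
root = Span("Full Task Execution", "root", "abc123", latency_s=12.4,
            attributes={"cost_usd": 0.023})
llm_1 = Span("LLM Call (step 1)", "llm", "abc123", parent_id=root.span_id, latency_s=1.8,
             attributes={"model": "gpt-4o", "tokens_in": 1200, "tokens_out": 350})
search = Span("search_web", "tool", "abc123", parent_id=root.span_id, latency_s=0.9,
              attributes={"query": "latest news on X", "status": "success"})
search.add_event("tool_retry", reason="RateLimitError", attempt=1)
llm_2 = Span("LLM Call (step 2)", "llm", "abc123", parent_id=root.span_id, latency_s=2.1,
             attributes={"model": "gpt-4o", "tokens_in": 2100, "tokens_out": 480})

trace = [root, llm_1, search, llm_2]
children = [s for s in trace if s.parent_id == root.span_id]
child_latency = sum(s.latency_s for s in children)
```

Because every span carries the shared `trace_id` and a `parent_id`, the full trajectory can be reassembled later even if spans are written out of order.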

Agent-Specific Signals to Monitor

| Signal | What it indicates when high | Action |
|---|---|---|
| Steps per task (P95) | Agent is looping or inefficient | Check for state change failures; add loop detection |
| Tool retry rate | API instability or rate limits | Add back-off; check upstream SLAs |
| Cost per task (P95) | Context growing unbounded | Add context eviction; check for large tool results |
| LLM call latency (P99) | Model overload or large context | Switch to faster model for cheap steps; reduce context |
| Task failure rate | Model degradation or prompt regression | Run regression eval; check if prompt changed |
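
Several of these signals can be computed from per-task records. The sketch below assumes a hypothetical record layout (`steps`, `tool_calls`, `tool_retries`, `failed`) with made-up sample values, and uses the nearest-rank definition of a percentile; your monitoring backend may define percentiles differently.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-task records: (steps, tool_calls, tool_retries, failed)
raw = [
    (3, 2, 0, False), (4, 3, 0, False), (4, 3, 0, False), (5, 4, 1, False),
    (5, 4, 0, False), (6, 4, 0, False), (6, 5, 0, False), (7, 5, 1, False),
    (8, 6, 0, False), (30, 24, 4, True),   # the last task looks like a loop
]
tasks = [
    {"steps": s, "tool_calls": c, "tool_retries": r, "failed": f}
    for s, c, r, f in raw
]

steps_p50 = percentile([t["steps"] for t in tasks], 50)
steps_p95 = percentile([t["steps"] for t in tasks], 95)
retry_rate = sum(t["tool_retries"] for t in tasks) / sum(t["tool_calls"] for t in tasks)
failure_rate = sum(t["failed"] for t in tasks) / len(tasks)
```

In this sample the median task takes 5 steps while P95 is 30, which is why tail percentiles rather than averages are the useful alerting signal for looping agents.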

Observability Platforms

| Platform | Strength | Best For |
|---|---|---|
| LangSmith | Deep LangChain/LangGraph integration; trajectory viewer with step-through replay | Teams using LangGraph in production |
| LangFuse | Open-source, self-hostable; cost tracking, prompt versioning | Cost-sensitive teams; privacy requirements |
| Arize Phoenix | LLM-native evaluation + monitoring; UMAP visualization of embedding clusters | Teams doing systematic evals alongside monitoring |
| Helicone | Simple drop-in proxy; caching, rate limiting, cost tracking with zero code changes | Quick observability without refactoring |
| Braintrust | Eval + tracing in one platform; collaborative prompt debugging | Teams iterating rapidly on prompts |

python: LangSmith tracing (zero-code integration)
import os

# Set env vars before invoking the app; all LangChain/LangGraph calls are then traced automatically
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "my-agent-production"
os.environ["LANGCHAIN_API_KEY"] = "ls__..."   # from LangSmith dashboard

# That's it: your existing LangGraph app now sends traces to LangSmith.
# `app` is your compiled graph and `config` its run configuration, both defined elsewhere.
result = app.invoke({"messages": [...]}, config=config)

# For non-LangChain code, use the @traceable decorator
from langsmith import traceable

@traceable(name="web_search_tool", run_type="tool")
def search_web(query: str) -> str:
    # ... tool implementation; the string below is only a placeholder
    search_results = f"results for {query!r}"
    return search_results

Debugging Failed Trajectories

A failed agent trajectory is a sequence of (Thought, Action, Observation) steps that ends in a wrong answer, an error, or a timeout. The debugging process mirrors debugging regular code — but the "call stack" is a reasoning chain.

1. Identify the failing trajectory. From your eval harness or production monitoring: task_id, session_id, timestamp.
2. Retrieve the full trace. Open the trajectory in your tracing tool; you need every (Thought, Action, Observation) turn.
3. Find the first wrong step. Work backward from the failure: what was the last correct state? What action caused the deviation?
4. Classify the failure mode. Wrong tool selected? Malformed arguments? Misread tool result? Context overflow? Classification determines the fix.
5. Fix and validate. Adjust system prompt, tool description, context management, or eviction policy; run regression suite to verify the fix doesn't break other tasks.
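
Part of step 3 can be automated with simple heuristics before a human reads the trace. The sketch below is illustrative only: the `Step` shape, the `first_suspect_step` helper, and the labels `tool_error` and `repeated_action` are hypothetical. It flags the first step that either raised an error or repeats an earlier action with identical arguments. Semantic failures, such as a misread tool result, still need manual or LLM-graded review.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    """One (Thought, Action, Observation) turn; the shape is an assumption."""
    thought: str
    action: str
    args: dict
    observation: str
    error: Optional[str] = None


def first_suspect_step(trajectory):
    """Return (index, label) for the first step matching a heuristic, or None."""
    seen = set()
    for index, step in enumerate(trajectory):
        if step.error is not None:
            return index, "tool_error"
        # Identical action + arguments seen before suggests the agent is looping.
        key = (step.action, json.dumps(step.args, sort_keys=True))
        if key in seen:
            return index, "repeated_action"
        seen.add(key)
    return None


trajectory = [
    Step("Need recent news", "search_web", {"query": "latest news on X"}, "10 results"),
    Step("Open the top hit", "fetch_page", {"url": "https://example.com/a"}, "page text"),
    Step("Search again", "search_web", {"query": "latest news on X"}, "10 results"),
]
suspect = first_suspect_step(trajectory)
```

Here `suspect` points at the third step (index 2), the repeated search, which gives the reviewer a starting point for classifying the failure mode.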

Detecting runaway costs

Set a per-session cost alert at 2–3× your expected average task cost. If a session exceeds the threshold, automatically terminate it and flag it for manual review. Most runaway costs trace back to one of three causes: (1) infinite loops accumulating tool result tokens, (2) very large document ingestion without chunk limits, or (3) recursive agent spawning without depth limits.
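
A per-session budget check along these lines might look like the following sketch. The `SessionCostGuard` class, the `BudgetExceeded` exception, and the dollar figures are hypothetical; in a real system the cost passed to `record` would come from your tracing or billing data.

```python
class BudgetExceeded(Exception):
    """Raised when a session spends more than its allowed budget."""


class SessionCostGuard:
    def __init__(self, expected_avg_cost_usd: float, multiplier: float = 3.0):
        # Threshold is a multiple of the expected average task cost.
        self.limit_usd = expected_avg_cost_usd * multiplier
        self.spent_usd = 0.0

    def record(self, cost_usd: float) -> None:
        """Add the cost of one step; raise once the session is over budget."""
        self.spent_usd += cost_usd
        if self.spent_usd > self.limit_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.3f}, limit ${self.limit_usd:.3f}"
            )


# Simulate a runaway loop in which every step costs $0.025 (made-up figure).
guard = SessionCostGuard(expected_avg_cost_usd=0.02, multiplier=3.0)
steps_run = 0
flagged_for_review = False
for _ in range(100):
    steps_run += 1
    try:
        guard.record(0.025)
    except BudgetExceeded:
        flagged_for_review = True   # terminate the session and queue it for review
        break
```

Checking the budget after every step, rather than at the end of the task, is what lets the guard stop a loop early instead of merely reporting it.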

Chapter 17 Quiz

1. What is a "trace" in the context of agent observability?

2. A high "steps per task (P95)" metric indicates which problem?

3. What is the correct first step when debugging a failed agent trajectory?