Course Building Agentic AI Systems Chapter 5 Difficulty advanced Estimated Time 600 min

Chapter 5: Tool Use

Tool Use in Building Agentic AI Systems.

23% complete

Learning Objectives

By the end of this chapter, you will be able to:

  • Explain the agentic AI concept behind Tool Use.
  • Apply Tool Use to design reliable, production-grade agent systems.
  • Recognize operational trade-offs in tool use, orchestration, safety, and cost.

Chapter 5: Tool Use

Anatomy, schemas, parallel calls, failures, and integration

Tool Use — The Agent's Hands

Tool use is what transforms a language model into an agent that can affect the world. Without tools, an agent can only generate text. With tools, it can search the web, run code, call APIs, read/write files, query databases, and trigger workflows.

How tool calling works at inference time

The model receives the tool schemas alongside the conversation. When it decides to use a tool, it generates a structured JSON object containing the tool name and arguments instead of a free-form text response. Your application code intercepts that JSON, executes the real function, and returns the result as a new message. The model never directly executes anything — it only decides what to call.

🤖
LLM

Generates JSON call

🔌
Dispatcher

Parses & routes

🛠
Tool Fn

Executes

📩
Result

Returned as tool message

🤖
LLM

Reads result, continues

Tool Anatomy

A well-designed tool has three parts: a clear name, a precise description, and a strict parameter schema. The description is read by the LLM at inference time — it is documentation for the model, not for developers.

python — well-designed tool definition
from pydantic import BaseModel, Field
from typing import Literal

# Use Pydantic for schema generation — avoids JSON schema hand-rolling

class WebSearchInput(BaseModel):
    query: str = Field(
        description="Specific, focused search query. Be precise — vague queries return poor results."
    )
    max_results: int = Field(
        default=5,
        ge=1, le=20,
        description="Number of results to retrieve. Use 3-5 for general queries."
    )

def search_web(query: str, max_results: int = 5) -> str:
    """
    Search the web for current information not in the training data.
    Use for: current events, real-time data, recent publications.
    Do NOT use for: mathematical computation, code execution, or static factual knowledge.
    Returns: top results as formatted text with titles, URLs, and snippets.
    """
    # Implementation: call search API (Brave, Serper, Tavily, etc.)
    results = _call_search_api(query, max_results)
    return _format_results(results)


# Convert to OpenAI tool schema
import openai_schema_generator  # or build manually

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": search_web.__doc__,
            "parameters": WebSearchInput.model_json_schema(),
            "strict": True,  # Forces exact schema adherence (OpenAI structured outputs)
        }
    }
]

Tool Design Rules

1
One tool, one responsibilityA tool that searches AND writes is harder to reason about and audit than two separate tools
2
Description tells the model WHEN to use itNot just what it does, but when NOT to use it — negative guidance prevents misuse
3
Idempotent tools are saferRead-only and idempotent write tools can be safely retried; non-idempotent ones (send email) need confirmation guards
4
Return structured output, not proseReturn JSON or structured text — the model parses the tool result; unstructured prose degrades parsing quality
5
Include error messages in the return typeA tool that raises an exception breaks the loop; return {"error": "...", "retryable": true} instead

Tool Call Flow & Failure Handling

python — robust tool dispatcher
import json
import time
from typing import Any

def dispatch_tool_call(
    tool_registry: dict[str, callable],
    tool_name: str,
    raw_arguments: str,
    max_retries: int = 2,
    retry_delay: float = 1.0,
) -> str:
    # 1. Parse arguments — catch malformed JSON from the LLM
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as e:
        return json.dumps({"error": f"Invalid tool arguments: {e}", "retryable": False})

    # 2. Resolve tool — catch hallucinated tool names
    tool_fn = tool_registry.get(tool_name)
    if tool_fn is None:
        return json.dumps({
            "error": f"Unknown tool '{tool_name}'. Available: {list(tool_registry.keys())}",
            "retryable": False,
        })

    # 3. Execute with retry for transient failures (rate limits, timeouts)
    for attempt in range(max_retries + 1):
        try:
            result = tool_fn(**args)
            return result if isinstance(result, str) else json.dumps(result)
        except RateLimitError:
            if attempt < max_retries:
                time.sleep(retry_delay * (2 ** attempt))   # exponential back-off
                continue
            return json.dumps({"error": "Rate limit exceeded after retries", "retryable": True})
        except TimeoutError:
            return json.dumps({"error": "Tool timed out", "retryable": True})
        except Exception as e:
            return json.dumps({"error": str(e), "retryable": False})

Never let a tool exception propagate to the LLM loop

If a tool raises an uncaught exception, your agent loop crashes and the user sees an error. Wrap every tool execution in a try/except and return a structured error message. The LLM can then decide to retry, use an alternative tool, or report the problem gracefully.

Parallel vs Sequential Tool Calls

Modern LLMs (GPT-4o, Claude 3.5+) can request multiple tool calls in a single response turn. This is called parallel function calling. When the calls are independent, it can cut total latency dramatically.

Sequential

  • Call 1 → wait → Call 2 → wait → Call 3
  • Total latency = sum of all tool latencies
  • Necessary when: Call 2 depends on Call 1's result
  • Example: search → read top result → summarize

Parallel

  • Calls 1, 2, 3 dispatched simultaneously
  • Total latency = max(individual latencies)
  • Necessary when: all calls are independent
  • Example: search 3 topics simultaneously for a report
python — executing parallel tool calls
import asyncio
from openai.types.chat import ChatCompletionMessageToolCall

async def execute_parallel_tools(
    tool_calls: list[ChatCompletionMessageToolCall],
    tool_registry: dict[str, callable],
) -> list[dict]:
    """Execute all tool calls concurrently; return results in the same order."""

    async def _run_one(call: ChatCompletionMessageToolCall) -> dict:
        result = await asyncio.to_thread(
            dispatch_tool_call, tool_registry, call.function.name, call.function.arguments
        )
        return {"role": "tool", "tool_call_id": call.id, "content": result}

    return await asyncio.gather(*[_run_one(c) for c in tool_calls])

Tool selection when the list is large

Models perform poorly when given more than ~30 tools at once — the schema descriptions overload the context and the model struggles to select correctly. For large tool sets (50+), use semantic tool routing: embed the user query, find the top-K most semantically relevant tool descriptions, and only include those K tools in the current turn's context.

Chapter 5 Quiz

1. When the LLM "calls a tool," what actually executes the function?

2. Why should a tool's description specify when NOT to use it?

3. When are parallel tool calls NOT appropriate?