Learning Objectives

By the end of this chapter, you will be able to:

Explain the agentic AI concept behind Reasoning Deep-Dive.
Apply Reasoning Deep-Dive to design reliable, production-grade agent systems.
Recognize operational trade-offs in tool use, orchestration, safety, and cost.

Section 1 — Foundations & Mental Models

Chapter 4: Reasoning Deep-Dive

Chain-of-thought, reasoning models, and when to search instead of think

How Agents Think

An agent's behavior quality is bounded by two things: the quality of its reasoning and the quality of its information. This chapter is about reasoning. Information access (tools, RAG, memory) is covered in Chapters 5–7.

Reasoning in the context of agents means something specific: the process of moving from the current state of observations to a decision about the next action. Poor reasoning leads to choosing the wrong tool, misinterpreting tool results, or failing to recognize when the goal is complete.

Reasoning quality is not the same as model intelligence

A well-prompted GPT-4o with explicit chain-of-thought can outperform o3 with a minimal prompt on agent tasks. Reasoning quality is a function of model capability × prompt structure × task difficulty. Optimizing the prompt is usually cheaper than switching to a larger model, so try it first.

Chain-of-Thought (CoT)

Chain-of-thought prompting (Wei et al., 2022) asks the model to reason step by step before giving a final answer. The mechanism: each intermediate step the model writes becomes context for the tokens that follow, so the model spends more computation on the problem and can build on partial results instead of producing the answer in a single pass. On multi-step tasks this substantially reduces errors.

Direct Prompt (No CoT)

User: "Should I search the web or use the database?"
Model: "Use the database."
No reasoning exposed
Error rate: high on complex decisions
Faster, lower token cost

CoT Prompt

User: same question, but "Think step by step"
Model: "The question asks about current events. My training cutoff is 2024. Current events are not in the database. Therefore I should search the web."
Reasoning is auditable
Error rate: lower on complex decisions
Slower, higher token cost
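The two prompt styles above can be sketched as a pair of prompt builders. This is a minimal illustration, not a prescribed API: the function names, the `COT_SUFFIX` wording, and the `Answer:` marker convention are all assumptions made for this example.

```python
# Hypothetical instruction text; the exact wording is an assumption.
COT_SUFFIX = (
    "Think step by step, then give your final choice "
    "on a line starting with 'Answer:'."
)


def build_direct_prompt(question: str) -> str:
    """Ask for the answer only: cheapest, but no reasoning is exposed."""
    return f"{question}\nReply with the answer only."


def build_cot_prompt(question: str) -> str:
    """Ask for visible reasoning first, then a clearly marked answer."""
    return f"{question}\n{COT_SUFFIX}"


def extract_answer(completion: str) -> str:
    """Return the text after the last 'Answer:' marker.

    Falls back to the whole completion when the marker is missing,
    which is what a direct (non-CoT) reply looks like.
    """
    marker = "Answer:"
    idx = completion.rfind(marker)
    if idx == -1:
        return completion.strip()
    return completion[idx + len(marker):].strip()
```

Keeping the final answer behind a fixed marker lets the agent log the reasoning for auditing while passing only the decision to downstream code.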

CoT in the Agent Context

In a ReAct agent, the Thought step is CoT made explicit. The scratchpad — the Thought text before each Action — is the chain-of-thought. This is why ReAct is effective: it forces the model to reason before acting on every step, not just at the end.

Example ReAct scratchpad — the Thought IS the chain-of-thought

Thought: The user wants to know the current stock price of NVDA.
        I cannot answer from memory because stock prices change in real time.
        I should call the `get_stock_price` tool with ticker="NVDA".

Action: get_stock_price({"ticker": "NVDA"})
Observation: {"ticker": "NVDA", "price": 132.45, "currency": "USD", "timestamp": "2026-05-04T17:02:00Z"}

Thought: I now have the current price. No additional tool calls needed.
        I should report the price clearly with the timestamp.

Final Answer: NVDA is currently trading at $132.45 USD (as of 17:02 UTC, May 4 2026).
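A scratchpad like the one above is typically produced by a loop that alternates model calls and tool calls. The sketch below shows one way such a loop might look. It assumes the text format used in the example (`Action: name({json})`, `Final Answer: ...`); the `get_stock_price` stub returns a fixed value, and `scripted_llm` stands in for a real model call so the example runs offline.

```python
import json
import re
from typing import Callable, Dict


def get_stock_price(ticker: str) -> dict:
    """Hypothetical tool stub: returns a fixed quote instead of calling an API."""
    return {"ticker": ticker, "price": 132.45, "currency": "USD"}


TOOLS: Dict[str, Callable[..., dict]] = {"get_stock_price": get_stock_price}

ACTION_RE = re.compile(r"^Action:\s*(\w+)\((\{.*\})\)\s*$", re.MULTILINE)
FINAL_RE = re.compile(r"^Final Answer:\s*(.+)$", re.MULTILINE)


def react_loop(llm: Callable[[str], str], question: str, max_steps: int = 5) -> str:
    """Alternate Thought/Action (model) and Observation (tool) until done."""
    scratchpad = f"Question: {question}\n"
    for _ in range(max_steps):
        output = llm(scratchpad)          # model writes Thought + Action or Final Answer
        scratchpad += output + "\n"
        final = FINAL_RE.search(output)
        if final:
            return final.group(1).strip()
        action = ACTION_RE.search(output)
        if action is None:
            scratchpad += "Observation: no valid Action found.\n"
            continue
        name, raw_args = action.group(1), action.group(2)
        tool = TOOLS.get(name)
        if tool is None:
            observation = {"error": f"unknown tool {name}"}
        else:
            observation = tool(**json.loads(raw_args))
        scratchpad += f"Observation: {json.dumps(observation)}\n"
    return "Stopped: step limit reached without a final answer."


def scripted_llm(scratchpad: str) -> str:
    """Stand-in for a model: replays the two turns from the example trace."""
    if "Observation:" not in scratchpad:
        return (
            "Thought: Stock prices change in real time, so I need the tool.\n"
            'Action: get_stock_price({"ticker": "NVDA"})'
        )
    return (
        "Thought: I have the price and need no further tool calls.\n"
        "Final Answer: NVDA is trading at $132.45 USD."
    )
```

The `max_steps` cap matters in practice: without it, a model that never emits a final answer would loop and spend tokens indefinitely.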

Advanced CoT Variants

Self-Consistency

Sample N reasoning chains independently; take the majority answer. Works well when the answer space is discrete.

Use when: answer correctness matters more than cost
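Self-consistency reduces to a majority vote over independently sampled answers. A minimal sketch follows; `sample` is a hypothetical callable that runs one reasoning chain (at a non-zero temperature) and returns only the extracted final answer.

```python
from collections import Counter
from typing import Callable, Tuple


def self_consistency(sample: Callable[[], str], n: int = 5) -> Tuple[str, float]:
    """Draw n answers and return (majority answer, share of votes it received)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    answers = [sample() for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n
```

The vote share doubles as a rough confidence signal: a 3-of-5 win is a weaker result than 5-of-5, and an agent could escalate low-agreement cases to a stronger model or a human. Cost scales linearly with `n`.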

Tree-of-Thoughts

Explore multiple reasoning branches in a tree structure, evaluating and pruning at each node.

Use when: the solution space has branching paths (e.g., planning, math proofs)
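The expand, evaluate, prune cycle can be illustrated on a toy problem. The sketch below is a beam-search simplification of Tree-of-Thoughts, not the original algorithm: in a real agent the `expand` and `score` functions would be model calls that propose and rate partial thoughts, whereas here they are simple arithmetic so the example is deterministic.

```python
from typing import List, Optional, Tuple

State = Tuple[int, Tuple[str, ...]]  # (current value, steps taken so far)


def expand(state: State) -> List[State]:
    """Propose child thoughts: three candidate operations on the current value."""
    value, steps = state
    return [
        (value + 3, steps + ("+3",)),
        (value * 2, steps + ("*2",)),
        (value - 1, steps + ("-1",)),
    ]


def score(state: State, target: int) -> int:
    """Evaluate a thought: closer to the target is better (0 is a solution)."""
    return -abs(target - state[0])


def tree_of_thoughts(start: int, target: int,
                     beam_width: int = 3, max_depth: int = 6) -> Optional[State]:
    """Breadth-first search that keeps only the best beam_width nodes per level."""
    frontier: List[State] = [(start, ())]
    for _ in range(max_depth):
        for state in frontier:
            if state[0] == target:
                return state
        candidates = [child for s in frontier for child in expand(s)]
        candidates.sort(key=lambda s: score(s, target), reverse=True)
        frontier = candidates[:beam_width]  # prune weak branches
    for state in frontier:
        if state[0] == target:
            return state
    return None
```

Pruning is what keeps the cost bounded: each level evaluates at most `beam_width × 3` candidates instead of the full exponential tree.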

Mind-Map Agent

Constructs a knowledge graph during reasoning to track logical relationships across many retrieved facts.

Use when: tasks require synthesizing many documents (deep research)
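The core idea, storing retrieved facts as a graph so that relationships across documents can be traced, can be sketched with a small in-memory structure. This is an illustrative assumption about how such a graph might be kept, not the Mind-Map Agent's actual implementation; the class, method names, and source labels are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, List, Optional, Tuple

Hop = Tuple[str, str, str, str]  # (subject, relation, object, source document)


class MindMap:
    """Tiny knowledge graph: facts are edges tagged with where they came from."""

    def __init__(self) -> None:
        self._edges: Dict[str, List[Tuple[str, str, str]]] = defaultdict(list)

    def add_fact(self, subject: str, relation: str, obj: str, source: str) -> None:
        self._edges[subject].append((relation, obj, source))

    def path(self, start: str, goal: str) -> Optional[List[Hop]]:
        """Breadth-first search for the shortest chain of facts linking two entities."""
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            node, hops = queue.popleft()
            if node == goal:
                return hops
            for relation, obj, source in self._edges.get(node, []):
                if obj not in seen:
                    seen.add(obj)
                    queue.append((obj, hops + [(node, relation, obj, source)]))
        return None
```

Because every hop carries its source, the agent can cite which documents support each link in a synthesized conclusion.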

Reasoning Models

Standard LLMs like GPT-4o begin generating the visible answer directly from the prompt. Reasoning models (o1, o3, QwQ, DeepSeek-R1) are trained to produce an extended internal scratchpad, sometimes called "thinking tokens", before generating the visible response. This scratchpad is akin to a built-in chain-of-thought that the model runs at inference time without being prompted for it.

Model	Approach	Strength in Agents	Weakness
GPT-4o	Standard next-token prediction with RLHF	Fast, broad capability, strong tool calling	Multi-step reasoning errors on complex tasks
o3 / o4-mini	Extended thinking via RL on verifiable tasks	Long-horizon planning, math/code reasoning	Higher latency and cost per step
DeepSeek-R1	Large-scale RL on reasoning tasks; smaller distilled variants available	Open weights, strong STEM, cost-efficient	Weaker on non-STEM knowledge tasks
QwQ / Qwen3	Thinking mode toggled via system prompt	Flexible: fast mode or thinking mode	Newer, less production data
Claude 3.7/4 (extended thinking)	Extended thinking budget via API parameter	Very strong instruction following + reasoning	Expensive at high thinking budgets

When to pay for a reasoning model

Use a reasoning model (o3, R1, QwQ thinking mode) when individual steps require complex planning or mathematical precision — e.g., a planner agent decomposing a 20-step research task. Use a standard model for routine tool dispatch steps (e.g., "given the search query the planner generated, call the search API") where reasoning depth is not the bottleneck.
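One way to apply this rule is a per-step router that chooses the model tier before each call. The sketch below is an assumption-laden illustration: the step kinds, the substep threshold, and the model labels are placeholders to be replaced with whatever the deployment actually uses.

```python
from dataclasses import dataclass

# Hypothetical step kinds that benefit from deep reasoning.
REASONING_KINDS = frozenset({"plan", "math", "code_design"})


@dataclass(frozen=True)
class Step:
    kind: str                   # e.g. "plan", "math", "tool_dispatch"
    expected_substeps: int = 1  # rough estimate of how much decomposition is needed


def pick_model(step: Step,
               reasoning_model: str = "reasoning-model",
               standard_model: str = "standard-model",
               substep_threshold: int = 5) -> str:
    """Send planning-heavy steps to the reasoning tier, everything else to the fast tier."""
    if step.kind in REASONING_KINDS or step.expected_substeps >= substep_threshold:
        return reasoning_model
    return standard_model
```

For the example in the text, `pick_model(Step("plan", expected_substeps=20))` selects the reasoning tier, while a routine `Step("tool_dispatch")` stays on the standard model.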

When to Search Instead of Reason

One of the most impactful decisions in agent design is distinguishing between knowledge that should be reasoned about (known facts, logical inferences) and knowledge that must be retrieved (current events, proprietary data, anything after the model's training cutoff).

The SMTL Pattern (Search More, Think Less)

Research from 2026 (SMTL framework) showed that replacing sequential long reasoning chains with parallel evidence acquisition followed by synthesis reduced average reasoning steps by 70.7% while achieving state-of-the-art results on GAIA (75.7%) and BrowseComp (48.6%). The insight: an agent that retrieves 10 sources in parallel and then synthesizes is faster and more accurate than one that reasons at length from an incomplete knowledge base.
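The general shape of "retrieve in parallel, then synthesize once" can be sketched with a thread pool. This is not the SMTL implementation, only an illustration of the pattern; `fake_search` is a hypothetical stand-in for a real search tool, and `synthesize` only builds the prompt that a model would receive.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fake_search(query: str) -> str:
    """Hypothetical search tool: returns a canned snippet so the example runs offline."""
    return f"evidence for '{query}'"


def gather_evidence(queries: List[str],
                    search: Callable[[str], str] = fake_search,
                    max_workers: int = 8) -> List[str]:
    """Run all searches concurrently; results come back in query order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(search, queries))


def synthesize(question: str, evidence: List[str]) -> str:
    """Build one synthesis prompt over all evidence instead of reasoning per source."""
    bullets = "\n".join(f"- {item}" for item in evidence)
    return (
        f"Question: {question}\n"
        f"Evidence:\n{bullets}\n"
        "Answer using only the evidence above."
    )
```

With I/O-bound tools such as web search, total latency approaches that of the slowest single query instead of the sum of all queries, and the model reasons once over complete evidence.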

Decision flow (diagram): Question → Is this factual / current? → Search / Retrieve (real-time, proprietary) → Reason (logic, math, inference) → Answer

Practical heuristics

Task	Correct approach	Wrong approach
Calculate the sum of a list	Run code (tool)	Reason numerically in CoT (error-prone)
Latest model benchmark scores	Web search	Rely on training data (stale)
Design a plan to refactor code	Deep reasoning (o3)	Simple search (no single document answers this)
Current exchange rates	Finance API tool	Any form of reasoning from memory
Explain gradient descent	Standard LLM (GPT-4o)	Reasoning model (overkill; waste of cost)
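The table can be read as a priority-ordered decision rule. The sketch below encodes it as a small function; the three boolean task attributes and the ordering of the checks are assumptions made for illustration, and a real agent would have to estimate them (for example with a cheap classifier call).

```python
from enum import Enum


class Approach(Enum):
    RUN_CODE = "run code (tool)"
    RETRIEVE = "search / API tool"
    DEEP_REASONING = "reasoning model"
    STANDARD_LLM = "standard LLM"


def choose_approach(is_exact_computation: bool,
                    needs_fresh_or_private_data: bool,
                    needs_multistep_planning: bool) -> Approach:
    """Map task attributes to an approach, checking the cheapest reliable option first."""
    if is_exact_computation:
        return Approach.RUN_CODE        # e.g. summing a list
    if needs_fresh_or_private_data:
        return Approach.RETRIEVE        # e.g. exchange rates, latest benchmarks
    if needs_multistep_planning:
        return Approach.DEEP_REASONING  # e.g. planning a refactor
    return Approach.STANDARD_LLM        # e.g. explaining gradient descent
```

The ordering reflects the table's logic: exact computation and fresh data should never be handled by reasoning from memory, so those checks come before any choice between model tiers.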