Learning Objectives

By the end of this chapter, you will be able to:

Explain why and when fine-tuning improves agentic behavior, and how SFT and RLVR differ.
Apply SFT and reinforcement fine-tuning to design reliable, production-grade agent systems.
Recognize operational trade-offs in tool use, orchestration, safety, and cost.

Section 5 — Advanced Topics & Frontiers

Chapter 20: Fine-Tuning for Agentic Behavior

SFT, RLVR, Tool-R0, ToolPO, ATLAS, and Trinity-RFT

Why Fine-Tune for Agentic Behavior?

General-purpose LLMs are trained primarily on text prediction — they are not optimized for the specific patterns of agentic workflows: consistent tool call formatting, multi-step planning, correct handling of tool errors, and restraint (knowing when not to call a tool).

Fine-tuning on agent-specific data improves tool selection accuracy, argument formatting reliability, reasoning quality over multi-step tasks, and compliance with the instruction hierarchy. This chapter covers the leading approaches from 2024–2026.

When fine-tuning makes sense

Your tool schemas are domain-specific and not well-represented in pre-training
Consistency of tool call format is critical
Latency budget requires a smaller, faster model
You have high-quality trajectory data from production

When fine-tuning is premature

Your prompt engineering has not yet been optimized
You have fewer than ~500 high-quality trajectories
The target behavior changes frequently
You can't afford the evaluation infrastructure to validate fine-tuned behavior

Supervised Fine-Tuning (SFT) on Tool Demonstrations

SFT teaches the model to imitate expert trajectories. You provide a dataset of (goal, reference trajectory) pairs where the trajectory represents ideal (Thought, Action, Observation, …, Answer) sequences.

Trajectory Dataset (human-annotated or filtered production data) → Data Format (conversation-style with tool messages) → SFT Training (next-token prediction on trajectory tokens) → Fine-tuned Model

SFT's limitation: it teaches imitation, not reasoning. A model trained only with SFT may fail on tasks that differ even slightly from the training distribution, because it has learned what to do but not why.

Data quality beats data quantity

Research consistently shows that 500 high-quality, diverse trajectories outperform 5,000 mediocre ones. Quality criteria: (1) the trajectory actually solves the task, (2) tool calls are necessary (no redundant calls), (3) reasoning steps are sound, (4) trajectories cover diverse failure modes, not just happy paths.
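Criteria (1) and (2) above can be checked mechanically. The following is a minimal sketch, assuming each trajectory carries a boolean `success` field and that each action is a dict with `name` and `arguments` keys (both are illustrative assumptions, not a fixed schema). It treats an identical repeated call as a proxy for redundancy; criteria (3) and (4) still need human or model-based review.

```python
import json


def has_redundant_calls(trajectory: dict) -> bool:
    """Return True if the same tool is called twice with identical arguments."""
    seen = set()
    for step in trajectory["steps"]:
        action = step.get("action")
        if not action:
            continue
        # Canonical key: tool name plus arguments serialized with sorted keys.
        key = (action["name"], json.dumps(action.get("arguments", {}), sort_keys=True))
        if key in seen:
            return True
        seen.add(key)
    return False


def passes_quality_filter(trajectory: dict) -> bool:
    """Check criterion (1), task solved, and criterion (2), no redundant calls."""
    solved = trajectory.get("success") is True
    return solved and not has_redundant_calls(trajectory)


good = {"success": True, "steps": [
    {"action": {"name": "search", "arguments": {"q": "weather Paris"}}},
    {"action": {"name": "fetch", "arguments": {"url": "https://example.com"}}},
]}
repeat = {"success": True, "steps": [
    {"action": {"name": "search", "arguments": {"q": "weather Paris"}}},
    {"action": {"name": "search", "arguments": {"q": "weather Paris"}}},
]}
print(passes_quality_filter(good), passes_quality_filter(repeat))  # True False
```

Note that this proxy would also flag a legitimate retry after a transient tool error, so a production filter may want to exempt calls whose previous observation was an error.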

RLVR and Reinforcement Fine-Tuning

Reinforcement Learning with Verifiable Rewards (RLVR) trains the model to maximize a reward based on whether the task outcome is correct, rather than to imitate a reference trajectory. The key insight from DeepSeek-R1: RL on tasks with verifiable rewards (math, code tests) produces strong emergent reasoning.

Policy Model (current agent model) → Run Task (agent attempts the task) → Verifiable Reward (did the task succeed? +1 / -1) → Policy Update (GRPO / PPO gradient step) ↺ iterate
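To make the reward and update steps of this loop concrete, here is a small sketch of a +1 / -1 verifiable reward and the group-relative advantage that GRPO computes from it: each rollout's reward is normalized against the other rollouts sampled for the same task. The exact-match checker and the function names are illustrative assumptions; real verifiers run unit tests or inspect tool-execution results, and implementations differ on details such as which standard deviation estimator they use.

```python
from statistics import mean, pstdev


def verifiable_reward(final_answer: str, expected: str) -> float:
    """+1 if the outcome matches the expected value, else -1 (toy exact-match checker)."""
    return 1.0 if final_answer.strip() == expected.strip() else -1.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of rollouts for the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled rollouts for one task whose correct answer is "42".
rollouts = ["42", "41", "42 ", "7"]
rewards = [verifiable_reward(a, expected="42") for a in rollouts]
advantages = group_relative_advantages(rewards)
print(rewards)                             # [1.0, -1.0, 1.0, -1.0]
print([round(a, 2) for a in advantages])   # [1.0, -1.0, 1.0, -1.0]
```

When every rollout in a group succeeds (or every one fails), all advantages are zero and the group contributes no learning signal, which is one reason task difficulty matters for RLVR.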

Leading Methods (2025–2026)

Method	Approach	Key Result
VerlTool	RL on tool-use trajectories with tool execution as the verifier	+15% tool accuracy on API-Call-Bench vs SFT-only
Trinity-RFT	Three-stage: curriculum hard-case mining, process-level reward shaping, and KV-cache reuse for efficiency	State-of-the-art on MATH, AIME, and agent benchmarks
ATLAS	Scales RLVR to small models (1B–7B) via tool-selection reward shaping	7B model matches GPT-4o on tool benchmarks
Tool-R0	RL via self-play: model generates tasks for itself and trains on outcomes	Demonstrates RL-driven reasoning emergence for tools
ToolPO	Preference optimization with fine-grained credit assignment to specific tool calls	Reduces hallucinated tool use by 40% vs standard DPO

Practical Fine-Tuning Recipes

Minimum Viable Recipe

1. Collect 500–2000 golden trajectories. Source them from human annotation or by filtering high-quality production runs (success = true, steps < 15, no error retries).
2. SFT for 2–3 epochs on a 7B base model. Use LoRA (rank 16, alpha 32) to minimize GPU memory; do a full fine-tune if you have A100s.
3. Evaluate on a held-out eval set. Compare TSR (task success rate), steps-per-task, and cost against the prompt-only baseline on the same tasks.
4. Add RLVR if SFT TSR < 80%. Use ToolPO or VerlTool if you have verifiable task outcomes; use GRPO for compute efficiency.
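The evaluation and decision steps of this recipe can be sketched as follows. This is an illustrative outline only: `RunResult`, `summarize`, and `needs_rlvr` are hypothetical names, TSR is taken to mean the fraction of tasks that succeed, and the numbers are made up to show the mechanics rather than reported results.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    success: bool
    steps: int
    cost_usd: float


def summarize(results: list[RunResult]) -> dict:
    """Aggregate the three comparison metrics over one eval run."""
    n = len(results)
    return {
        "tsr": sum(r.success for r in results) / n,
        "steps_per_task": sum(r.steps for r in results) / n,
        "cost_per_task": sum(r.cost_usd for r in results) / n,
    }


def needs_rlvr(sft_summary: dict, tsr_threshold: float = 0.80) -> bool:
    """Decision rule from the recipe: add RLVR if SFT TSR is below the threshold."""
    return sft_summary["tsr"] < tsr_threshold


# Same four held-out tasks, run with the prompt-only baseline and the SFT model.
baseline = [RunResult(True, 9, 0.04), RunResult(False, 14, 0.07),
            RunResult(False, 12, 0.06), RunResult(True, 8, 0.03)]
sft = [RunResult(True, 6, 0.01), RunResult(True, 7, 0.01),
       RunResult(False, 10, 0.02), RunResult(True, 5, 0.01)]

base_summary, sft_summary = summarize(baseline), summarize(sft)
print(base_summary["tsr"], sft_summary["tsr"], needs_rlvr(sft_summary))  # 0.5 0.75 True
```

Running both systems on the same held-out tasks keeps the comparison paired, so a difference in TSR reflects the model rather than the task mix.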

python — format trajectory data for SFT

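import json

# Placeholder system prompt; substitute the prompt your agent uses in production.
AGENT_SYSTEM_PROMPT = "You are a helpful agent that can call tools."


def format_trajectory_for_sft(trajectory: dict) -> dict:
    """
    Convert an agent trajectory into OpenAI-style chat fine-tuning messages.
    trajectory = {
        "goal": str,
        "steps": [{"thought": str, "action": dict | None, "observation": str | None}],
        "final_answer": str,
    }
    Each action is assumed to look like {"name": str, "arguments": dict | str}.
    """
    messages = [
        {"role": "system", "content": AGENT_SYSTEM_PROMPT},
        {"role": "user",   "content": trajectory["goal"]},
    ]

    for i, step in enumerate(trajectory["steps"]):
        thought = step.get("thought")
        action = step.get("action")
        call_id = f"call_{i}"  # synthetic ID linking the call to its result

        # One assistant turn per step: the thought and its tool call belong together
        turn = {"role": "assistant", "content": f"Thought: {thought}" if thought else None}
        if action:
            args = action.get("arguments", {})
            turn["tool_calls"] = [{
                "id": call_id,
                "type": "function",
                "function": {
                    "name": action["name"],
                    # The chat format expects arguments as a JSON string, not a dict
                    "arguments": args if isinstance(args, str) else json.dumps(args),
                },
            }]
        if thought or action:
            messages.append(turn)

        # Keep empty-string observations; emit a tool message only when there
        # is a preceding tool call for it to answer
        if action and step.get("observation") is not None:
            messages.append({
                "role": "tool",
                "tool_call_id": call_id,
                "content": step["observation"],
            })

    messages.append({"role": "assistant", "content": trajectory["final_answer"]})

    return {"messages": messages}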
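Before uploading, it helps to sanity-check each formatted example and serialize the set as JSONL (one JSON object per line). The sketch below is an assumption-laden illustration, not an official validator: `validate_example` and `to_jsonl` are hypothetical helpers, and the checks assume the OpenAI chat convention in which each tool call carries an `id`, its `arguments` are a JSON string, and each tool message references a call through `tool_call_id`.

```python
import json


def validate_example(example: dict) -> list[str]:
    """Return a list of problems; an empty list means the example looks well-formed."""
    problems = []
    open_call_ids = set()
    for i, msg in enumerate(example["messages"]):
        if msg["role"] == "assistant":
            for call in msg.get("tool_calls") or []:
                if call.get("id") is None:
                    problems.append(f"message {i}: tool call has no id")
                else:
                    open_call_ids.add(call["id"])
                try:
                    json.loads(call.get("function", {}).get("arguments"))
                except (TypeError, ValueError):
                    problems.append(f"message {i}: arguments are not a JSON string")
        elif msg["role"] == "tool":
            call_id = msg.get("tool_call_id")
            if call_id is None or call_id not in open_call_ids:
                problems.append(f"message {i}: tool result has no matching call")
    if example["messages"][-1]["role"] != "assistant":
        problems.append("last message should be the assistant's final answer")
    return problems


def to_jsonl(examples: list[dict]) -> str:
    """Serialize examples as one JSON object per line."""
    return "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples)


example = {"messages": [
    {"role": "system", "content": "You are an agent."},
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "Thought: use the calculator.",
     "tool_calls": [{"id": "call_0", "type": "function",
                     "function": {"name": "calc", "arguments": "{\"expr\": \"2 + 2\"}"}}]},
    {"role": "tool", "tool_call_id": "call_0", "content": "4"},
    {"role": "assistant", "content": "4"},
]}
print(validate_example(example))  # []
```

Examples that return a non-empty problem list are better fixed or dropped before training, since malformed tool messages teach the model the wrong call format.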