Chapter 20: Fine-Tuning for Agentic Behavior
Fine-Tuning for Agentic Behavior in Building Agentic AI Systems.
Learning Objectives
By the end of this chapter, you will be able to:
- Explain why and when fine-tuning for agentic behavior is worthwhile.
- Apply supervised and reinforcement fine-tuning to design reliable, production-grade agent systems.
- Recognize operational trade-offs in tool use, orchestration, safety, and cost.
SFT, RLVR, Tool-R0, ToolPO, ATLAS, and Trinity-RFT
Why Fine-Tune for Agentic Behavior?
General-purpose LLMs are trained primarily on text prediction — they are not optimized for the specific patterns of agentic workflows: consistent tool call formatting, multi-step planning, correct handling of tool errors, and restraint (knowing when not to call a tool).
Fine-tuning on agent-specific data improves: tool selection accuracy, argument formatting reliability, reasoning quality over multi-step tasks, and compliance with the instruction hierarchy. This chapter covers the leading approaches from 2024–2026.
When fine-tuning makes sense
- Your tool schemas are domain-specific and not well-represented in pre-training
- Consistency of tool call format is critical
- Latency budget requires a smaller, faster model
- You have high-quality trajectory data from production
When fine-tuning is premature
- Your prompt engineering has not yet been optimized
- You have fewer than ~500 high-quality trajectories
- The target behavior changes frequently
- You can't afford the evaluation infrastructure to validate fine-tuned behavior
Supervised Fine-Tuning (SFT) on Tool Demonstrations
SFT teaches the model to imitate expert trajectories. You provide a dataset of (goal, reference trajectory) pairs where the trajectory represents ideal (Thought, Action, Observation, …, Answer) sequences.
- Data: human-annotated or filtered production data
- Format: conversation-style with tool messages
- Objective: next-token prediction on trajectory tokens
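To make the training objective concrete, here is a minimal sketch of how next-token labels might be built from a conversation. It rests on assumptions the text does not state: a toy whitespace tokenizer stands in for a real one, loss is computed only on assistant tokens (a common convention, not a universal one), and `-100` is used as the masked label because it is the default `ignore_index` of PyTorch's cross-entropy loss. `toy_tokenize` and `build_labels` are hypothetical helpers.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def toy_tokenize(text: str) -> list[str]:
    """Stand-in for a real tokenizer: split on whitespace."""
    return text.split()


def build_labels(messages: list[dict]) -> tuple[list, list]:
    """Return (tokens, labels); every non-assistant token is masked out."""
    tokens, labels = [], []
    for msg in messages:
        piece = toy_tokenize(msg.get("content") or "")
        tokens.extend(piece)
        if msg["role"] == "assistant":
            labels.extend(piece)  # the model is trained to predict these
        else:
            labels.extend([IGNORE_INDEX] * len(piece))  # context only
    return tokens, labels


example = [
    {"role": "user", "content": "Weather in Paris?"},
    {"role": "assistant", "content": "Thought: call the weather tool"},
    {"role": "tool", "content": "18C sunny"},
    {"role": "assistant", "content": "It is 18C and sunny."},
]
tokens, labels = build_labels(example)
```

Under this assumed masking scheme, the user goal and tool observations still condition the model but contribute nothing to the loss, so the model is not trained to generate tool outputs itself.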
SFT's limitation: it teaches imitation, not reasoning. A model trained only with SFT may fail on tasks that differ slightly from the training distribution, because it has learned what to do but not why.
Data quality beats data quantity
A recurring finding in fine-tuning practice is that a small, curated set (roughly 500 high-quality, diverse trajectories) can outperform a much larger mediocre one (roughly 5,000). Quality criteria: (1) the trajectory actually solves the task, (2) every tool call is necessary (no redundant calls), (3) reasoning steps are sound, and (4) the set as a whole covers diverse failure modes, not just happy paths.
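Criteria (1) and (2) lend themselves to cheap automated checks. The sketch below is one possible filter, not a prescribed method: it assumes each trajectory carries a boolean `solved` label (from a verifier or an annotator) and that each action is a dict with `name` and `arguments` keys. Both are illustrative assumptions, and `passes_quality_filter` is a hypothetical helper. It treats an exact repeat of the same call as redundant; criteria (3) and (4) generally need human or model-based review.

```python
import json


def passes_quality_filter(trajectory: dict) -> bool:
    """Cheap checks for 'solves the task' and 'no redundant tool calls'."""
    # (1) Task solved: relies on an assumed boolean `solved` label.
    if not trajectory.get("solved", False):
        return False
    # (2) No redundant calls: reject exact duplicate (name, arguments) pairs.
    seen = set()
    for step in trajectory["steps"]:
        action = step.get("action")
        if action is None:
            continue
        key = (action["name"], json.dumps(action.get("arguments", {}), sort_keys=True))
        if key in seen:
            return False
        seen.add(key)
    return True


good = {"solved": True, "steps": [
    {"action": {"name": "search", "arguments": {"q": "a"}}},
    {"action": None},
    {"action": {"name": "search", "arguments": {"q": "b"}}},
]}
duplicate = {"solved": True, "steps": [
    {"action": {"name": "search", "arguments": {"q": "a"}}},
    {"action": {"name": "search", "arguments": {"q": "a"}}},
]}
kept = [t for t in (good, duplicate) if passes_quality_filter(t)]
```

Exact-duplicate detection is deliberately conservative: it will miss near-duplicates (for example, the same query with different casing), so it is best seen as a first pass.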
RLVR and Reinforcement Fine-Tuning
Reinforcement Learning with Verifiable Rewards (RLVR) trains the model to maximize a reward based on whether the task outcome is correct, rather than to imitate a reference trajectory. The key insight from DeepSeek-R1: RL on tasks with verifiable rewards (math, code tests) can produce strong emergent reasoning.
The training loop has four stages:
1. Policy: the current agent model
2. Rollout: the agent attempts the task
3. Verifier: did the task succeed? Reward +1 / -1
4. Update: GRPO / PPO gradient step
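The reward and update stages can be sketched in a few lines. This is an illustration under stated assumptions, not a training implementation: the verifier is a simple exact-match check against a known answer, and the advantage computation follows the group-relative idea behind GRPO (standardizing rewards across several rollouts of the same task). The population standard deviation and the `eps` term are choices made here for simplicity, and both function names are hypothetical.

```python
from statistics import mean, pstdev


def verifiable_reward(final_answer: str, expected: str) -> float:
    """+1 if the outcome matches the checkable ground truth, else -1."""
    return 1.0 if final_answer.strip() == expected.strip() else -1.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one group of rollouts for the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of the same task whose verified answer is "42".
rollouts = ["42", "41", "42", "7"]
rewards = [verifiable_reward(answer, "42") for answer in rollouts]
advantages = group_relative_advantages(rewards)
```

Successful rollouts get positive advantages and failed ones negative, so the gradient step pushes probability toward the trajectories that passed the verifier. When every rollout in a group earns the same reward, all advantages are zero and the group contributes no learning signal.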
Leading Methods (2025–2026)
| Method | Approach | Key Result |
|---|---|---|
| VerlTool | RL on tool-use trajectories with tool execution as the verifier | +15% tool accuracy on API-Call-Bench vs SFT-only |
| Trinity-RFT | Three-stage: curriculum hard-case mining, process-level reward shaping, and KV-cache reuse for efficiency | State-of-the-art on MATH, AIME, and agent benchmarks |
| ATLAS | Scales RLVR to small models (1B–7B) via tool-selection reward shaping | 7B model matches GPT-4o on tool benchmarks |
| Tool-R0 | RL via self-play: model generates tasks for itself and trains on outcomes | Demonstrates RL-driven reasoning emergence for tools |
| ToolPO | Preference optimization with fine-grained credit assignment to specific tool calls | Reduces hallucinated tool use by 40% vs standard DPO |
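For reference, the standard DPO baseline mentioned in the ToolPO row scores a whole preferred response against a whole rejected one. The sketch below shows that baseline loss for a single preference pair, using summed sequence log-probabilities under the policy and a frozen reference model. It is not ToolPO: the per-tool-call credit assignment described in the table is not reproduced here. The function name and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of trajectories."""
    # How much more the policy prefers the chosen response than the reference does.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


no_preference = dpo_loss(0.0, 0.0, 0.0, 0.0)
prefers_chosen = dpo_loss(-1.0, -5.0, -2.0, -2.0)
```

Because the loss depends only on whole-sequence log-probabilities, every token in the rejected trajectory is penalized alike, including tool calls that were actually correct. That coarse signal is the gap fine-grained credit assignment aims to close.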
Practical Fine-Tuning Recipes
Minimum Viable Recipe
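```python
import json

# Placeholder: replace with the system prompt your agent uses in production.
AGENT_SYSTEM_PROMPT = "You are a helpful agent. Use tools only when needed."


def format_trajectory_for_sft(trajectory: dict) -> dict:
    """
    Convert an agent trajectory into the OpenAI chat fine-tuning message format.

    trajectory = {
        "goal": str,
        "steps": [{"thought": str, "action": dict | None, "observation": str | None}],
        "final_answer": str,
    }
    Each action is assumed to look like {"name": str, "arguments": dict | str}.
    """
    messages = [
        {"role": "system", "content": AGENT_SYSTEM_PROMPT},
        {"role": "user", "content": trajectory["goal"]},
    ]
    for i, step in enumerate(trajectory["steps"]):
        thought = step.get("thought")
        action = step.get("action")
        if not thought and not action:
            continue
        # Emit thought and tool call as one assistant turn, not two in a row.
        turn = {"role": "assistant", "content": f"Thought: {thought}" if thought else None}
        messages.append(turn)
        if not action:
            continue
        call_id = f"call_{i}"
        args = action.get("arguments", {})
        turn["tool_calls"] = [{
            "id": call_id,
            "type": "function",
            "function": {
                "name": action["name"],
                # The chat format expects arguments as a JSON-encoded string.
                "arguments": args if isinstance(args, str) else json.dumps(args),
            },
        }]
        # Every tool call needs a matching tool message, even if the output is empty.
        messages.append({
            "role": "tool",
            "tool_call_id": call_id,
            "content": step.get("observation") or "",
        })
    messages.append({"role": "assistant", "content": trajectory["final_answer"]})
    return {"messages": messages}
```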
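Before training, it can help to sanity-check each formatted record and serialize the set as JSON Lines (one JSON object per line), a layout commonly used for chat fine-tuning datasets. The sketch below is only an example under assumptions: `validate_sft_example` and `write_jsonl` are hypothetical helpers, and the checks cover basic turn ordering (tool results must follow tool calls, and the record must end on an assistant turn), not any provider's full schema.

```python
import json


def validate_sft_example(example: dict) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    messages = example.get("messages", [])
    if not messages or messages[-1].get("role") != "assistant":
        problems.append("last message must be an assistant turn")
    open_calls = 0  # tool calls still waiting for a tool message
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role == "tool":
            if open_calls == 0:
                problems.append(f"message {i}: tool result without a preceding tool call")
            else:
                open_calls -= 1
        else:
            if open_calls > 0:
                problems.append(f"message {i}: earlier tool call has no result")
            open_calls = len(msg.get("tool_calls") or []) if role == "assistant" else 0
    if open_calls > 0:
        problems.append("trailing tool call has no result")
    return problems


def write_jsonl(examples: list[dict], path: str) -> int:
    """Write only the examples that pass validation; return how many were written."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            if not validate_sft_example(example):
                f.write(json.dumps(example, ensure_ascii=False) + "\n")
                written += 1
    return written
```

Dropping invalid records silently is a simplification; in practice you may prefer to log the returned problem list so that systematic formatting bugs in the trajectory collector surface early.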
Chapter 20 Quiz
1. What is the key limitation of SFT-only fine-tuning for agents?
2. How does ToolPO improve over standard DPO for tool-use fine-tuning?
3. In the RLVR training loop, what provides the reward signal?