Course: Building Agentic AI Systems | Chapter: 20 | Difficulty: advanced | Estimated Time: 600 min

Chapter 20: Fine-Tuning for Agentic Behavior

Learning Objectives

By the end of this chapter, you will be able to:

  • Explain the agentic AI concept behind Fine-Tuning for Agentic Behavior.
  • Apply Fine-Tuning for Agentic Behavior to design reliable, production-grade agent systems.
  • Recognize operational trade-offs in tool use, orchestration, safety, and cost.

SFT, RLVR, Tool-R0, ToolPO, ATLAS, and Trinity-RFT

Why Fine-Tune for Agentic Behavior?

General-purpose LLMs are trained primarily on text prediction and are not optimized for the specific patterns of agentic workflows: consistent tool-call formatting, multi-step planning, correct handling of tool errors, and restraint (knowing when not to call a tool).

Fine-tuning on agent-specific data can improve tool selection accuracy, argument formatting reliability, reasoning quality over multi-step tasks, and compliance with the instruction hierarchy. This chapter covers the leading approaches from 2024–2026.

When fine-tuning makes sense

  • Your tool schemas are domain-specific and not well-represented in pre-training
  • Consistency of tool call format is critical
  • Latency budget requires a smaller, faster model
  • You have high-quality trajectory data from production

When fine-tuning is premature

  • Your prompt engineering has not yet been optimized
  • You have fewer than ~500 high-quality trajectories
  • The target behavior changes frequently
  • You can't afford the evaluation infrastructure to validate fine-tuned behavior

Supervised Fine-Tuning (SFT) on Tool Demonstrations

SFT teaches the model to imitate expert trajectories. You provide a dataset of (goal, reference trajectory) pairs, where each reference trajectory is an ideal sequence of (Thought, Action, Observation, …, Answer) steps.

  1. Trajectory Dataset: human-annotated or filtered production data
  2. Data Format: conversation-style with tool messages
  3. SFT Training: next-token prediction on trajectory tokens
  4. Fine-tuned Model
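
The "next-token prediction on trajectory tokens" step is often implemented with a loss mask, so that the model is trained only on the tokens it should produce (assistant turns) and not on system, user, or tool tokens. The chapter does not prescribe this, so treat the following as a minimal sketch under that assumption. `toy_tokenize` and `build_loss_mask` are hypothetical names, and the whitespace tokenizer is a stand-in for the model's real tokenizer.

```python
def toy_tokenize(text: str) -> list[str]:
    """Stand-in tokenizer (whitespace split); a real pipeline would use the model's tokenizer."""
    return text.split()


def build_loss_mask(messages: list[dict]) -> tuple[list[str], list[bool]]:
    """
    Flatten a conversation into tokens plus a parallel mask.
    mask[i] is True when token i belongs to an assistant turn and should
    contribute to the next-token loss; all other tokens are context only.
    """
    tokens: list[str] = []
    mask: list[bool] = []
    for msg in messages:
        piece = toy_tokenize(msg.get("content") or "")
        tokens.extend(piece)
        mask.extend([msg["role"] == "assistant"] * len(piece))
    return tokens, mask


example = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "Thought: add them"},
    {"role": "tool", "content": "4"},
    {"role": "assistant", "content": "The answer is 4"},
]
tokens, mask = build_loss_mask(example)
```

In a PyTorch-based trainer, the masked-out positions would typically be given the label -100, which is the default `ignore_index` of `CrossEntropyLoss`.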

SFT's limitation: it teaches imitation, not reasoning. A model trained only with SFT may fail on tasks that differ even slightly from the training distribution, because it has learned what to do but not why.

Data quality beats data quantity

A recurring finding in fine-tuning work is that a small set of high-quality, diverse trajectories (on the order of 500) often outperforms a much larger set of mediocre ones (on the order of 5,000). Quality criteria: (1) the trajectory actually solves the task, (2) every tool call is necessary (no redundant calls), (3) reasoning steps are sound, (4) trajectories cover diverse failure modes, not just happy paths.
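
Some of these criteria can be checked automatically before a human reviews the rest. The sketch below is one possible filter, not a prescribed one. It assumes each trajectory carries a hypothetical boolean `success` field alongside the `steps` list, treats an exact repeat of the same action as a redundant call, and borrows the `steps < 15` cutoff from the recipe later in this chapter. Soundness of reasoning (criterion 3) and dataset-level diversity (criterion 4) are not covered here.

```python
import json


def has_redundant_calls(trajectory: dict) -> bool:
    """Heuristic: flag a trajectory if the exact same action appears twice."""
    seen: set[str] = set()
    for step in trajectory["steps"]:
        action = step.get("action")
        if action is None:
            continue
        # Serialize with sorted keys so argument order does not matter
        key = json.dumps(action, sort_keys=True)
        if key in seen:
            return True
        seen.add(key)
    return False


def passes_quality_filter(trajectory: dict, max_steps: int = 15) -> bool:
    """Keep trajectories that solved the task, stayed short, and had no repeated calls."""
    return (
        trajectory.get("success") is True          # criterion 1 (assumed field)
        and len(trajectory["steps"]) < max_steps   # short enough to be a clean demonstration
        and not has_redundant_calls(trajectory)    # criterion 2 (exact-duplicate heuristic)
    )
```

A legitimate retry after a transient tool error would also be flagged by the duplicate check, so a production filter would likely need a more forgiving rule.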

RLVR and Reinforcement Fine-Tuning

Reinforcement Learning with Verifiable Rewards (RLVR) trains the model to maximize a reward based on whether the task outcome is correct, rather than to imitate a reference trajectory. The key insight from DeepSeek-R1: RL on tasks with verifiable rewards (math answers, code tests) can produce strong emergent reasoning.

  1. Policy Model: the current agent model
  2. Run Task: the agent attempts the task
  3. Verifiable Reward: did the task succeed? +1 / -1
  4. Policy Update: GRPO / PPO gradient step

↺ iterate
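
The reward and update steps of this loop can be illustrated without a training framework. The sketch below assumes an exact-match verifier (real verifiers might run unit tests or check tool state instead) and shows the group-relative advantage commonly described for GRPO: several rollouts of the same task are scored, and each reward is normalized by the group's mean and standard deviation. The function names are hypothetical, and the use of the population standard deviation is a choice made here; implementations differ.

```python
from statistics import mean, pstdev


def verifiable_reward(final_answer: str, expected: str) -> float:
    """+1 if the outcome matches the known-correct answer, otherwise -1."""
    return 1.0 if final_answer.strip() == expected.strip() else -1.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """
    Normalize rewards within a group of rollouts for the same task.
    Rollouts that beat the group average get a positive advantage,
    the rest get a negative one; eps avoids division by zero when
    every rollout received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one task whose correct answer is "42"
answers = ["42", "41", " 42 ", "unknown"]
rewards = [verifiable_reward(a, "42") for a in answers]
advantages = group_relative_advantages(rewards)
```

When every rollout in a group succeeds or every rollout fails, all advantages are zero and that task contributes no gradient signal, which is one reason task difficulty matters for this kind of training.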

Leading Methods (2025–2026)

Method | Approach | Key Result
VerlTool | RL on tool-use trajectories with tool execution as the verifier | +15% tool accuracy on API-Call-Bench vs SFT-only
Trinity-RFT | Three-stage: curriculum hard-case mining, process-level reward shaping, and KV-cache reuse for efficiency | State-of-the-art on MATH, AIME, and agent benchmarks
ATLAS | Scales RLVR to small models (1B–7B) via tool-selection reward shaping | 7B model matches GPT-4o on tool benchmarks
Tool-R0 | RL via self-play: model generates tasks for itself and trains on outcomes | Demonstrates RL-driven reasoning emergence for tools
ToolPO | Preference optimization with fine-grained credit assignment to specific tool calls | Reduces hallucinated tool use by 40% vs standard DPO
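
For context on the last row, the sketch below shows the standard DPO loss for a single preference pair. This is the baseline the table compares against, not ToolPO itself, whose objective is not given in this chapter. Standard DPO scores each trajectory with one whole-sequence log-probability, so it cannot tell which tool call made the rejected trajectory worse. That is the gap that per-call credit assignment is meant to address. The function name and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """
    Standard DPO loss for one (chosen, rejected) pair:
        -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                             - (log pi(y_l) - log pi_ref(y_l))])
    Inputs are summed token log-probabilities of each full sequence.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    x = beta * margin
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a numerically stable way
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

The loss equals log 2 when the policy and the reference model agree on the pair, and it falls as the policy raises the chosen trajectory's likelihood relative to the rejected one.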

Practical Fine-Tuning Recipes

Minimum Viable Recipe

  1. Collect 500–2000 golden trajectories. Source them from human annotation or by filtering high-quality production runs (success = true, steps < 15, no error retries).
  2. SFT for 2–3 epochs on a 7B base model. Use LoRA (rank 16, alpha 32) to reduce GPU memory; do a full fine-tune if you have A100s.
  3. Evaluate on a held-out eval set. Compare task success rate (TSR), steps per task, and cost against the prompt-only baseline on the same tasks.
  4. Add RLVR if SFT TSR < 80%. Use ToolPO or VerlTool if you have verifiable task outcomes, and GRPO for compute efficiency.

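
Steps 3 and 4 of the recipe can be sketched as a small evaluation helper. It reads TSR as task success rate and assumes each evaluation run is logged with a success flag, a step count, and a dollar cost. `RunResult`, `summarize`, and `needs_rlvr` are hypothetical names, and the numbers in the usage lines are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    success: bool     # did the agent complete the task?
    steps: int        # number of agent steps taken
    cost_usd: float   # inference cost of the run


def summarize(runs: list[RunResult]) -> dict:
    """Aggregate TSR, mean steps per task, and mean cost per task."""
    if not runs:
        raise ValueError("need at least one run to summarize")
    n = len(runs)
    return {
        "tsr": sum(r.success for r in runs) / n,
        "steps_per_task": sum(r.steps for r in runs) / n,
        "cost_per_task": sum(r.cost_usd for r in runs) / n,
    }


def needs_rlvr(sft_summary: dict, tsr_threshold: float = 0.80) -> bool:
    """Recipe step 4: add an RLVR stage when SFT alone falls below the TSR threshold."""
    return sft_summary["tsr"] < tsr_threshold


# Illustrative data only: the same four held-out tasks run by the fine-tuned model
sft_runs = [
    RunResult(True, 4, 0.02),
    RunResult(False, 9, 0.05),
    RunResult(True, 5, 0.03),
    RunResult(True, 6, 0.02),
]
sft_summary = summarize(sft_runs)
```

Running `summarize` on the prompt-only baseline over the same tasks gives the comparison the recipe asks for; a fine-tuned model that raises TSR but doubles steps or cost may not be a net win.
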
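python: format trajectory data for SFT
import json

# Placeholder: replace with the system prompt your agent uses in production.
AGENT_SYSTEM_PROMPT = "You are a helpful agent with access to tools."


def format_trajectory_for_sft(trajectory: dict) -> dict:
    """
    Convert an agent trajectory into the OpenAI fine-tuning message format.
    trajectory = {
        "goal": str,
        "steps": [{"thought": str, "action": dict | None, "observation": str | None}],
        "final_answer": str,
    }
    Each action is assumed to look like {"name": str, "arguments": dict}.
    """
    messages = [
        {"role": "system", "content": AGENT_SYSTEM_PROMPT},
        {"role": "user",   "content": trajectory["goal"]},
    ]

    for i, step in enumerate(trajectory["steps"]):
        thought = f"Thought: {step['thought']}" if step.get("thought") else None
        action = step.get("action")

        if action is None:
            # Reasoning-only step: emit the thought as a plain assistant turn
            if thought:
                messages.append({"role": "assistant", "content": thought})
            continue

        # Thought and tool call share one assistant turn instead of two consecutive ones
        call_id = f"call_{i}"
        messages.append({
            "role": "assistant",
            "content": thought,
            "tool_calls": [{
                "id": call_id,
                "type": "function",
                "function": {
                    "name": action["name"],
                    # The chat format expects arguments as a JSON-encoded string
                    "arguments": json.dumps(action.get("arguments", {})),
                },
            }],
        })
        # Every tool call needs a matching tool message, linked by tool_call_id
        messages.append({
            "role": "tool",
            "tool_call_id": call_id,
            "content": step.get("observation") or "",
        })

    messages.append({"role": "assistant", "content": trajectory["final_answer"]})

    return {"messages": messages}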
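
Fine-tuning services that accept this message format generally expect a JSONL file with one `{"messages": [...]}` object per line. The helper below is a minimal sketch of that last step. `write_sft_jsonl` is a hypothetical name, and the only validation it performs (skipping examples that do not end on an assistant turn) is an assumption about what makes an example usable, not a full schema check.

```python
import json
from pathlib import Path


def write_sft_jsonl(examples: list[dict], path: str) -> int:
    """
    Write formatted examples to a JSONL file, one JSON object per line.
    Examples with no messages, or whose last message is not an assistant
    turn, are skipped because they give the model nothing to learn to produce.
    Returns the number of examples written.
    """
    written = 0
    with Path(path).open("w", encoding="utf-8") as f:
        for example in examples:
            messages = example.get("messages", [])
            if not messages or messages[-1].get("role") != "assistant":
                continue
            f.write(json.dumps(example, ensure_ascii=False) + "\n")
            written += 1
    return written
```

Comparing the returned count with `len(examples)` is a quick way to see how many trajectories were dropped before training starts.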

Chapter 20 Quiz

1. What is the key limitation of SFT-only fine-tuning for agents?

2. How does ToolPO improve over standard DPO for tool-use fine-tuning?

3. In the RLVR training loop, what provides the reward signal?