Agent-R: Self-Training for LLM Agents
- Agent-R is an iterative self-training framework that equips LLM agents to self-reflect and repair mistakes in long-horizon, partially observable environments.
- It leverages Monte Carlo Tree Search to generate paired good and bad trajectories, creating step-level revision data for precise policy updates.
- Empirical evaluations in environments like WebShop and ScienceWorld demonstrate enhanced agent resilience and increased performance over traditional behavior cloning methods.
Agent-R is an iterative self-training framework designed to equip LLM agents with the capability to reflect on and correct their own errors in interactive, multi-turn environments. Unlike prior approaches based solely on behavior cloning (BC) or reward-penalty methods that focus on terminal outcomes, Agent-R emphasizes timely detection and recovery from mistakes within long-horizon partially observable Markov decision processes (POMDPs). The core innovation lies in automatically constructing step-level “revision” data by combining Monte Carlo Tree Search (MCTS) for trajectory exploration with a model-guided critique mechanism that identifies the earliest self-detectable error, forming the basis for targeted policy updates and improved agent resilience (Yuan et al., 20 Jan 2025).
1. Motivations and Background
Contemporary LLM agents trained with behavior cloning on optimal trajectories often exhibit brittleness in interactive settings: any early deviation from expert behavior typically cascades into compounding failures or action loops, as the model lacks exposure to error states and recovery strategies. Scalar reward-based approaches such as direct preference optimization (DPO) or standard reinforcement learning (RL) only weakly inform the policy due to sparse and delayed feedback, further limiting the agent’s capacity for trajectory-level introspection and timely correction. Manual step-level critique curation is prohibitively expensive and not scalable across complex domains. Agent-R was developed to address these limitations by automating the generation of revision examples that explicitly teach the agent to “reflect and repair” its own actions at fine granularity (Yuan et al., 20 Jan 2025).
2. Monte Carlo Tree Search for Trajectory Data Generation
Agent-R leverages MCTS to systematically explore and partition the trajectory space in interactive environments, thereby furnishing raw material for constructing both good and bad rollouts.
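Each node in the tree corresponds to a partial trajectory, and the current actor policy $\pi_\theta$ proposes the candidate actions. Each search iteration has four stages:
- Selection (UCT): Starting from the root, repeatedly descend to the child $s$ that maximizes the sum of its value estimate and an exploration bonus, $\mathrm{UCT}(s) = Q(s) + c_{\mathrm{uct}} \sqrt{\ln N(p(s)) / N(s)}$, where $Q(s)$ is the average reward observed through $s$, $N(s)$ is its visit count, $p(s)$ is its parent, and $c_{\mathrm{uct}}$ is the exploration constant.
- Expansion: Upon reaching a non-terminal leaf, sample child actions from the actor policy (temperature 1) and add them to the tree.
- Simulation (Rollouts): Sample full rollouts from the new leaves using a default policy and record the terminal reward $r$ of each.
- Backup: Propagate each terminal reward $r$ back along the visited path, updating every node on it as $N(s) \leftarrow N(s) + 1$ and $Q(s) \leftarrow Q(s) + (r - Q(s)) / N(s)$, so that $Q(s)$ remains the Monte Carlo average of the rewards seen through $s$.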
Through this process, Agent-R collects clusters of high-reward (good) and low-reward (bad) trajectories sharing common prefixes, enabling systematic revision data construction.
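The four search stages can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Node` class, the toy two-action environment, the `propose` and `rollout` stand-ins (which replace the LLM actor and the real environment), and the reward cutoffs used to label rollouts as good or bad. It is not the authors' implementation.

```python
import math
import random


class Node:
    """A tree node identified by the action sequence (partial trajectory) reaching it."""

    def __init__(self, actions, parent=None):
        self.actions = actions
        self.parent = parent
        self.children = []
        self.visits = 0    # N(s)
        self.value = 0.0   # Q(s): running mean of rollout rewards


def uct(node, c=1.0):
    """Value estimate plus exploration bonus; unvisited children are tried first."""
    if node.visits == 0:
        return math.inf
    return node.value + c * math.sqrt(math.log(node.parent.visits) / node.visits)


def backup(node, reward):
    """Walk from the node to the root, updating counts and running means."""
    while node is not None:
        node.visits += 1
        node.value += (reward - node.value) / node.visits
        node = node.parent


def search(propose, rollout, max_depth, n_sim, rng):
    """Run n_sim iterations and return the root plus every finished rollout."""
    root, finished = Node([]), []
    for _ in range(n_sim):
        node = root
        while node.children:                             # selection
            node = max(node.children, key=uct)
        if len(node.actions) < max_depth:                # expansion
            node.children = [Node(node.actions + [a], node)
                             for a in propose(node.actions, rng)]
            node = rng.choice(node.children)
        trajectory, reward = rollout(node.actions, rng)  # simulation
        finished.append((trajectory, reward))
        backup(node, reward)                             # backup
    return root, finished


# Toy stand-ins (hypothetical): two actions, reward = fraction matching TARGET.
TARGET = ["a", "b", "a"]


def propose(prefix, rng):
    return ["a", "b"]


def rollout(prefix, rng):
    trajectory = list(prefix)
    while len(trajectory) < len(TARGET):
        trajectory.append(rng.choice(["a", "b"]))
    reward = sum(x == y for x, y in zip(trajectory, TARGET)) / len(TARGET)
    return trajectory, reward


root, finished = search(propose, rollout, max_depth=len(TARGET), n_sim=200,
                        rng=random.Random(0))
good = [t for t, r in finished if r == 1.0]   # illustrative cutoff
bad = [t for t, r in finished if r < 0.5]     # illustrative cutoff
```

Because every rollout starts from a node in the same tree, many of the resulting good and bad trajectories share a prefix, which is the property the revision construction relies on.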
3. Model-Guided Critique and Revision Trajectory Construction
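Crucial to Agent-R is a procedure for splicing trajectories at the minimal sufficient “reflection point,” not merely at endpoints. For each pair of trajectories derived from a shared prefix, one classified as good ($\tau^g$) and one as bad ($\tau^b$), the actor model operates in verifier mode to detect the first erroneous action in $\tau^b$ given the task instruction $u$. Transition-point discovery proceeds step by step: the verifier walks through $\tau^b$ in order; at each step it sees the instruction and the interaction history so far and judges whether the latest action is erroneous; the scan stops at the first step judged erroneous, and that step becomes the splice point.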
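Once the minimal splice point $t'$ is identified, the revision trajectory $\tau^r$ is constructed by truncating the bad trajectory at $t'$, appending a revision signal, and continuing with the good trajectory:
$$\tau^r = \left(\tau^b_{\le t'},\ rs,\ \tau^g\right)$$
where $\tau^b_{\le t'}$ is the bad trajectory truncated at step $t'$, $\tau^g$ supplies the corrective continuation, and $rs$ denotes a “revision signal” (e.g., explicit reflection text emitted by the agent), which is incorporated into supervised training.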
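The splice can be sketched in a few lines. This is an illustrative sketch under stated assumptions: `judge_is_error` is a hypothetical callable standing in for the actor model in verifier mode, the steps are plain strings, the fallback to the endpoint when nothing is flagged is a choice of this sketch, and which part of the good trajectory is appended is left to the caller.

```python
def find_transition_point(bad_steps, judge_is_error):
    """Return the 1-based index of the first step the verifier flags.

    Falls back to the last step when nothing is flagged (an assumption of
    this sketch, equivalent to splicing at the endpoint).
    """
    for t in range(1, len(bad_steps) + 1):
        if judge_is_error(bad_steps[:t]):
            return t
    return len(bad_steps)


def build_revision(bad_steps, good_steps, t, revision_signal):
    """Bad prefix up to step t, then the revision signal, then the good steps."""
    return bad_steps[:t] + [revision_signal] + good_steps


# Hypothetical WebShop-style steps taken after a shared prefix.
bad = ["click[next page]", "click[blue sandals]", "click[buy now]"]
good = ["click[red sneakers]", "click[buy now]"]


def toy_verifier(history):
    # Stand-in for the actor in verifier mode: flags a click on the wrong item.
    return "blue" in history[-1]


signal = "I picked the wrong item; let me correct that."
t = find_transition_point(bad, toy_verifier)
revision = build_revision(bad, good, t, signal)
```

In this toy case the verifier accepts the first step and flags the second, so the revision keeps two bad steps, inserts the reflection text, and then follows the good steps. The final bad step never enters the training example.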
4. Training Objective and Optimization
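Agent-R utilizes a composite negative log-likelihood (NLL) objective comprising standard instruction-following data, optimal trajectories, and revision trajectories, with a mixture weight $\eta$ balancing agent data against general data:
$$\mathcal{L}(\theta) = \eta \, \mathcal{L}_{\mathrm{agent}}(\theta) + (1 - \eta) \, \mathcal{L}_{\mathrm{general}}(\theta)$$
Here $\mathcal{L}_{\mathrm{agent}}$ is the NLL of the optimal and revision trajectories and $\mathcal{L}_{\mathrm{general}}$ is the NLL of the general instruction-following data, both under the actor policy $\pi_\theta$.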
The agent is thus incentivized not only to imitate expert strategies but also to emit revision signals and resume optimal behavior after recognizing and reflecting upon its own errors.
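A small numeric sketch shows how a mixture-weighted NLL of this kind combines the two data sources. The function names, the weight name `eta`, its value, and the token probabilities are all illustrative assumptions. A real training run would compute token log-probabilities with the model rather than take them as inputs.

```python
import math


def nll(token_log_probs):
    """Mean negative log-likelihood over the target tokens of one sequence."""
    return -sum(token_log_probs) / len(token_log_probs)


def composite_loss(agent_batch, general_batch, eta):
    """Weighted sum of the mean NLL on agent data and on general data."""
    agent = sum(nll(seq) for seq in agent_batch) / len(agent_batch)
    general = sum(nll(seq) for seq in general_batch) / len(general_batch)
    return eta * agent + (1 - eta) * general


# Agent data: optimal and revision trajectories (toy token log-probabilities).
agent_batch = [[math.log(0.5), math.log(0.5)], [math.log(0.25)]]
# General data: instruction-following examples.
general_batch = [[math.log(0.5)]]

loss = composite_loss(agent_batch, general_batch, eta=0.2)
```

With `eta` below 0.5 the general data dominates the gradient signal, and with `eta` above 0.5 the agent trajectories dominate.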
5. Iterative Refinement Loop
Agent-R alternates between two phases across multiple iterations:
- Phase I: Collect fresh revision and good trajectories using the current actor via MCTS and model-guided splicing.
- Phase II: Supervised fine-tuning of the actor on the accumulated revision, good, and general data.
Across iterations, reflection accuracy improves (earlier error detection, shorter revision prefixes), and the reward thresholds that define “good” rollouts are made increasingly stringent. This makes the agent more robust to error accumulation, reduces action loops, and leads to higher terminal rewards.
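The two-phase loop can be outlined as follows. This is a structural sketch only: `collect` and `fine_tune` are hypothetical callables standing in for MCTS-based data collection and supervised fine-tuning, the threshold values are placeholders, and continuing from the latest actor on the accumulated pool is an assumption of the sketch.

```python
def agent_r_loop(actor, collect, fine_tune, general_data, good_thresholds):
    """Alternate data collection (Phase I) and fine-tuning (Phase II)."""
    pool = []
    for threshold in good_thresholds:
        revision, good = collect(actor, threshold)      # Phase I
        pool.extend(revision)
        pool.extend(good)
        actor = fine_tune(actor, pool, general_data)    # Phase II
    return actor, pool


# Toy stand-ins: the "actor" is just a counter of completed fine-tuning rounds.
def toy_collect(actor, threshold):
    return [f"revision@{actor}"], [f"good@{actor}>={threshold}"]


def toy_fine_tune(actor, pool, general_data):
    return actor + 1


final_actor, pool = agent_r_loop(0, toy_collect, toy_fine_tune, ["general"],
                                 good_thresholds=[0.5, 0.7, 0.9])
```

Passing a rising sequence of thresholds mirrors the increasingly strict definition of a good rollout, and each round's data is gathered by the most recently trained actor.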
6. Empirical Evaluation
6.1 Environments
Agent-R was evaluated in three interactive settings:
- WebShop: Real-world web-shopping simulation.
- ScienceWorld: Text-based 5th-grade science QA environment.
- TextCraft: Minecraft-like crafting task.
6.2 Baselines
Comparisons included closed-source LLMs (e.g., GPT-3.5/4.0, Claude-3), leading open-source agents (Llama-3.1-8B-Instruct, AgentLM, Agent-FLAN), contrastive preference tuning (ETO), and a “Direct-Revision” baseline (splicing only at rollout endpoints).
6.3 Results and Ablations
Agent-R attained an average reward of 70.71% (vs. 65.12% for ETO), a +5.59 percentage point absolute improvement. Ablation studies showed:
- Training solely on optimal rollouts markedly degraded self-reflection capacity, increasing undesirable loops.
- Splicing at the earliest detected error outperformed Direct-Revision, which splices only at rollout endpoints.
- Performance steadily increased over three iterative cycles.
- Multi-task agent fine-tuning across environments outperformed task-specific training.
| Environment | Metric | Agent-R | Best Baseline (ETO) | Absolute Gain |
|---|---|---|---|---|
| WebShop | Avg. Reward | 70.71% | 65.12% | +5.59 pp |
| ScienceWorld | Avg. Reward | -- | -- | -- |
| TextCraft | Success Rate | -- | -- | -- |
Editor’s note: Table excerpts numerical results as provided for WebShop; full details referenced in (Yuan et al., 20 Jan 2025).
7. Hyperparameterization, Limitations, and Future Directions
Key training and search hyperparameters include:
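- MCTS: 200–300 simulations per task; 8 rollouts per node expansion; 4 candidate actions per node; a fixed maximum search depth and UCT exploration constant, whose values are reported in (Yuan et al., 20 Jan 2025).
- Revision pools: a reward threshold separating bad from good trajectories, plus quality thresholds for good trajectories; the specific values are reported in (Yuan et al., 20 Jan 2025).
- Fine-tuning: 3 global iterations; 3 epochs in the first iteration, then 1 per iteration; 3% warmup with a cosine learning-rate schedule; batch size 1 per GPU with 16 gradient-accumulation steps; maximum sequence length 8192; gradient clipping at 1; the base learning rate and the agent-data mixture weight are reported in (Yuan et al., 20 Jan 2025).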
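As a rough illustration of the fine-tuning schedule, the sketch below computes a learning rate with 3% warmup followed by cosine decay, and the effective batch size implied by per-GPU batch size and gradient accumulation. The linear warmup shape, decay to zero, the base learning rate, the step count, and the GPU count are assumptions chosen for the example.

```python
import math


def lr_at(step, total_steps, base_lr, warmup_frac=0.03):
    """Linear warmup for the first warmup_frac of steps, then cosine decay to 0."""
    warmup = max(1, round(total_steps * warmup_frac))
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))


def effective_batch(per_gpu, accumulation, num_gpus):
    """Sequences contributing to each optimizer step."""
    return per_gpu * accumulation * num_gpus


BASE_LR = 1e-5       # placeholder value, not taken from the source
TOTAL_STEPS = 1000   # placeholder value

schedule = [lr_at(s, TOTAL_STEPS, BASE_LR) for s in range(TOTAL_STEPS + 1)]
batch = effective_batch(per_gpu=1, accumulation=16, num_gpus=8)  # 8 GPUs assumed
```

With batch size 1 per GPU and 16 accumulation steps, each optimizer step sees 16 sequences per GPU, so the effective batch scales linearly with the number of devices.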
Principal limitations:
- MCTS sampling is computationally intensive. A plausible implication is that scalable variants may require replacing tree search with learned critics or Q-functions capable of proposing revision candidates.
- Revision prompts (such as “ten revision thoughts”) currently require manual design.
- Agent-R is restricted to text-only environments; extending to multimodal or tool-augmented settings is an open research frontier.
- Potential improvements include leveraging a separate critic model to generate richer, higher-fidelity step-level feedback rather than relying solely on the self-verifier approach.
Agent-R constitutes the first framework to integrate (a) MCTS-guided generation of paired adverse and optimal rollouts, (b) actor-based early error detection to define revision points, and (c) iterative agent refinement on structured self-critique trajectories, resulting in substantial policy robustness gains for LLM agents in complex interactive domains (Yuan et al., 20 Jan 2025).