Agent-R: Self-Training for LLM Agents

Updated 1 April 2026
  • Agent-R is an iterative self-training framework that equips LLM agents to self-reflect and repair mistakes in long-horizon, partially observable environments.
  • It leverages Monte Carlo Tree Search to generate paired good and bad trajectories, creating step-level revision data for precise policy updates.
  • Empirical evaluations in environments like WebShop and ScienceWorld demonstrate enhanced agent resilience and increased performance over traditional behavior cloning methods.

Agent-R is an iterative self-training framework designed to equip LLM agents with the capability to reflect on and correct their own errors in interactive, multi-turn environments. Unlike prior approaches based solely on behavior cloning (BC) or reward-penalty methods that focus on terminal outcomes, Agent-R emphasizes timely detection and recovery from mistakes within long-horizon partially observable Markov decision processes (POMDPs). The core innovation lies in automatically constructing step-level “revision” data by combining Monte Carlo Tree Search (MCTS) for trajectory exploration with a model-guided critique mechanism that identifies the earliest self-detectable error, forming the basis for targeted policy updates and improved agent resilience (Yuan et al., 20 Jan 2025).

1. Motivations and Background

Contemporary LLM agents trained with behavior cloning on optimal trajectories often exhibit brittleness in interactive settings: any early deviation from expert behavior typically cascades into compounding failures or action loops, as the model lacks exposure to error states and recovery strategies. Scalar reward-based approaches such as direct preference optimization (DPO) or standard reinforcement learning (RL) only weakly inform the policy due to sparse and delayed feedback, further limiting the agent’s capacity for trajectory-level introspection and timely correction. Manual step-level critique curation is prohibitively expensive and not scalable across complex domains. Agent-R was developed to address these limitations by automating the generation of revision examples that explicitly teach the agent to “reflect and repair” its own actions at fine granularity (Yuan et al., 20 Jan 2025).

2. Monte Carlo Tree Search for Trajectory Data Generation

Agent-R leverages MCTS to systematically explore and partition the trajectory space in interactive environments, thereby furnishing raw material for constructing both good and bad rollouts.

Each node $s$ in the MCTS corresponds to a partial trajectory $\tau_{t}=(a_{1},o_{1},...,a_{t},o_{t})$, with actions proposed by the current actor policy $\pi_\theta(a|s)$. The tree search operates as follows:

  • Selection (UCT): The next node is chosen to maximize the sum of expected value and an exploration bonus:

$$s_{\rm next} = \arg\max_{s' \in \mathrm{children}(s)} \Bigl(Q(s') + c_{\rm uct}\sqrt{\tfrac{\ln N(s)}{N(s')}}\Bigr)$$

  • Expansion: Upon reaching a non-terminal leaf, generate $m$ child actions via $\pi_\theta$ (with temperature 1), expanding the tree.
  • Simulation (Rollouts): Sample $k$ full rollouts from new leaves using a default policy, and assign terminal rewards $r(\tau)$.
  • Backup: Monte Carlo averages $Q(s)$ are updated via backpropagation along visitation paths:

$$N(s)\leftarrow N(s)+1,\quad Q(s)\leftarrow Q(s)+\tfrac{1}{N(s)}\bigl(r-Q(s)\bigr)$$

Through this process, Agent-R collects clusters of high-reward (good) and low-reward (bad) trajectories sharing common prefixes, enabling systematic revision data construction.
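The selection and backup rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Node` class, its field names, and the choice to visit unvisited children first are assumptions made for the example.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """One search-tree node, i.e. a partial trajectory (illustrative names)."""
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0      # N(s)
    value: float = 0.0   # Q(s), running mean of rollout rewards


def uct_select(node: Node, c_uct: float) -> Node:
    """Pick the child maximizing Q(s') + c_uct * sqrt(ln N(s) / N(s'))."""
    def score(child: Node) -> float:
        if child.visits == 0:
            return math.inf  # assumption: unvisited children are tried first
        bonus = c_uct * math.sqrt(math.log(node.visits) / child.visits)
        return child.value + bonus
    return max(node.children, key=score)


def backup(leaf: Node, reward: float) -> None:
    """Propagate one rollout reward from a leaf up to the root."""
    node: Optional[Node] = leaf
    while node is not None:
        node.visits += 1                                   # N(s) <- N(s) + 1
        node.value += (reward - node.value) / node.visits  # incremental mean
        node = node.parent
```

Because the backup is an incremental mean, `value` at any node equals the average reward of all rollouts that passed through it, which is what the good/bad partition relies on.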

3. Model-Guided Critique and Revision Trajectory Construction

Crucial to Agent-R is a procedure for splicing trajectories at the minimal sufficient “reflection point,” not merely at endpoints. For each pair of trajectories derived from a shared prefix, one classified as good ($\tau^{g}$) and one as bad ($\tau^{b}$), the actor model operates in a verifier mode to detect the first erroneous action in $\tau^{b}$ given the task instruction $u$. Transition point discovery proceeds in order through the actions of $\tau^{b}$: the verifier judges each action in turn and stops at the first one it flags as an error, and the index of that step becomes the transition point.

Once the minimal splice point $t'$ is identified, the revised trajectory $\tau^{r}$ is constructed as:

$$\tau^{r} = \bigl(\tau^{b}_{1:t'},\ rs,\ \tau^{g}\bigr)$$

where $\tau^{b}_{1:t'}$ is the bad trajectory truncated at the splice point, $\tau^{g}$ is the good trajectory continued from the shared prefix, and $rs$ denotes a “revision signal” (e.g., explicit reflection text emitted by the agent), which is incorporated into supervised training.
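As a rough sketch of the two steps described in this section, the following Python finds the first flagged step in a bad trajectory and splices in the revision signal and the good continuation. The `is_error` callback is a hypothetical stand-in for querying the actor in verifier mode, the `(action, observation)` tuple layout is an assumption, and falling back to the trajectory endpoint when nothing is flagged is a design choice of this example rather than something stated in the source.

```python
from typing import Callable, List, Sequence, Tuple

Step = Tuple[str, str]  # (action, observation); layout is an assumption


def first_error_index(
    bad: Sequence[Step],
    is_error: Callable[[Sequence[Step]], bool],
) -> int:
    """Return the 1-based index t' of the first step the verifier flags.

    `is_error` is a hypothetical verifier hook; it receives the bad
    trajectory up to and including the step under inspection.
    Falls back to the final step if nothing is flagged (an assumption).
    """
    for t in range(1, len(bad) + 1):
        if is_error(bad[:t]):
            return t
    return len(bad)


def build_revision(
    bad: Sequence[Step],
    good: Sequence[Step],
    t_prime: int,
    revision_signal: str,
) -> List[object]:
    """Splice: bad steps 1..t', then the revision signal, then the good steps."""
    return [*bad[:t_prime], revision_signal, *good]
```

Here `bad` and `good` are taken to hold only the steps that follow the shared prefix, so the spliced result would be appended to that prefix before training.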

4. Training Objective and Optimization

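Agent-R utilizes a composite negative log-likelihood (NLL) objective comprising standard instruction-following data, optimal trajectories, and revision trajectories, parameterized by a mixture weight $\eta$:

$$\mathcal{L}(\theta) = -\,\eta\;\mathbb{E}_{(u,\tau)\sim\mathcal{D}_{\rm agent}}\bigl[\log \pi_\theta(\tau \mid u)\bigr] \;-\; (1-\eta)\;\mathbb{E}_{(x,y)\sim\mathcal{D}_{\rm general}}\bigl[\log \pi_\theta(y \mid x)\bigr]$$

where $\mathcal{D}_{\rm agent}$ pools the optimal and revision trajectories, each paired with its task instruction $u$, and $\mathcal{D}_{\rm general}$ is the instruction-following data.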

The agent is thus incentivized not only to imitate expert strategies but also to emit revision signals and resume optimal behavior after recognizing and reflecting upon its own errors.
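To make the mixing concrete, here is a small sketch that combines NLL on agent data (optimal plus revision trajectories) with NLL on general instruction-following data. It assumes the mixture is a convex combination controlled by a single weight; the function names and the use of precomputed per-sequence log-probabilities are illustrative choices, not details taken from the source.

```python
from typing import Sequence


def mean_nll(log_probs: Sequence[float]) -> float:
    """Average negative log-likelihood over a batch of sequence log-probs."""
    return -sum(log_probs) / len(log_probs)


def mixture_loss(
    agent_log_probs: Sequence[float],
    general_log_probs: Sequence[float],
    agent_weight: float,
) -> float:
    """Weighted sum of NLL on agent data and NLL on general data.

    Assumes a convex combination: agent_weight on the agent term and
    (1 - agent_weight) on the general term.
    """
    if not 0.0 <= agent_weight <= 1.0:
        raise ValueError("agent_weight must lie in [0, 1]")
    return (agent_weight * mean_nll(agent_log_probs)
            + (1.0 - agent_weight) * mean_nll(general_log_probs))
```

For example, with agent log-probabilities `[-1.0, -3.0]`, a single general log-probability `-4.0`, and a weight of 0.25, the agent term contributes 0.25 × 2.0 and the general term 0.75 × 4.0.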

5. Iterative Refinement Loop

Agent-R alternates between two phases across multiple iterations:

  • Phase I: Collect fresh revision and good trajectories using the current actor via MCTS and model-guided splicing.
  • Phase II: Supervised fine-tuning of the actor on the accumulated revision, good, and general data.

Across iterations, reflection accuracy improves (earlier error detection, shorter revision prefixes), and the thresholds defining “good” rollouts are made increasingly stringent. This hardens the agent against error accumulation, reduces loops, and leads to higher terminal rewards.
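The two-phase loop can be outlined as a short driver function. This is a skeleton under stated assumptions: `collect` and `fine_tune` are hypothetical callbacks standing in for MCTS-based data collection and supervised fine-tuning, and the per-iteration threshold schedule is supplied by the caller because the source does not give its values here.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Model = TypeVar("Model")
Example = TypeVar("Example")


def self_train(
    actor: Model,
    collect: Callable[[Model, float], List[Example]],
    fine_tune: Callable[[Model, Sequence[Example]], Model],
    good_thresholds: Sequence[float],
) -> Tuple[Model, List[Example]]:
    """Alternate data collection (Phase I) and fine-tuning (Phase II).

    One iteration runs per entry in `good_thresholds`; passing an
    increasing sequence makes the "good" criterion stricter over time.
    """
    pool: List[Example] = []
    for threshold in good_thresholds:
        pool.extend(collect(actor, threshold))  # Phase I: fresh trajectories
        actor = fine_tune(actor, pool)          # Phase II: train on accumulated data
    return actor, pool
```

Keeping the pool across iterations mirrors the "accumulated" data mentioned in Phase II; a variant that trains only on the latest batch would simply reset `pool` at the top of the loop.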

6. Empirical Evaluation

6.1 Environments

Agent-R was evaluated in three interactive settings:

  • WebShop: Real-world web-shopping simulation.
  • ScienceWorld: Text-based 5th-grade science QA environment.
  • TextCraft: Minecraft-like crafting task.

6.2 Baselines

Comparisons included closed-source LLMs (e.g., GPT-3.5/4.0, Claude-3), leading open-source agents (Llama-3.1-8B-Instruct, AgentLM, Agent-FLAN), contrastive preference tuning (ETO), and a “Direct-Revision” baseline (splicing only at rollout endpoints).

6.3 Results and Ablations

Agent-R attained an average reward of 70.71% (vs. 65.12% for ETO), a +5.59 percentage point absolute improvement. Ablation studies showed:

  • Training solely on optimal rollouts markedly degraded self-reflection capacity, increasing undesirable loops.
  • Early error-detection splice outperformed Direct-Revision.
  • Performance steadily increased over three iterative cycles.
  • Multi-task agent fine-tuning across environments outperformed task-specific training.

Environment    Metric        Agent-R  Best Baseline (ETO)  Absolute Gain
WebShop        Avg. Reward   70.71%   65.12%               +5.59 pp
ScienceWorld   Avg. Reward   --       --                   --
TextCraft      Success Rate  --       --                   --

Editor’s note: The table excerpts the numerical results as provided for WebShop; full details are given in (Yuan et al., 20 Jan 2025).

7. Hyperparameterization, Limitations, and Future Directions

Key training and search hyperparameters include:

  • MCTS: 200–300 simulations per task; 8 rollouts per node expansion; a capped maximum search depth; 4 candidate actions per node; a UCT exploration constant.
  • Revision pools: a reward threshold separating bad from good trajectories; quality thresholds for good trajectories.
  • Fine-tuning: 3 global iterations; 3 epochs initially, then 1; a learning rate with 3% warmup and a cosine schedule; batch size 1 per GPU (gradient accumulation 16); max sequence length 8192; gradient clipping at 1; an agent data mixture weight.
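The warmup-plus-cosine learning-rate schedule listed above can be illustrated with a small function. Only the 3% warmup fraction and the cosine shape come from the text; the linear warmup, the decay to zero, and the `peak_lr` and `total_steps` arguments are assumptions of this sketch.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float,
          warmup_frac: float = 0.03) -> float:
    """Learning rate at a 0-based optimizer step.

    Linear warmup over the first `warmup_frac` of steps, then cosine decay
    to zero. Linear warmup and the zero floor are assumptions.
    """
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

With gradient accumulation of 16 and a per-GPU batch size of 1, one optimizer step in this schedule corresponds to 16 sequences per GPU.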

Principal limitations:

  • MCTS sampling is computationally intensive. A plausible implication is that scalable variants may require replacing tree search with learned critics or Q-functions capable of proposing revision candidates.
  • Revision prompts (such as “ten revision thoughts”) currently require manual design.
  • Agent-R is restricted to text-only environments; extending to multimodal or tool-augmented settings is an open research frontier.
  • Potential improvements include leveraging a separate critic model to generate richer, higher-fidelity step-level feedback rather than relying solely on the self-verifier approach.

Agent-R constitutes the first framework to integrate (a) MCTS-guided generation of paired adverse and optimal rollouts, (b) actor-based early error detection to define revision points, and (c) iterative agent refinement on structured self-critique trajectories, resulting in substantial policy robustness gains for LLM agents in complex interactive domains (Yuan et al., 20 Jan 2025).
