- The paper introduces Exploring Expert Failures (EEF), a technique that integrates beneficial actions from failed expert trajectories to improve LLM agent fine-tuning on complex tasks.
- EEF outperforms all baselines, achieving a 62.0% win rate on WebShop 11k and a reward of 81.3 on ScienceWorld, while exploring more efficiently than RFT.
- The method employs behavior cloning, targeted exploration of intermediate states, and selective reinforcement fine-tuning to mitigate simplicity bias.
This paper introduces Exploring Expert Failures (EEF), a novel fine-tuning method for LLM agents designed to improve performance on complex tasks where even expert models (like GPT-4) often fail.
The standard approach, Rejection Sampling Fine-Tuning (RFT), fine-tunes smaller LLMs on successful trajectories generated by an expert model and subsequently by the agent itself. However, RFT tends to favor simpler subtasks where success is more frequent, leaving many complex subtasks unsolved and out-of-distribution (OOD). This leads to diminishing returns as training progresses.
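For contrast with EEF, here is a minimal sketch of one RFT iteration as described above; the helper names (`rollout`, `sft_update`) and the binary reward convention are illustrative assumptions rather than the paper's interface:

```python
# Rejection Sampling Fine-Tuning (RFT) iteration, minimal sketch.
# Assumed callables (not from the paper):
#   rollout(policy, state)    -> trajectory with a binary .reward attribute
#   sft_update(policy, trajs) -> policy fine-tuned (SFT) on those trajectories
def rft_iteration(policy, tasks, rollout, sft_update, rollouts_per_task=6):
    successes = []
    for task in tasks:
        for _ in range(rollouts_per_task):
            traj = rollout(policy, task.initial_state)
            if traj.reward == 1:          # keep only successful trajectories
                successes.append(traj)
    # Easy subtasks succeed far more often and dominate `successes`,
    # which is the simplicity bias EEF aims to mitigate.
    return sft_update(policy, successes)
```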
The core motivation behind EEF is the observation that failed expert trajectories on these difficult subtasks often contain valuable partial solutions, such as useful plans or critical actions (e.g., navigation steps like 'Next' or 'Back' in WebShop, or recovery actions after a mistake). Existing methods either discard these failed trajectories entirely (like RFT) or treat all actions within them as uniformly negative (like ETO or NAT), failing to leverage potentially beneficial steps.
EEF addresses this by integrating beneficial actions identified within failed expert trajectories into the training process. It operates iteratively, similar to RFT, with three main phases:
- Behavior Cloning: The agent is first fine-tuned (SFT) on the successful expert trajectories only, using the SFT loss $\mathcal{L}_{\text{SFT}}$, to acquire basic skills.
- Exploration: In each iteration, the agent explores in two ways:
- From the initial states of all training subtasks (like RFT).
- Crucially, from selected intermediate states within failed expert trajectories. To manage computational cost, EEF simulates from $M$ states sampled at regular intervals along each failed expert trajectory $\tau_e = [s_0, a_0, s_1, \ldots]$ (i.e., the states $s_l, s_{2l}, \ldots, s_{M \times l}$, where $l = \lfloor |\tau_e| / (M+1) \rfloor$); see the sampling sketch after this list. All generated trajectories are collected, and successful ones are added to a repository $\mathcal{D}^+$.
- Reinforcement Fine-tuning: This phase identifies and trains on beneficial trajectories from $\mathcal{D}^+$.
- Important State Selection: Two types of states are considered important:
- Initial states $s_0$ of all subtasks (to avoid forgetting).
- "Need recovery" states si∗ from failed expert trajectories. These are identified as the first state si∗ in the sampled sequence where the agent simulation fails (R(τsi∗)=0), immediately following a state si∗−l where it succeeded (R(τsi∗−l)=1). This indicates potentially harmful expert actions between si∗−l and si∗.
- Solution Selection: For each important state, EEF selects at most one successful trajectory (solution) from the repository $\mathcal{D}^+$ that starts from or passes through that state. If multiple solutions exist, the one requiring the fewest initial expert actions (i.e., the one where the agent took over earliest) is preferred.
- Selective Training: The agent is fine-tuned with the SFT loss only on the actions that follow the identified important state within the selected solution trajectory; actions preceding the important state are masked out to avoid learning potentially harmful or irrelevant prior expert actions (see the masking sketch below).
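To make the interval sampling and "need recovery" detection described above concrete, here is a minimal sketch; the data layout (a plain list of expert states and a `simulate` callable that returns a binary reward for an agent rollout from a state) is an assumption for illustration, not the paper's implementation:

```python
# Sample M evenly spaced states along a failed expert trajectory
# tau_e = [s_0, a_0, s_1, ...], at interval l = floor(|tau_e| / (M + 1)).
def sample_intermediate_states(expert_states, M):
    l = len(expert_states) // (M + 1)
    if l == 0:
        return []
    return [(k * l, expert_states[k * l]) for k in range(1, M + 1)]

# A "need recovery" state is the first sampled state s_{i*} from which the
# agent's simulation fails (reward 0) right after a sampled state s_{i*-l}
# from which it succeeded (reward 1), suggesting a harmful expert action
# in between. `simulate` is an assumed callable: state -> binary reward.
def find_need_recovery_state(sampled_states, simulate):
    rewards = [simulate(state) for _, state in sampled_states]
    for k in range(1, len(sampled_states)):
        if rewards[k - 1] == 1 and rewards[k] == 0:
            return sampled_states[k]      # (index i*, state s_{i*})
    return None
```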
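The solution-selection and selective-masking steps could look roughly as follows in the same sketch style; the `Solution` container and the token-level label masking via an ignore index are assumptions of this sketch, not the paper's code:

```python
from dataclasses import dataclass

IGNORE_INDEX = -100  # conventional "ignore" label for token-level cross-entropy

@dataclass
class Solution:
    """Assumed container for a successful trajectory in D+ (not the paper's)."""
    visited_states: list            # states the trajectory passes through
    action_token_ids: list          # token ids of the action taken at each state
    num_expert_prefix_actions: int  # expert actions taken before the agent took over
    reward: int                     # 1 if the trajectory succeeded, else 0

# For one important state, keep at most one solution from D+: prefer the one
# that needed the fewest leading expert actions (the agent took over earliest).
def select_solution(important_state, repository):
    candidates = [t for t in repository
                  if t.reward == 1 and important_state in t.visited_states]
    if not candidates:
        return None
    return min(candidates, key=lambda t: t.num_expert_prefix_actions)

# Build SFT labels that train only on actions at or after the important state;
# earlier (possibly harmful or irrelevant) expert actions are masked out.
def build_selective_labels(solution, important_state):
    start = solution.visited_states.index(important_state)
    labels = []
    for step, action_tokens in enumerate(solution.action_token_ids):
        if step < start:
            labels.append([IGNORE_INDEX] * len(action_tokens))  # masked out
        else:
            labels.append(list(action_tokens))                  # trained on
    return labels
```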
Experiments were conducted on the challenging WebShop (11k and 3k datasets) and ScienceWorld environments using Llama 3 8B. Baselines included GPT-3.5 Turbo, GPT-4, SFT (All/Positive), NAT, ETO, and RFT (with 1 and 6 explorations per task).
Key findings:
- EEF significantly outperformed all baselines, including GPT-4 and the RFT variants, achieving state-of-the-art results (e.g., a 62.0% win rate on WebShop 11k vs. 53.6% for RFT×6 and 35.6% for GPT-4; a reward of 81.3 on ScienceWorld vs. 74.6 for RFT).
- EEF demonstrated better utilization of navigation skills (Next/Back) compared to RFT, suggesting it mitigates the simplicity bias by learning to solve harder tasks requiring such actions.
- EEF was shown to be more exploration-efficient than RFT, achieving better results with fewer simulations.
- A variant, EEF GPT-3&4, which incorporated additional, cheaper trajectories generated by GPT-3.5 Turbo during exploration and training, performed even better, showing EEF's robustness and ability to leverage weaker expert data effectively.
- The method also generalized well when tested with a different base model (Mistral-7B).
The paper concludes that EEF effectively leverages valuable information from failed expert trajectories, improving agent performance on complex tasks while retaining the simplicity of SFT-based fine-tuning.