- The paper introduces Exploring Expert Failures (EEF), a technique that integrates beneficial actions from failed expert trajectories to improve LLM agent fine-tuning on complex tasks.
- EEF outperforms all baselines, achieving a 62.0% win rate on WebShop 11k and a reward of 81.3 on ScienceWorld, while exploring more efficiently than RFT.
- The method employs behavior cloning, targeted exploration of intermediate states, and selective reinforcement fine-tuning to mitigate simplicity bias.
This paper introduces Exploring Expert Failures (EEF), a novel fine-tuning method for LLM agents designed to improve performance on complex tasks where even expert models (like GPT-4) often fail.
The standard approach, Rejection Sampling Fine-Tuning (RFT), fine-tunes smaller LLMs on successful trajectories generated by an expert model and subsequently by the agent itself. However, RFT tends to favor simpler subtasks where success is more frequent, leaving many complex subtasks unsolved and out-of-distribution (OOD). This leads to diminishing returns as training progresses.
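For contrast with EEF, here is a minimal sketch of one RFT iteration as described above; the helper names (`rollout`, `sft_update`) and the binary reward convention are illustrative assumptions rather than the paper's interface:

```python
# Rejection Sampling Fine-Tuning (RFT) iteration, minimal sketch.
# Assumed callables (not from the paper):
#   rollout(policy, state)    -> trajectory with a binary .reward attribute
#   sft_update(policy, trajs) -> policy fine-tuned (SFT) on those trajectories
def rft_iteration(policy, tasks, rollout, sft_update, rollouts_per_task=6):
    successes = []
    for task in tasks:
        for _ in range(rollouts_per_task):
            traj = rollout(policy, task.initial_state)
            if traj.reward == 1:          # keep only successful trajectories
                successes.append(traj)
    # Easy subtasks succeed far more often and dominate `successes`,
    # which is the simplicity bias EEF aims to mitigate.
    return sft_update(policy, successes)
```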
The core motivation behind EEF is the observation that failed expert trajectories on these difficult subtasks often contain valuable partial solutions, such as useful plans or critical actions (e.g., navigation steps like 'Next' or 'Back' in WebShop, or recovery actions after a mistake). Existing methods either discard these failed trajectories entirely (like RFT) or treat all actions within them as uniformly negative (like ETO or NAT), failing to leverage potentially beneficial steps.
EEF addresses this by integrating beneficial actions identified within failed expert trajectories into the training process. It operates iteratively, similar to RFT, with three main phases:
- Behavior Cloning: The agent is first fine-tuned (SFT) on the successful expert trajectories only, using the SFT loss $\mathcal{L}_{\text{SFT}}$, to acquire basic skills.
- Exploration: In each iteration, the agent explores in two ways:
- From the initial states of all training subtasks (like RFT).
- Crucially, from selected intermediate states within failed expert trajectories. To manage computational cost, EEF simulates from $M$ states sampled at regular intervals along each failed expert trajectory $\tau_e = [s_0, a_0, s_1, \ldots]$ (i.e., the states $s_l, s_{2l}, \ldots, s_{M \times l}$, where $l = \lfloor |\tau_e| / (M+1) \rfloor$); see the sampling sketch after this list. All generated trajectories are collected, and successful ones are added to a repository $\mathcal{D}^+$.
- Reinforcement Fine-tuning: This phase identifies and trains on beneficial trajectories from $\mathcal{D}^+$.
- Important State Selection: Two types of states are considered important:
- Initial states $s_0$ of all subtasks (to avoid forgetting).
- "Need recovery" states si∗ from failed expert trajectories. These are identified as the first state si∗ in the sampled sequence where the agent simulation fails (R(τsi∗)=0), immediately following a state si∗−l where it succeeded (R(τsi∗−l)=1). This indicates potentially harmful expert actions between si∗−l and si∗.
- Solution Selection: For each important state, EEF selects at most one successful trajectory (solution) from the repository $\mathcal{D}^+$ that starts from or passes through that state. If multiple solutions exist, the one requiring the fewest initial expert actions (i.e., the one where the agent took over earliest) is preferred.
- Selective Training: The agent is fine-tuned with the SFT loss only on the actions that follow the identified important state within the selected solution trajectory; actions preceding the important state are masked out to avoid learning potentially harmful or irrelevant prior expert actions (see the masking sketch below).
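To make the interval sampling and "need recovery" detection described above concrete, here is a minimal sketch; the data layout (a plain list of expert states and a `simulate` callable that returns a binary reward for an agent rollout from a state) is an assumption for illustration, not the paper's implementation:

```python
# Sample M evenly spaced states along a failed expert trajectory
# tau_e = [s_0, a_0, s_1, ...], at interval l = floor(|tau_e| / (M + 1)).
def sample_intermediate_states(expert_states, M):
    l = len(expert_states) // (M + 1)
    if l == 0:
        return []
    return [(k * l, expert_states[k * l]) for k in range(1, M + 1)]

# A "need recovery" state is the first sampled state s_{i*} from which the
# agent's simulation fails (reward 0) right after a sampled state s_{i*-l}
# from which it succeeded (reward 1), suggesting a harmful expert action
# in between. `simulate` is an assumed callable: state -> binary reward.
def find_need_recovery_state(sampled_states, simulate):
    rewards = [simulate(state) for _, state in sampled_states]
    for k in range(1, len(sampled_states)):
        if rewards[k - 1] == 1 and rewards[k] == 0:
            return sampled_states[k]      # (index i*, state s_{i*})
    return None
```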
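The solution-selection and selective-masking steps could look roughly as follows in the same sketch style; the `Solution` container and the token-level label masking via an ignore index are assumptions of this sketch, not the paper's code:

```python
from dataclasses import dataclass

IGNORE_INDEX = -100  # conventional "ignore" label for token-level cross-entropy

@dataclass
class Solution:
    """Assumed container for a successful trajectory in D+ (not the paper's)."""
    visited_states: list            # states the trajectory passes through
    action_token_ids: list          # token ids of the action taken at each state
    num_expert_prefix_actions: int  # expert actions taken before the agent took over
    reward: int                     # 1 if the trajectory succeeded, else 0

# For one important state, keep at most one solution from D+: prefer the one
# that needed the fewest leading expert actions (the agent took over earliest).
def select_solution(important_state, repository):
    candidates = [t for t in repository
                  if t.reward == 1 and important_state in t.visited_states]
    if not candidates:
        return None
    return min(candidates, key=lambda t: t.num_expert_prefix_actions)

# Build SFT labels that train only on actions at or after the important state;
# earlier (possibly harmful or irrelevant) expert actions are masked out.
def build_selective_labels(solution, important_state):
    start = solution.visited_states.index(important_state)
    labels = []
    for step, action_tokens in enumerate(solution.action_token_ids):
        if step < start:
            labels.append([IGNORE_INDEX] * len(action_tokens))  # masked out
        else:
            labels.append(list(action_tokens))                  # trained on
    return labels
```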
Experiments were conducted on the challenging WebShop (11k and 3k datasets) and ScienceWorld environments using Llama 3 8B. Baselines included GPT-3.5 Turbo, GPT-4, SFT (All/Positive), NAT, ETO, and RFT (with 1 and 6 explorations per task).
Key findings:
- EEF significantly outperformed all baselines, including GPT-4 and the RFT variants, achieving state-of-the-art results (e.g., a 62.0% win rate on WebShop 11k vs. 53.6% for RFT×6 and 35.6% for GPT-4; a reward of 81.3 on ScienceWorld vs. 74.6 for RFT).
- EEF demonstrated better utilization of navigation skills (Next/Back) compared to RFT, suggesting it mitigates the simplicity bias by learning to solve harder tasks requiring such actions.
- EEF was shown to be more exploration-efficient than RFT, achieving better results with fewer simulations.
- A variant, EEF GPT-3&4, which incorporated additional, cheaper trajectories generated by GPT-3.5 Turbo during exploration and training, performed even better, showing EEF's robustness and ability to leverage weaker expert data effectively.
- The method also generalized well when tested with a different base model (Mistral-7B).
The paper concludes that EEF effectively leverages valuable information from failed expert trajectories, improving agent performance on complex tasks while retaining the simplicity of SFT-based fine-tuning.