Self-Improvement in Long-Horizon Web Agent Tasks Using LLMs
Abstract and Contribution Summary
The paper investigates the capability of LLMs to self-improve their performance on complex, long-horizon tasks in the WebArena benchmark environment. Using synthetic training data generated by the model itself, the research demonstrates a 31% improvement in task completion rate over the baseline model. Novel evaluation metrics are also proposed to assess the multi-dimensional improvements achieved through self-improvement. The paper focuses on fine-tuning with three distinct synthetic training data mixtures and shows that significant gains can be made without relying on external supervised datasets.
Introduction and Problem Statement
The challenge of training models to function as effective agents in intricate environments such as web browsers has been a longstanding issue, primarily due to a scarcity of task-specific training data. While LLMs have excelled in traditional NLP tasks through zero-shot and few-shot learning paradigms, these methods are inadequate for equipping LLMs to undertake complex, multi-step, long-horizon interactions with their environments.
The authors pivot to self-improvement techniques, which leverage the model's own generations to create synthetic data, minimizing the dependency on labeled data and the associated cost. This paper pioneers the application of these techniques to the WebArena benchmark, designed as a rigorous test for LLMs that must navigate and act on web interfaces to meet specified objectives.
Methodology
The core of the method is fine-tuning on synthetic data, partitioned into in-domain and out-of-domain examples. The paper uses the WebArena benchmark and studies performance gains across multiple data mixtures (a small construction sketch follows the list):
- Mixture A: Only in-domain synthetic examples.
- Mixture B: Combination of in-domain and out-of-domain synthetic examples.
- Mixture C: Only out-of-domain synthetic examples.
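Below is a minimal sketch, assuming the filtered synthetic trajectories are already available as plain Python lists, of how the three mixtures could be assembled; the function and variable names are illustrative, not the authors' actual data pipeline.

```python
from typing import Dict, List

def build_mixtures(in_domain: List[dict], out_of_domain: List[dict]) -> Dict[str, List[dict]]:
    """Assemble the three fine-tuning mixtures described above (hypothetical helper)."""
    return {
        "A": list(in_domain),                        # in-domain synthetic examples only
        "B": list(in_domain) + list(out_of_domain),  # combination of both
        "C": list(out_of_domain),                    # out-of-domain synthetic examples only
    }
```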
A fundamental aspect is the formulation of the WebArena environment as a Markov decision process in order to collect plausible trajectories (sequences of actions). These trajectories form a synthetic dataset from which low-quality examples are removed by an unsupervised filtering mechanism. Out-of-domain examples further diversify the training data with novel tasks and trajectories that differ markedly from those in the benchmark, improving the agent's robustness and versatility.
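As a rough illustration of this data-collection stage, the sketch below represents each rollout as an MDP-style trajectory and applies a simple unsupervised filter; the invalid-action check and length cap are assumed heuristics for illustration, not necessarily the paper's exact filtering criteria.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    observation: str   # e.g., an accessibility-tree or DOM snapshot of the page
    action: str        # e.g., "click [1234]" or "type [56] 'query'"
    valid: bool        # whether the environment accepted the action

@dataclass
class Trajectory:
    objective: str     # natural-language task description
    steps: List[Step]

def filter_trajectories(rollouts: List[Trajectory], max_len: int = 30) -> List[Trajectory]:
    """Keep trajectories with no rejected actions and a bounded length (assumed heuristics)."""
    return [
        traj for traj in rollouts
        if len(traj.steps) <= max_len and all(s.valid for s in traj.steps)
    ]
```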
Evaluation Metrics
To evaluate self-improvement objectively, the paper introduces additional metrics:
- Functional Correctness Score: Measures the binary task completion rate.
- Capability Score: Assesses the acquisition and retention of unique capabilities, taking into account the similarity of task templates.
- VERTEX Score: An adaptation of the VERTEX score that uses dynamic time warping to align and compare variable-length trajectories, making the measure sensitive to incremental improvements and to trajectory quality (see the DTW sketch after this list).
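The following is a minimal sketch of dynamic time warping over action sequences, in the spirit of the adapted VERTEX score; the per-step similarity used by the authors is not reproduced here, so a simple 0/1 string-mismatch cost is assumed.

```python
from typing import Callable, List

def dtw_distance(a: List[str], b: List[str],
                 step_cost: Callable[[str, str], float] = lambda x, y: 0.0 if x == y else 1.0) -> float:
    """Align two variable-length action sequences with DTW and return an average step cost."""
    n, m = len(a), len(b)
    INF = float("inf")
    # dp[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = step_cost(a[i - 1], b[j - 1])
            dp[i][j] = c + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    # Normalize by combined length so longer trajectories are not penalized outright.
    return dp[n][m] / (n + m)

# Example: a predicted trajectory vs. a reference trajectory of different length.
pred = ["click [12]", "type [7] 'shoes'", "click [99]"]
ref  = ["click [12]", "type [7] 'shoes'", "scroll down", "click [99]"]
print(dtw_distance(pred, ref))  # small distance despite unequal lengths
```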
Experiments and Results
The experiments explored various self-improvement fine-tuning settings:
- Baseline Performance: Compared against a trivial agent and the base model.
- Improvement Evaluation: The fine-tuned models displayed substantial gains in both functional correctness and capability scores. Specifically, Mixture B showed the highest improvement with a 31% increase in task completion.
- Iterative Self-Improvement: Additional rounds of self-improvement were tested but showed diminishing returns, suggesting the initial round yields the largest gains (a loop sketch follows this list).
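For reference, the outer loop of iterative self-improvement can be sketched as below; the sampling, filtering, and fine-tuning stages are passed in as placeholders, since the paper's implementation details are not reproduced here.

```python
from typing import Callable, List, TypeVar

Model = TypeVar("Model")
Traj = TypeVar("Traj")

def self_improve(model: Model,
                 sample: Callable[[Model], List[Traj]],           # roll the model out in WebArena
                 keep: Callable[[List[Traj]], List[Traj]],        # unsupervised trajectory filter
                 finetune: Callable[[Model, List[Traj]], Model],  # one fine-tuning round
                 rounds: int = 2) -> Model:
    """Each round fine-tunes the model on its own filtered rollouts."""
    for _ in range(rounds):
        rollouts = sample(model)
        model = finetune(model, keep(rollouts))
    return model
```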
Discussion
The results underscore several critical insights:
- Self-Improvement Efficacy: Both in-domain and out-of-domain synthetic examples contribute to improved performance, with Mixture B (a combination of both) offering the most substantial gains.
- Capability Acquisition: Models demonstrated the ability to acquire new capabilities, albeit with some trade-offs, such as an increase in the number of invalid actions.
- Trajectory Quality: While Mixture C yielded unique capabilities, it produced longer and somewhat lower-quality trajectories than the other mixtures.
- Iterative Gains: Successive rounds of self-improvement yielded marginal improvements, revealing limitations in the iterative approach using synthetic data.
Broader Impacts and Future Directions
The techniques showcased offer a promising path towards elevating LLM performance in complex agent tasks without heavy reliance on labeled datasets. This has broad implications for AI applications requiring sophisticated environment interactions, such as automated web browsing, virtual assistants, and more.
Future research could expand the scope to more diverse and larger benchmarks, incorporate human-supervised filtering to further enhance synthetic data quality, and explore the integration of multi-modal inputs for more robust agent behavior.
Conclusion
This paper illustrates that LLMs can significantly enhance their agent capabilities through self-improvement techniques, especially within the context of long-horizon, multi-step tasks like those in WebArena. The findings hold potential for advancing practical applications of LLMs while opening pathways for further refinement and broadening the scope of self-improvement methodologies.