Self-Improvement in Long-Horizon Web Agent Tasks Using LLMs
Abstract and Contribution Summary
The paper investigates the capability of LLMs to self-improve their performance on complex, long-horizon tasks in the WebArena benchmark environment. Using synthetic training data generated by the model itself, the research demonstrates a 31% improvement in task completion rate over the baseline model. Novel evaluation metrics are also proposed to assess the multi-dimensional improvements achieved through self-improvement. The paper focuses on fine-tuning with three distinct synthetic training data mixtures and shows that significant gains can be made without relying on external supervised datasets.
Introduction and Problem Statement
The challenge of training models to function as effective agents in intricate environments such as web browsers has been a longstanding issue, primarily due to a scarcity of task-specific training data. While LLMs have excelled in traditional NLP tasks through zero-shot and few-shot learning paradigms, these methods are inadequate for equipping LLMs to undertake complex, multi-step, long-horizon interactions with their environments.
The authors pivot to self-improvement techniques, which leverage the model's own generations to create synthetic data, minimizing the dependency on labeled data and the associated cost. This paper pioneers the application of these techniques to the WebArena benchmark, designed as a rigorous test for LLMs that must navigate and act on web interfaces to meet specified objectives.
Methodology
The core of the method is fine-tuning on synthetic data, partitioned into in-domain and out-of-domain examples. The paper uses the WebArena benchmark and studies performance gains across multiple data mixtures (a small construction sketch follows the list):
- Mixture A: Only in-domain synthetic examples.
- Mixture B: Combination of in-domain and out-of-domain synthetic examples.
- Mixture C: Only out-of-domain synthetic examples.
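Below is a minimal sketch, assuming the filtered synthetic trajectories are already available as plain Python lists, of how the three mixtures could be assembled; the function and variable names are illustrative, not the authors' actual data pipeline.

```python
from typing import Dict, List

def build_mixtures(in_domain: List[dict], out_of_domain: List[dict]) -> Dict[str, List[dict]]:
    """Assemble the three fine-tuning mixtures described above (hypothetical helper)."""
    return {
        "A": list(in_domain),                        # in-domain synthetic examples only
        "B": list(in_domain) + list(out_of_domain),  # combination of both
        "C": list(out_of_domain),                    # out-of-domain synthetic examples only
    }
```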
A fundamental aspect is the formulation of the WebArena environment as a Markov decision process in order to collect plausible trajectories (sequences of actions). These trajectories form a synthetic dataset from which low-quality examples are removed by an unsupervised filtering mechanism. Out-of-domain examples further diversify the training data with novel tasks and trajectories that differ markedly from those in the benchmark, improving the agent's robustness and versatility.
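As a rough illustration of this data-collection stage, the sketch below represents each rollout as an MDP-style trajectory and applies a simple unsupervised filter; the invalid-action check and length cap are assumed heuristics for illustration, not necessarily the paper's exact filtering criteria.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    observation: str   # e.g., an accessibility-tree or DOM snapshot of the page
    action: str        # e.g., "click [1234]" or "type [56] 'query'"
    valid: bool        # whether the environment accepted the action

@dataclass
class Trajectory:
    objective: str     # natural-language task description
    steps: List[Step]

def filter_trajectories(rollouts: List[Trajectory], max_len: int = 30) -> List[Trajectory]:
    """Keep trajectories with no rejected actions and a bounded length (assumed heuristics)."""
    return [
        traj for traj in rollouts
        if len(traj.steps) <= max_len and all(s.valid for s in traj.steps)
    ]
```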
Evaluation Metrics
To evaluate self-improvement objectively, the paper introduces additional metrics:
- Functional Correctness Score: Measures the binary task completion rate.
- Capability Score: Assesses the acquisition and retention of unique capabilities, taking into account the similarity of task templates.
- VERTEX Score: An adaptation of the VERTEX score that uses dynamic time warping to align and compare variable-length trajectories, making the measure sensitive to incremental improvements and to trajectory quality (see the DTW sketch after this list).
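The following is a minimal sketch of dynamic time warping over action sequences, in the spirit of the adapted VERTEX score; the per-step similarity used by the authors is not reproduced here, so a simple 0/1 string-mismatch cost is assumed.

```python
from typing import Callable, List

def dtw_distance(a: List[str], b: List[str],
                 step_cost: Callable[[str, str], float] = lambda x, y: 0.0 if x == y else 1.0) -> float:
    """Align two variable-length action sequences with DTW and return an average step cost."""
    n, m = len(a), len(b)
    INF = float("inf")
    # dp[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = step_cost(a[i - 1], b[j - 1])
            dp[i][j] = c + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    # Normalize by combined length so longer trajectories are not penalized outright.
    return dp[n][m] / (n + m)

# Example: a predicted trajectory vs. a reference trajectory of different length.
pred = ["click [12]", "type [7] 'shoes'", "click [99]"]
ref  = ["click [12]", "type [7] 'shoes'", "scroll down", "click [99]"]
print(dtw_distance(pred, ref))  # small distance despite unequal lengths
```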
Experiments and Results
The experiments explored various self-improvement fine-tuning settings:
- Baseline Performance: Compared against a trivial agent and the base model.
- Improvement Evaluation: The fine-tuned models displayed substantial gains in both functional correctness and capability scores. Specifically, Mixture B showed the highest improvement with a 31% increase in task completion.
- Iterative Self-Improvement: Additional rounds of self-improvement were tested but showed diminishing returns, suggesting the initial round yields the largest gains (a loop sketch follows this list).
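For reference, the outer loop of iterative self-improvement can be sketched as below; the sampling, filtering, and fine-tuning stages are passed in as placeholders, since the paper's implementation details are not reproduced here.

```python
from typing import Callable, List, TypeVar

Model = TypeVar("Model")
Traj = TypeVar("Traj")

def self_improve(model: Model,
                 sample: Callable[[Model], List[Traj]],           # roll the model out in WebArena
                 keep: Callable[[List[Traj]], List[Traj]],        # unsupervised trajectory filter
                 finetune: Callable[[Model, List[Traj]], Model],  # one fine-tuning round
                 rounds: int = 2) -> Model:
    """Each round fine-tunes the model on its own filtered rollouts."""
    for _ in range(rounds):
        rollouts = sample(model)
        model = finetune(model, keep(rollouts))
    return model
```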
Discussion
The results underscore several critical insights:
- Self-Improvement Efficacy: Both in-domain and out-of-domain synthetic examples contribute to improved performance, with Mixture B (a combination of both) offering the most substantial gains.
- Capability Acquisition: Models demonstrated the ability to acquire new capabilities, albeit with some trade-offs, such as an increase in the number of invalid actions.
- Trajectory Quality: While Mixture C yielded unique capabilities, it produced longer and somewhat lower-quality trajectories than the other mixtures.
- Iterative Gains: Successive rounds of self-improvement yielded marginal improvements, revealing limitations in the iterative approach using synthetic data.
Broader Impacts and Future Directions
The techniques showcased offer a promising path towards elevating LLM performance in complex agent tasks without heavy reliance on labeled datasets. This has broad implications for AI applications requiring sophisticated environment interactions, such as automated web browsing, virtual assistants, and more.
Future research could expand the scope to more diverse and larger benchmarks, incorporate human-supervised filtering to further enhance synthetic data quality, and explore the integration of multi-modal inputs for more robust agent behavior.
Conclusion
This paper illustrates that LLMs can significantly enhance their agent capabilities through self-improvement techniques, especially within the context of long-horizon, multi-step tasks like those in WebArena. The findings hold potential for advancing practical applications of LLMs while opening pathways for further refinement and broadening the scope of self-improvement methodologies.