This paper introduces WebRL (Qi et al., 4 Nov 2024), a framework for training capable web agents with open-source LLMs, addressing the high cost of proprietary models and the typical performance gap of open models on complex web interaction tasks. WebRL uses a self-evolving online curriculum reinforcement learning approach and achieves strong success rates on the WebArena-Lite benchmark.
The core problem WebRL tackles is bridging the performance gap between expensive proprietary LLM APIs (like GPT-4) and less capable open-source LLMs when used as web agents. Open LLMs often lack sufficient decision-making training data and struggle with online learning challenges. WebRL identifies and addresses three key challenges:
- Insufficiency of training tasks: Online benchmarks like WebArena provide limited evaluation tasks, insufficient for comprehensive training.
- Sparsity and cost of feedback signals: Web tasks often have long horizons (many steps) with rewards only upon final success or failure, making learning difficult. Evaluating success automatically is also challenging.
- Policy distribution drift in online learning: Online exploration and learning can lead to catastrophic forgetting of previously learned skills as the agent's policy changes.
To overcome these, WebRL integrates three main components:
1. Self-Evolving Online Curriculum
- Purpose: To continuously generate new, relevant training tasks, addressing task scarcity.
- Mechanism: In each training phase, WebRL uses instructions that the agent failed to complete in the previous phase as seeds. It employs an "in-breadth evolving" strategy (inspired by WizardLM) using a powerful LLM (like GPT-4o) to generate new, related instructions.
- Filtering: Generated tasks are filtered in two stages:
- Difficulty Filtering: The agent's critic (value network) evaluates the initial state of each potential new task. Only tasks with estimated values between 0.05 and 0.75 (moderately difficult) are kept.
- Feasibility Filtering: A separate GPT-4o prompt is used to automatically filter out tasks deemed infeasible within the WebArena environment based on predefined rules.
- Outcome: This creates a dynamic, progressively challenging set of tasks tailored to the agent's current capabilities, facilitating gradual learning. Figure 9 shows examples of how instructions evolve.
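To make the curriculum loop concrete, here is a minimal Python sketch of one generation-and-filtering step. The helpers `llm_generate`, `critic_value`, `is_feasible`, and `reset_env` are hypothetical placeholders rather than functions from the WebRL release; only the 0.05–0.75 critic-value thresholds come from the paper.

```python
# Minimal sketch of one curriculum phase. All helper functions are
# hypothetical placeholders; the value thresholds follow the paper.

def evolve_tasks(failed_instructions, n_variants=5, v_min=0.05, v_max=0.75):
    """Generate and filter new training tasks from last phase's failures."""
    candidates = []
    for seed in failed_instructions:
        # In-breadth evolution: ask a strong LLM (e.g., GPT-4o) for related tasks.
        prompt = (f"Rewrite the following web task into {n_variants} new, "
                  f"related tasks of similar scope:\n{seed}")
        candidates.extend(llm_generate(prompt))

    kept = []
    for task in candidates:
        # Difficulty filtering: keep tasks the critic rates as moderately hard.
        v = critic_value(task, initial_state=reset_env(task))
        if not (v_min <= v <= v_max):
            continue
        # Feasibility filtering: drop tasks judged impossible in WebArena.
        if not is_feasible(task):
            continue
        kept.append(task)
    return kept
```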
2. Outcome-Supervised Reward Model (ORM)
- Purpose: To provide a feedback signal for task completion in the absence of fine-grained environment rewards.
- Implementation: An LLM is trained to act as a binary classifier. It takes the task instruction, the agent's action history, and the HTML of the final state as input and outputs "YES" or "NO" to indicate task success.
- Training: The ORM is trained on trajectories from WebArena-Lite's training set (augmented with rewrites and variable changes) and rollouts from baseline methods, using the environment's ground-truth reward function for labels. The paper reports ~80% accuracy for its ORM (based on Llama-3.1-8B), outperforming GPT-4-based methods (Table 3).
- Usage: The ORM provides the reward signal (1 for success, 0 for failure) used in the RL training loop for newly generated tasks.
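A minimal sketch of wrapping such an ORM as a reward function is shown below. The call `orm_model.generate` stands in for whatever inference interface the fine-tuned Llama-3.1-8B classifier exposes, and the prompt wording is illustrative rather than the paper's actual template.

```python
# Sketch of using the ORM as a binary reward signal. `orm_model` is a
# hypothetical wrapper around the fine-tuned classifier.

def orm_reward(instruction, action_history, final_html):
    """Return 1.0 if the ORM judges the trajectory successful, else 0.0."""
    prompt = (
        "Task: " + instruction + "\n"
        "Actions taken:\n" + "\n".join(action_history) + "\n"
        "Final page HTML:\n" + final_html + "\n"
        "Did the agent complete the task? Answer YES or NO."
    )
    answer = orm_model.generate(prompt, max_new_tokens=1)
    return 1.0 if answer.strip().upper().startswith("YES") else 0.0
```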
3. Adaptive Reinforcement Learning Strategies
- Purpose: To optimize the agent's policy effectively using sparse rewards and prevent policy drift.
- Algorithm: WebRL employs an off-policy RL algorithm based on maximum entropy RL principles.
- KL-Constrained Policy Update: The core objective function (Eq 1) includes a KL divergence term constraining the current policy $\pi_\theta$ from deviating too far from a reference policy $\pi_{\text{ref}}$, which is the policy from the previous training phase.
This leads to a loss function (Eq 5) that minimizes the squared error between the $\beta$-scaled log-probability ratio and the advantage:
$$\mathcal{L}_{\text{policy}} = \mathbb{E}_{(s,a)}\!\left[\left(\beta \log \frac{\pi_\theta(a \mid s)}{\pi_{\text{ref}}(a \mid s)} - A(s,a)\right)^{2}\right]$$
The parameter $\beta$ controls the strength of the KL constraint, balancing learning new tasks against retaining previously acquired knowledge.
- Advantage Estimation: Generalized Advantage Estimation (GAE, Eq 8) is used, tailored to sparse binary rewards by focusing on the next-step and final-step advantage terms. The value function (critic) $V(s)$ is trained using a cross-entropy loss (Eq 7) appropriate for binary outcomes.
- Experience Replay Buffer with Actor Confidence Filtering:
- The replay buffer stores only successful trajectories from previous phases.
- When sampling from the buffer for training, experiences are filtered based on the current actor's perplexity on the stored actions. Only data with perplexity between 1/0.95 and 1/0.5 (moderately difficult for the current actor) is used. This prevents overfitting to overly easy past examples and avoids struggling with overly hard ones, ensuring data relevance.
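The two losses described above can be sketched in PyTorch as follows. This illustrates the stated objectives (squared error between the $\beta$-scaled log-probability ratio and the advantage, plus a binary cross-entropy critic loss); it is not the paper's exact implementation, and the variable names and default $\beta$ are assumptions.

```python
import torch
import torch.nn.functional as F

def policy_loss(logp_current, logp_reference, advantage, beta=0.1):
    """Squared error between the beta-scaled log-prob ratio and the advantage.

    logp_current:   log pi_theta(a|s) under the policy being trained
    logp_reference: log pi_ref(a|s) under the previous phase's (frozen) policy
    advantage:      advantage estimate A(s, a) derived from the critic
    """
    scaled_ratio = beta * (logp_current - logp_reference)
    return ((scaled_ratio - advantage) ** 2).mean()

def critic_loss(value_logits, outcome):
    """Cross-entropy loss for a critic predicting a binary task outcome.

    value_logits: critic logits for state s
    outcome:      1.0 if the trajectory succeeded, 0.0 otherwise
    """
    return F.binary_cross_entropy_with_logits(value_logits, outcome)
```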
Implementation Details
- Environment: WebArena, evaluated on WebArena-Lite (165 tasks, 5 websites: Reddit, Gitlab, CMS, Map, OSS).
- Models: Llama-3.1 (8B, 70B) and GLM-4-9B. RL training starts from models fine-tuned using Supervised Fine-Tuning (SFT) on the WebArena-Lite training set.
- Input/Output: The agent receives the instruction, the action history, and simplified HTML (with clickable elements tagged). It outputs actions like `Click(element_id)`, `Type(element_id, text)`, `Scroll(direction)`, etc. (see Appendix §B and Fig 9).
- Training: The process is iterative (Algorithm 1). Each phase involves task generation and filtering, rollouts, ORM evaluation, replay-buffer update, data sampling (new rollouts plus filtered buffer data), and actor/critic training; a sketch of one phase follows this list. Hyperparameters are provided in Appendix Table 4.
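Putting the pieces together, one training phase (cf. Algorithm 1) can be outlined roughly as below. All helpers (`evolve_tasks`, `rollout`, `orm_reward`, `action_perplexity`, `train_step`) are hypothetical placeholders; only the perplexity bounds and the successful-trajectories-only buffer policy come from the paper.

```python
# High-level sketch of one WebRL training phase (cf. Algorithm 1).
# Helper functions are hypothetical; perplexity bounds follow the paper.

PPL_MIN, PPL_MAX = 1 / 0.95, 1 / 0.5

def run_phase(actor, critic, failed_tasks, replay_buffer):
    # 1. Generate and filter new tasks seeded by last phase's failures.
    tasks = evolve_tasks(failed_tasks)

    # 2. Roll out the current actor and score trajectories with the ORM.
    rollouts = [rollout(actor, t) for t in tasks]
    for traj in rollouts:
        traj.reward = orm_reward(traj.instruction, traj.actions, traj.final_html)

    # 3. Keep only successful trajectories in the replay buffer.
    replay_buffer.extend(t for t in rollouts if t.reward == 1.0)

    # 4. Replay past successes the current actor finds moderately hard.
    replayed = [t for t in replay_buffer
                if PPL_MIN <= action_perplexity(actor, t) <= PPL_MAX]

    # 5. Update actor and critic on new rollouts plus filtered replay data.
    train_step(actor, critic, rollouts + replayed)

    # Failed tasks seed the next phase's curriculum.
    return [t.instruction for t in rollouts if t.reward == 0.0]
```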
Key Results
- WebRL significantly boosts the performance of open LLMs. Llama-3.1-8B improves from 4.8% to 42.4% success rate (SR) on WebArena-Lite. GLM-4-9B improves from 6.1% to 43.0%. Llama-3.1-70B reaches 49.1%.
- These results surpass strong proprietary baselines like GPT-4-Turbo (17.6%) and GPT-4o (13.9%), and previous open-source SOTA (AutoWebGLM, 18.2%).
- WebRL outperforms other RL methods like AWR and DigiRL, attributed mainly to the self-evolving curriculum adapting task difficulty, whereas DigiRL uses a fixed task set.
- Analysis shows WebRL improves performance on longer tasks (Fig 4), more complex tasks (Fig 6), and reduces specific errors like "Get Stuck Midway" and "Fail to Recover" (Fig 3).
- Ablation studies (Fig 5) confirm the importance of the curriculum, KL-constrained updates, and the filtered replay buffer. Filtering the replay buffer by perplexity (Table 2) and using an appropriate $\beta$ (Fig 8) are crucial.
Practical Implications
WebRL provides a concrete framework and practical techniques for training effective web agents using open-source LLMs. Its components address common challenges in online RL for agents:
- The self-evolving curriculum offers a way to generate tasks dynamically.
- The ORM provides a solution for environments with sparse or unavailable reward functions.
- The KL-constrained RL update with filtered replay offers a method to stabilize online learning and mitigate catastrophic forgetting.
The public release of code, models, and data associated with WebRL facilitates its adoption and further research in building more accessible and powerful autonomous web agents.