
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning (2411.02337v2)

Published 4 Nov 2024 in cs.CL

Abstract: LLMs have shown remarkable potential as autonomous agents, particularly in web-based tasks. However, existing LLM web agents heavily rely on expensive proprietary LLM APIs, while open LLMs lack the necessary decision-making capabilities. This paper introduces WebRL, a self-evolving online curriculum reinforcement learning framework designed to train high-performance web agents using open LLMs. WebRL addresses three key challenges in building LLM web agents, including the scarcity of training tasks, sparse feedback signals, and policy distribution drift in online learning. Specifically, WebRL incorporates 1) a self-evolving curriculum that generates new tasks from unsuccessful attempts, 2) a robust outcome-supervised reward model (ORM), and 3) adaptive reinforcement learning strategies to ensure consistent improvements. We apply WebRL to transform open Llama-3.1 and GLM-4 models into proficient web agents. On WebArena-Lite, WebRL improves the success rate of Llama-3.1-8B from 4.8% to 42.4%, and from 6.1% to 43% for GLM-4-9B. These open models significantly surpass the performance of GPT-4-Turbo (17.6%) and GPT-4o (13.9%) and outperform previous state-of-the-art web agents trained on open LLMs (AutoWebGLM, 18.2%). Our findings demonstrate WebRL's effectiveness in bridging the gap between open and proprietary LLM-based web agents, paving the way for more accessible and powerful autonomous web interaction systems.

This paper introduces WebRL (Qi et al., 4 Nov 2024), a framework designed to train capable web agents from open-source LLMs, addressing both the high cost of proprietary model APIs and the performance gap open models typically show on complex web interaction tasks. WebRL uses a self-evolving online curriculum reinforcement learning approach and achieves substantial success-rate gains on the WebArena-Lite benchmark.

The core problem WebRL tackles is bridging the performance gap between expensive proprietary LLM APIs (like GPT-4) and less capable open-source LLMs when used as web agents. Open LLMs often lack sufficient decision-making training data and struggle with online learning challenges. WebRL identifies and addresses three key challenges:

  1. Insufficiency of training tasks: Online benchmarks like WebArena provide limited evaluation tasks, insufficient for comprehensive training.
  2. Sparsity and cost of feedback signals: Web tasks often have long horizons (many steps) with rewards only upon final success or failure, making learning difficult. Evaluating success automatically is also challenging.
  3. Policy distribution drift in online learning: Online exploration and learning can lead to catastrophic forgetting of previously learned skills as the agent's policy changes.

To overcome these, WebRL integrates three main components:

1. Self-Evolving Online Curriculum

  • Purpose: To continuously generate new, relevant training tasks, addressing task scarcity.
  • Mechanism: In each training phase, WebRL uses instructions that the agent failed to complete in the previous phase as seeds. It employs an "in-breadth evolving" strategy (inspired by WizardLM) using a powerful LLM (like GPT-4o) to generate new, related instructions.
  • Filtering: Generated tasks are filtered in two stages:
    • Difficulty Filtering: The agent's critic (value network) evaluates the initial state of each potential new task. Only tasks with estimated values between 0.05 and 0.75 (moderately difficult) are kept.
    • Feasibility Filtering: A separate GPT-4o prompt is used to automatically filter out tasks deemed infeasible within the WebArena environment based on predefined rules.
  • Outcome: This creates a dynamic, progressively challenging set of tasks tailored to the agent's current capabilities, facilitating gradual learning. Figure 9 shows examples of how instructions evolve. (A minimal sketch of the two-stage filtering step follows this list.)
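
Below is a minimal sketch of the two-stage task filter described above. The 0.05–0.75 value band is the one reported in the paper; the `critic_value` and `is_feasible` callables are hypothetical stand-ins for the trained value network and the GPT-4o feasibility prompt.

```python
# Sketch of the two-stage filter applied to newly generated instructions.
from typing import Callable, List

def filter_generated_tasks(
    candidate_instructions: List[str],
    critic_value: Callable[[str], float],   # hypothetical: V(s_0 | instruction), in [0, 1]
    is_feasible: Callable[[str], bool],     # hypothetical: rule-based GPT-4o feasibility check
    low: float = 0.05,
    high: float = 0.75,
) -> List[str]:
    """Keep only moderately difficult, feasible tasks for the next training phase."""
    kept = []
    for instruction in candidate_instructions:
        v0 = critic_value(instruction)      # estimated value of the task's initial state
        if not (low <= v0 <= high):         # too easy (> 0.75) or too hard (< 0.05)
            continue
        if not is_feasible(instruction):    # drop tasks infeasible in WebArena
            continue
        kept.append(instruction)
    return kept
```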

2. Outcome-Supervised Reward Model (ORM)

  • Purpose: To provide a feedback signal for task completion in the absence of fine-grained environment rewards.
  • Implementation: An LLM is trained to act as a binary classifier. It takes the task instruction, the agent's action history, and the HTML of the final state as input and outputs "YES" or "NO" to indicate task success.
  • Training: The ORM is trained on trajectories from WebArena-Lite's training set (augmented with rewrites and variable changes) and rollouts from baseline methods, using the environment's ground-truth reward function for labels. The paper reports ~80% accuracy for their ORM (Llama-3.1-8B based), outperforming GPT-4 based methods (Table 3).
  • Usage: The ORM provides the reward signal (1 for success, 0 for failure) used in the RL training loop for newly generated tasks, as illustrated in the sketch below.
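
As a rough illustration, the ORM can be queried as a binary judge along the following lines. The exact prompt wording and the `orm_generate` wrapper are assumptions made here for illustration; only the YES/NO output convention and the 1/0 reward mapping come from the paper.

```python
# Sketch of querying the outcome-supervised reward model for a binary reward.
from typing import Callable, List

def orm_reward(
    instruction: str,
    action_history: List[str],
    final_html: str,
    orm_generate: Callable[[str], str],  # hypothetical wrapper around the fine-tuned ORM
) -> float:
    """Return 1.0 if the ORM judges the trajectory successful, else 0.0."""
    prompt = (
        f"Task: {instruction}\n"
        "Actions taken:\n" + "\n".join(action_history) + "\n"
        f"Final page HTML:\n{final_html}\n"
        "Did the agent complete the task? Answer YES or NO:"
    )
    answer = orm_generate(prompt).strip().upper()
    return 1.0 if answer.startswith("YES") else 0.0
```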

3. Adaptive Reinforcement Learning Strategies

  • Purpose: To optimize the agent's policy effectively using sparse rewards and prevent policy drift.
  • Algorithm: WebRL employs an off-policy RL algorithm based on maximum entropy RL principles.
  • KL-Constrained Policy Update: The core objective function (Eq 1) includes a KL divergence term constraining the current policy ($\pi_\theta$) from deviating too far from a reference policy ($\pi_{\text{ref}}$), which is the policy from the previous training phase.

    $$\max_{\pi_\theta} \; \mathbb{E}_{I \sim \rho(I),\, a_t \sim \pi_\theta(\cdot \mid s_t)} \left[ \sum_{t=0}^{T} \Big( r(s_t, a_t, I) + \beta \log \pi_{\text{ref}}(a_t \mid s_t, I) \Big) + \beta \mathcal{H}(\pi_\theta) \right]$$

    This leads to a loss function (Eq 5) that minimizes the squared error between the scaled log-probability ratio and the advantage:

    $$\mathcal{L}(\pi_\theta) = \mathbb{E}_{\nu} \left[ \left( \beta \log \frac{\pi_\theta(a \mid s, I)}{\pi_{\text{ref}}(a \mid s, I)} - A^*(s, a, I) \right)^2 \right]$$

    The parameter $\beta$ controls the strength of the KL constraint, balancing learning new tasks against retaining old knowledge. (A PyTorch sketch of this loss appears after this list.)

  • Advantage Estimation: Generalized Advantage Estimation (GAE, Eq 8) is used, tailored for sparse binary rewards by focusing on next-step and final-step advantages ($\lambda = 0.5$). The value function (critic, $V$) is trained using a cross-entropy loss (Eq 7) appropriate for binary outcomes.
  • Experience Replay Buffer with Actor Confidence Filtering:
    • The replay buffer stores only successful trajectories from previous phases.
    • When sampling from the buffer for training, experiences are filtered based on the current actor's perplexity on the stored actions. Only data with perplexity between 1/0.95 and 1/0.5 (moderately difficult for the current actor) is used. This prevents overfitting to overly easy past examples and avoids struggling with overly hard ones, ensuring data relevance. (A sketch of this filter follows the loss sketch below.)
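
A minimal PyTorch sketch of the squared-error loss in Eq 5, assuming per-action log-probabilities have already been gathered from the current and reference (previous-phase) policies and advantages have been estimated via GAE. The default `beta` value here is illustrative, not taken from the paper.

```python
import torch

def kl_constrained_policy_loss(
    logp_theta: torch.Tensor,  # log pi_theta(a | s, I) for the taken actions, shape [batch]
    logp_ref: torch.Tensor,    # log pi_ref(a | s, I) from the previous-phase policy, shape [batch]
    advantage: torch.Tensor,   # A*(s, a, I), e.g. from GAE with lambda = 0.5, shape [batch]
    beta: float = 0.1,         # KL-constraint strength (illustrative default)
) -> torch.Tensor:
    """Eq 5: minimize (beta * log(pi_theta / pi_ref) - A*)^2 over the batch."""
    log_ratio = logp_theta - logp_ref
    return ((beta * log_ratio - advantage) ** 2).mean()
```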

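And a sketch of the actor-confidence filter on replayed experiences. The 1/0.95–1/0.5 perplexity band is from the paper; the trajectory representation and the `action_token_logprobs` callable are assumptions.

```python
import math
from typing import Callable, List, Sequence

def filter_replay_by_actor_confidence(
    buffer: List[dict],                                        # successful past trajectories
    action_token_logprobs: Callable[[dict], Sequence[float]],  # current actor's per-token log-probs
    low: float = 1 / 0.95,                                     # ppl below this => too easy now
    high: float = 1 / 0.5,                                     # ppl above this => too hard now
) -> List[dict]:
    """Keep replayed experiences whose perplexity under the CURRENT actor is moderate."""
    kept = []
    for traj in buffer:
        logps = action_token_logprobs(traj)
        ppl = math.exp(-sum(logps) / max(len(logps), 1))       # perplexity of the stored actions
        if low <= ppl <= high:
            kept.append(traj)
    return kept
```
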
Implementation Details

  • Environment: WebArena, evaluated on WebArena-Lite (165 tasks, 5 websites: Reddit, Gitlab, CMS, Map, OSS).
  • Models: Llama-3.1 (8B, 70B) and GLM-4-9B. RL training starts from models fine-tuned using Supervised Fine-Tuning (SFT) on the WebArena-Lite training set.
  • Input/Output: The agent receives the instruction, action history, and simplified HTML (with clickable elements tagged). It outputs actions like Click(element_id), Type(element_id, text), Scroll(direction), etc. (See Appendix §B and Fig 9).
  • Training: The process is iterative (Algorithm 1). Each phase involves task generation/filtering, rollouts, ORM evaluation, buffer update, data sampling (rollouts + filtered buffer), and actor/critic training. Hyperparameters are provided in Appendix Table 4. (A high-level sketch of one phase follows.)
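
Putting the pieces together, one training phase roughly follows the loop below. Every object here (`agent`, `critic`, `orm`, `task_generator`, `env`, `replay_buffer`) is a hypothetical interface standing in for the paper's components, so treat this as a structural sketch of Algorithm 1 rather than its implementation.

```python
def run_phase(agent, critic, orm, task_generator, env, replay_buffer, failed_tasks):
    # 1. Self-evolving curriculum: evolve new tasks from last phase's failures, then filter them
    candidates = task_generator.evolve(failed_tasks)              # in-breadth evolution (GPT-4o)
    tasks = [t for t in candidates
             if 0.05 <= critic.value(env.initial_state(t)) <= 0.75  # difficulty filter
             and task_generator.feasible(t)]                        # feasibility filter

    # 2. Roll out the current policy and label outcomes with the ORM
    rollouts, new_failures = [], []
    for task in tasks:
        traj = agent.rollout(env, task)
        traj.reward = orm.judge(task, traj)                       # 1.0 on success, 0.0 otherwise
        rollouts.append(traj)
        if traj.reward == 1.0:
            replay_buffer.add(traj)                               # only successes are stored
        else:
            new_failures.append(task)                             # failures seed the next curriculum

    # 3. Train on fresh rollouts plus confidence-filtered replay data
    train_data = rollouts + replay_buffer.sample_by_actor_confidence(agent)
    agent.update(train_data, critic)                              # KL-constrained update (Eq 5)
    critic.update(train_data)                                     # cross-entropy value loss (Eq 7)
    return new_failures
```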

Key Results

  • WebRL significantly boosts the performance of open LLMs. Llama-3.1-8B improves from 4.8% to 42.4% success rate (SR) on WebArena-Lite. GLM-4-9B improves from 6.1% to 43.0%. Llama-3.1-70B reaches 49.1%.
  • These results surpass strong proprietary baselines like GPT-4-Turbo (17.6%) and GPT-4o (13.9%), and previous open-source SOTA (AutoWebGLM, 18.2%).
  • WebRL outperforms other RL methods like AWR and DigiRL, attributed mainly to the self-evolving curriculum adapting task difficulty, whereas DigiRL uses a fixed task set.
  • Analysis shows WebRL improves performance on longer tasks (Fig 4), more complex tasks (Fig 6), and reduces specific errors like "Get Stuck Midway" and "Fail to Recover" (Fig 3).
  • Ablation studies (Fig 5) confirm the importance of the curriculum, KL-constrained updates, and the filtered replay buffer. Filtering the replay buffer by perplexity (Table 2) and using an appropriate $\beta$ (Fig 8) are crucial.

Practical Implications

WebRL provides a concrete framework and practical techniques for training effective web agents using open-source LLMs. Its components address common challenges in online RL for agents:

  • The self-evolving curriculum offers a way to generate tasks dynamically.
  • The ORM provides a solution for environments with sparse or unavailable reward functions.
  • The KL-constrained RL update with filtered replay offers a method to stabilize online learning and mitigate catastrophic forgetting.

The public release of code, models, and data associated with WebRL facilitates its adoption and further research in building more accessible and powerful autonomous web agents.

References (53)
  1. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  2. Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning. arXiv preprint arXiv:2406.11896, 2024.
  3. Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217, 2023.
  4. Fireact: Toward language agent fine-tuning. arXiv preprint arXiv:2310.05915, 2023.
  5. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36, 2024.
  6. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
  7. Stop regressing: Training value functions via classification for scalable deep rl. arXiv preprint arXiv:2403.03950, 2024.
  8. Why generalization in rl is difficult: Epistemic pomdps and implicit partial observability. Advances in neural information processing systems, 34:25502–25515, 2021.
  9. Chatglm: A family of large language models from glm-130b to glm-4 all tools. arXiv preprint arXiv:2406.12793, 2024.
  10. Middleware for llms: Tools are instrumental for language agents in complex environments. arXiv preprint arXiv:2402.14672, 2024.
  11. A real-world webagent with planning, long context understanding, and program synthesis. arXiv preprint arXiv:2307.12856, 2023.
  12. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992, 2023.
  13. Webvoyager: Building an end-to-end web agent with large multimodal models. arXiv preprint arXiv:2401.13919, 2024.
  14. Cogagent: A visual language model for gui agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  14281–14290, 2024.
  15. Openwebagent: An open toolkit to enable web agents on large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pp.  72–81, 2024.
  16. SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66.
  17. Autowebglm: A large language model-based web navigating agent. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’24, pp.  5295–5306, 2024.
  18. Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688, 2023a.
  19. Visualagentbench: Towards large multimodal models as visual foundation agents. arXiv preprint arXiv:2408.06327, 2024.
  20. Bolaa: Benchmarking and orchestrating llm-augmented autonomous agents. arXiv preprint arXiv:2308.05960, 2023b.
  21. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, 1999.
  22. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
  23. Autonomous evaluation and refinement of digital agents. In First Conference on Language Modeling, 2024.
  24. Iterative reasoning preference optimization. arXiv preprint arXiv:2404.19733, 2024.
  25. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
  26. Agent q: Advanced reasoning and learning for autonomous ai agents. arXiv preprint arXiv:2408.07199, 2024.
  27. From r to q*: Your language model is secretly a q-function. arXiv preprint arXiv:2404.12358, 2024a.
  28. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024b.
  29. Androidinthewild: A large-scale dataset for android device control. Advances in Neural Information Processing Systems, 36, 2024.
  30. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
  31. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.
  32. Agent workflow memory. arXiv preprint arXiv:2409.07429, 2024.
  33. Os-copilot: Towards generalist computer agents with self-improvement. arXiv preprint arXiv:2402.07456, 2024.
  34. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. arXiv preprint arXiv:2404.07972, 2024.
  35. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
  36. Gpt-4v in wonderland: Large multimodal models for zero-shot smartphone gui navigation. arXiv preprint arXiv:2311.07562, 2023.
  37. Appagent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771, 2023.
  38. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022.
  39. Quiet-star: Language models can teach themselves to think before speaking. arXiv preprint arXiv:2403.09629, 2024.
  40. Agenttuning: Enabling generalized agent abilities for llms. arXiv preprint arXiv:2310.12823, 2023.
  41. Fine-tuning large vision-language models as decision-making agents via reinforcement learning. arXiv preprint arXiv:2405.10292, 2024.
  42. Ufo: A ui-focused agent for windows os interaction. arXiv preprint arXiv:2402.07939, 2024a.
  43. Agentohana: Design unified data and training pipeline for effective agent learning. arXiv preprint arXiv:2402.15506, 2024b.
  44. Android in the zoo: Chain-of-action-thought for gui agents. arXiv preprint arXiv:2403.02713, 2024c.
  45. Codeagent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. arXiv preprint arXiv:2401.07339, 2024d.
  46. Generative verifiers: Reward modeling as next-token prediction. arXiv preprint arXiv:2408.15240, 2024e.
  47. Webpilot: A versatile and autonomous multi-agent system for web task execution with strategic exploration. arXiv preprint arXiv:2408.15978, 2024f.
  48. You only look at screens: Multimodal chain-of-action agents. arXiv preprint arXiv:2309.11436, 2023.
  49. Gpt-4v (ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614, 2024.
  50. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.
  51. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023a.
  52. Llm as dba. arXiv preprint arXiv:2308.05481, 2023b.
  53. Archer: Training language model agents via hierarchical multi-turn rl. arXiv preprint arXiv:2402.19446, 2024.
Authors (14)
  1. Zehan Qi (13 papers)
  2. Xiao Liu (402 papers)
  3. Iat Long Iong (4 papers)
  4. Hanyu Lai (11 papers)
  5. Xueqiao Sun (3 papers)
  6. Xinyue Yang (6 papers)
  7. Jiadai Sun (16 papers)
  8. Yu Yang (213 papers)
  9. Shuntian Yao (4 papers)
  10. Tianjie Zhang (10 papers)
  11. Wei Xu (535 papers)
  12. Jie Tang (302 papers)
  13. Yuxiao Dong (119 papers)
  14. Wenyi Zhao (10 papers)