AutoWebGLM: A Large Language Model-based Web Navigating Agent (2404.03648v2)

Published 4 Apr 2024 in cs.CL

Abstract: LLMs have fueled many intelligent web agents, but most existing ones perform far from satisfactorily in real-world web navigation tasks due to three factors: (1) the complexity of HTML text data, (2) the versatility of actions on webpages, and (3) task difficulty due to the open-domain nature of the web. In light of these challenges, we develop the open AutoWebGLM based on ChatGLM3-6B. AutoWebGLM can serve as a powerful automated web navigation agent that outperforms GPT-4. Inspired by human browsing patterns, we first design an HTML simplification algorithm to represent webpages succinctly with vital information preserved. We then employ a hybrid human-AI method to build web browsing data for curriculum training. Finally, we bootstrap the model with reinforcement learning and rejection sampling to further facilitate webpage comprehension, browser operations, and efficient task decomposition on its own. For comprehensive evaluation, we establish a bilingual benchmark -- AutoWebBench -- for real-world web navigation tasks. We evaluate AutoWebGLM across diverse web navigation benchmarks, demonstrating its potential to tackle challenging tasks in real environments. Related code, model, and data are released at \url{https://github.com/THUDM/AutoWebGLM}.

AutoWebGLM: Innovations and Evaluations in AI-Powered Web Navigation Agents

Introduction to AutoWebGLM

The development of AutoWebGLM introduces a significant enhancement in web navigation agent capabilities, employing the ChatGLM3-6B model as its backbone. The agent outperforms strong baselines, including GPT-4, on automated web navigation benchmarks by taking a tailored approach to webpage understanding and interaction. Its contributions center on handling the core complexities of web navigation: unifying diverse action spaces, simplifying HTML for efficient processing, and generating high-quality training trajectories.

Challenges Addressed

AutoWebGLM's design directly confronts the primary hurdles in web navigation automation:

  • Unified Action Space: It defines a comprehensive action space that enables consistent interaction across a wide range of websites (a hypothetical code sketch follows this list).
  • HTML Simplification: An algorithm condenses HTML content while preserving essential information, keeping webpages within the model's token-length constraints.
  • High-quality Training Trajectories: A combination of model-assisted and manual annotation produces a dataset for training robust web navigating agents capable of accurate inference and error correction.
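
Since the summary describes the action space only at a high level, here is a minimal sketch of what a unified browser action space can look like in code. The action names, fields, and serialization format are illustrative assumptions for this sketch, not AutoWebGLM's published interface.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActionType(Enum):
    """Illustrative browser operations a web agent might emit."""
    CLICK = "click"
    TYPE = "type"
    SCROLL = "scroll"
    GO_BACK = "go_back"
    GO_TO_URL = "go_to_url"
    ANSWER = "answer"  # terminate and report a result


@dataclass
class Action:
    """One agent step: an operation plus its arguments."""
    kind: ActionType
    element_id: Optional[int] = None  # target element in the simplified page
    text: Optional[str] = None        # text to type, a URL, or a final answer

    def to_command(self) -> str:
        """Serialize to the flat string form an LLM could be trained to emit."""
        parts = [self.kind.value]
        if self.element_id is not None:
            parts.append(f"#{self.element_id}")
        if self.text is not None:
            parts.append(repr(self.text))
        return " ".join(parts)


# Example trajectory: click element 12, then type a query into element 7.
for step in [
    Action(ActionType.CLICK, element_id=12),
    Action(ActionType.TYPE, element_id=7, text="open-source web agents"),
]:
    print(step.to_command())
```

Keeping every action expressible as a short flat command is what lets a single decoding format carry over to arbitrary websites.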

Methodological Insights

The foundation of AutoWebGLM lies in its methodological innovations:

  • HTML Representation: An HTML simplification algorithm, inspired by human web browsing patterns, significantly reduces the complexity and verbosity of webpages for model comprehension (a minimal code sketch follows this list).
  • Hybrid Human-AI Data Construction: Combining model-assisted and manual annotation enables the rapid assembly of a rich training dataset, refining the model's understanding of web operations and decisions.
  • Curriculum Learning and Reinforcement Approaches: Sequential training stages, combining curriculum learning, reinforcement learning (RL), and rejection sampling finetuning (RFT), progressively improve the model's webpage comprehension, browser operation, and task decomposition.
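
The published pruning rules live in the paper and repository rather than in this summary; as a rough illustration of the idea, the sketch below uses BeautifulSoup (an assumption, not necessarily the paper's tooling) to drop non-content tags, keep interactive elements with their visible labels, and truncate to a character budget that stands in for a token limit.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Tags that rarely matter for navigation; dropped outright.
NOISE_TAGS = ["script", "style", "svg", "noscript", "meta", "link"]
# Elements the agent may need to act on.
INTERACTIVE_TAGS = ["a", "button", "input", "select", "textarea"]


def simplify_html(html: str, max_chars: int = 4000) -> str:
    """Condense a page into a numbered element list an LLM can consume.

    Illustrative only, not AutoWebGLM's published algorithm: keep
    interactive elements and their visible labels, discard everything
    else, and truncate to a character budget as a token-limit stand-in.
    """
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(NOISE_TAGS):
        tag.decompose()

    lines = []
    for idx, el in enumerate(soup.find_all(INTERACTIVE_TAGS)):
        text = " ".join(el.get_text(" ", strip=True).split())
        fallback = el.get("placeholder") or el.get("aria-label") or el.get("href") or ""
        lines.append(f"[{idx}] <{el.name}> {(text or fallback)[:80]}")

    return "\n".join(lines)[:max_chars]


page = """<html><body>
  <script>var x = 1;</script>
  <a href="/docs">Documentation</a>
  <input placeholder="Search papers" />
  <button>Submit</button>
</body></html>"""
print(simplify_html(page))
# [0] <a> Documentation
# [1] <input> Search papers
# [2] <button> Submit
```

The numeric IDs double as the element handles that commands like `click #12` refer back to, tying the simplified page to the action space sketched earlier.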

Dataset and Benchmark Development

A notable contribution of AutoWebGLM's development is the construction of AutoWebBench, a bilingual (English and Chinese) benchmark that addresses the need for comprehensive evaluation tools in web navigation research. This benchmark is designed to assess an agent's performance in navigating and interacting with real-world webpages, offering insights into the practical applicability of AI-powered web agents.
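
The released data defines AutoWebBench's actual schema; purely as an illustration of what a step-level navigation task can look like, a record might pair an instruction with a gold action trace (every field name below is hypothetical):

```python
# A hypothetical AutoWebBench-style record; field names are illustrative,
# not the released schema.
task = {
    "task_id": "en-0001",
    "language": "en",       # the benchmark is bilingual: English and Chinese
    "instruction": "Find the latest release notes on the project site.",
    "start_url": "https://example.org",
    "reference_actions": [  # gold step-by-step trace to score against
        "click #12",
        "click #3",
    ],
}
```

Scoring agent predictions against such traces is sketched after the empirical findings below.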

Empirical Evaluations and Findings

Extensive testing of AutoWebGLM across multiple benchmarks, including the newly developed AutoWebBench, shows superior performance compared to existing LLM-based web navigating agents. The model not only delivers significant improvements on various web navigation tasks but also highlights areas for further research and development.

  • Performance Metrics: AutoWebGLM achieves high success rates across diverse web navigation benchmarks, showcasing its robustness and versatility (a metric sketch follows this list).
  • Challenges in Real-World Navigation: Despite these results, AutoWebGLM's performance also underlines the difficulty of real-world web navigation and the need for continued improvements in model training and environmental understanding.
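
As one concrete reading of "success rate", step-level benchmarks often score the fraction of gold steps an agent reproduces; the sketch below assumes that position-wise exact-match definition, which may differ in detail from the paper's exact scoring.

```python
def step_success_rate(predicted: list[str], reference: list[str]) -> float:
    """Fraction of gold steps reproduced exactly, position by position.

    An illustrative metric, assumed here rather than taken from the
    paper: compare each predicted action string against the gold trace.
    """
    if not reference:
        return 0.0
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)


# The first step matches the gold trace, the second does not: 0.5.
print(step_success_rate(["click #12", "click #3"], ["click #12", "click #5"]))
```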

Concluding Remarks

The introduction of AutoWebGLM marks a pivotal advancement in the field of AI-powered web navigation. By addressing fundamental challenges and integrating innovative training methodologies, AutoWebGLM sets a new standard for the development of intelligent web navigating agents. The AutoWebBench benchmark further enriches research resources, paving the way for future innovations in AI-driven web interactions. As web navigation continues to evolve, AutoWebGLM represents a significant step forward in harnessing the potential of LLMs to navigate the vast expanse of the internet effectively.

Authors

  1. Hanyu Lai
  2. Xiao Liu
  3. Iat Long Iong
  4. Shuntian Yao
  5. Yuxuan Chen
  6. Pengbo Shen
  7. Hao Yu
  8. Hanchen Zhang
  9. Xiaohan Zhang
  10. Yuxiao Dong
  11. Jie Tang