AutoGLM: Autonomous Foundation Agents for GUIs (2411.00820v1)

Published 28 Oct 2024 in cs.HC, cs.AI, cs.CL, and cs.LG

Abstract: We present AutoGLM, a new series in the ChatGLM family, designed to serve as foundation agents for autonomous control of digital devices through Graphical User Interfaces (GUIs). While foundation models excel at acquiring human knowledge, they often struggle with decision-making in dynamic real-world environments, limiting their progress toward artificial general intelligence. This limitation underscores the importance of developing foundation agents capable of learning through autonomous environmental interactions by reinforcing existing models. Focusing on Web Browser and Phone as representative GUI scenarios, we have developed AutoGLM as a practical foundation agent system for real-world GUI interactions. Our approach integrates a comprehensive suite of techniques and infrastructures to create deployable agent systems suitable for user delivery. Through this development, we have derived two key insights: First, the design of an appropriate "intermediate interface" for GUI control is crucial, enabling the separation of planning and grounding behaviors, which require distinct optimization for flexibility and accuracy respectively. Second, we have developed a novel progressive training framework that enables self-evolving online curriculum reinforcement learning for AutoGLM. Our evaluations demonstrate AutoGLM's effectiveness across multiple domains. For web browsing, AutoGLM achieves a 55.2% success rate on VAB-WebArena-Lite (improving to 59.1% with a second attempt) and 96.2% on OpenTable evaluation tasks. In Android device control, AutoGLM attains a 36.2% success rate on AndroidLab (VAB-Mobile) and 89.7% on common tasks in popular Chinese APPs.


Summary

  • The paper presents AutoGLM, an extension of the ChatGLM model designed as an autonomous foundation agent capable of interacting with graphical user interfaces like web browsers and Android devices.
  • AutoGLM introduces an intermediate interface to separate planning and grounding, alongside a progressive online reinforcement learning framework for self-evolving training and failure recovery.
  • Evaluations show AutoGLM achieves success rates up to 59.1% on VAB-WebArena-Lite and 89.7% on human-evaluated common Android tasks, demonstrating significant improvement over existing models like GPT-4o and Claude-3.5-Sonnet.

AutoGLM: Autonomous Foundation Agents for GUIs

The paper presents AutoGLM, a new extension of the ChatGLM model family aimed at building autonomous foundation agents for graphical user interfaces (GUIs). It addresses a significant shortcoming of contemporary foundation models: although they excel at acquiring human knowledge, they often struggle with decision-making in dynamic real-world environments. AutoGLM targets this gap by learning through autonomous interaction with GUIs, which the authors frame as a necessary step toward artificial general intelligence (AGI).

Key Concepts and Methodology

AutoGLM is introduced as a foundation agent system designed for real-world GUI interactions, with web browsers and Android devices as its primary scenarios. The paper describes two core methodological innovations:

  1. Intermediate Interface Design: The work disentangles planning and grounding behaviors in GUI control. By introducing an "intermediate interface" between the two, the planner can be optimized for flexibility while the grounder is optimized for accuracy, improving both error recovery and action precision (a minimal sketch of this separation follows the list).
  2. Progressive Training Framework: A self-evolving online curriculum reinforcement learning (RL) mechanism trains the agent on tasks of progressively increasing complexity and emphasizes learning from failures, which is difficult to achieve through offline training alone.
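
Below is a minimal, hypothetical Python sketch of such an intermediate interface: the planner emits a natural-language step, and a separate grounder resolves it into a concrete GUI operation. All class and function names here are illustrative assumptions, not AutoGLM's actual API.

```python
# Hypothetical sketch of an "intermediate interface" separating planning from
# grounding. Class and method names are illustrative, not AutoGLM's real API.
from dataclasses import dataclass

@dataclass
class NLAction:
    """Planner output: a natural-language description of the next step."""
    description: str            # e.g. "click the 'Book now' button"

@dataclass
class GUIAction:
    """Grounder output: a concrete, executable GUI operation."""
    kind: str                   # "click", "type", "scroll", ...
    x: int
    y: int
    text: str = ""

class Planner:
    """Large-model planner: reasons over the task and screen, emits an NLAction."""
    def next_action(self, task: str, screen_text: str) -> NLAction:
        # In the real system this would be an LLM call; stubbed out here.
        return NLAction(description=f"click the element most relevant to: {task}")

class Grounder:
    """Grounding model: maps an NLAction onto concrete screen coordinates."""
    def ground(self, action: NLAction, screenshot) -> GUIAction:
        # In the real system this would be a visual grounding model; stubbed out here.
        return GUIAction(kind="click", x=512, y=384)

def step(task: str, screen_text: str, screenshot) -> GUIAction:
    # The NLAction is the intermediate interface: the planner is tuned for
    # flexible reasoning, the grounder for pixel-level accuracy, and either
    # component can be retrained or replaced independently of the other.
    plan = Planner().next_action(task, screen_text)
    return Grounder().ground(plan, screenshot)
```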

The authors combine a broad suite of techniques for building capable foundation agents, including behavior cloning, pre-training enhancement, curriculum learning, and reinforcement learning. This comprehensive approach allows AutoGLM to tackle both the planning and grounding challenges commonly encountered in GUI contexts.
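
The loop below is a heavily simplified, hypothetical sketch of what a self-evolving online curriculum RL procedure can look like: tasks for the next round are derived from recent failures, rollouts are scored by an outcome evaluator, and the policy is updated on the collected batch. All names are illustrative; the paper's actual pipeline (in the spirit of WebRL) is more elaborate.

```python
# Heavily simplified, hypothetical sketch of self-evolving online curriculum RL:
# new tasks are proposed from recent failures, rollouts are scored by an outcome
# evaluator, and the policy is updated on the collected batch.
import random

def evaluate(trajectory, task) -> bool:
    """Outcome reward: did the trajectory complete the task? (stubbed)"""
    return random.random() < 0.3

def rollout(policy, task):
    """Run the agent on one task; return the trajectory and a success flag."""
    trajectory = [policy(task)]              # in reality: a multi-step GUI episode
    return trajectory, evaluate(trajectory, task)

def propose_tasks(failed_tasks, seed_tasks, n=4):
    """Self-evolving curriculum: derive the next batch from recent failures.
    (The actual system has a model rewrite failed instructions into new variants.)"""
    pool = failed_tasks or seed_tasks
    return [random.choice(pool) for _ in range(n)]

def train(policy, update_fn, seed_tasks, num_rounds=5):
    failed = []
    for _ in range(num_rounds):
        tasks = propose_tasks(failed, seed_tasks)        # curriculum tracks current ability
        batch = [(t, *rollout(policy, t)) for t in tasks]
        update_fn(policy, batch)                         # e.g. filtered BC / policy gradient
        failed = [t for t, _, ok in batch if not ok]     # failures seed the next round

# Usage with trivial stubs:
train(policy=lambda task: ("observation", "click"),
      update_fn=lambda p, b: None,
      seed_tasks=["book a table for two on OpenTable"])
```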

Results and Evaluation

The research findings demonstrate AutoGLM's effectiveness across multiple domains of GUI interactions:

  • Web Browsing: AutoGLM achieved a task success rate (SR) of 55.2% on the VAB-WebArena-Lite benchmark, with potential improvement to 59.1% upon a second attempt. Notably, it obtained a 96.2% SR on OpenTable real-world evaluation tasks.
  • Android Device Control: AutoGLM showed a 36.2% SR on AndroidLab (previously VAB-Mobile) and 89.7% SR in human evaluations of common tasks within popular Chinese apps.

These results indicate significant performance improvements over existing models such as GPT-4o and Claude-3.5-Sonnet. The AutoGLM system is also made accessible through the Qingyan Browser Plugin for web applications and an Android AccessibilityService interface for device control testing.
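
For context on how grounded actions reach a device, the snippet below shows one common way to drive click/type primitives on an Android device from a host machine via adb. This is a stand-in for illustration only; the paper's own delivery path is an on-device AccessibilityService, not adb.

```python
# Illustrative stand-in only: driving grounded click/type actions on an Android
# device over adb from a host machine. The paper's own delivery path is an
# on-device AccessibilityService, not adb.
import subprocess
from typing import Optional

def adb_tap(x: int, y: int, serial: Optional[str] = None) -> None:
    """Send a tap at screen coordinates (x, y) via `adb shell input tap`."""
    cmd = ["adb"] + (["-s", serial] if serial else []) + ["shell", "input", "tap", str(x), str(y)]
    subprocess.run(cmd, check=True)

def adb_type(text: str, serial: Optional[str] = None) -> None:
    """Type text into the focused field via `adb shell input text` (spaces become %s)."""
    cmd = ["adb"] + (["-s", serial] if serial else []) + ["shell", "input", "text", text.replace(" ", "%s")]
    subprocess.run(cmd, check=True)

# Example: execute a grounded "click" action.
# adb_tap(512, 384)
```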

Implications and Future Directions

The implications of this research are multifaceted. Practically, AutoGLM represents substantial progress toward deployable, user-facing intelligent agents capable of automating complex tasks across widely used digital interfaces. Theoretically, this work contributes to the broader understanding of integrating decision-making capabilities into AI, moving closer to realizing AGI aspirations.

Future developments, as suggested by the authors, will focus on further refining the interface design to enhance modularity and scalability, and on improving reinforcement learning techniques to boost agent adaptability and learning efficiency in diverse environments. Additional experiments with larger and more diverse datasets could further improve AutoGLM's proficiency, helping it move from a prototype toward robust, real-world deployment.

In summary, the paper demonstrates the potential of combining advanced machine learning methods with robust system design to achieve functional interactive agents, marking a noteworthy advancement in GUI-based automation and interaction.