OS-ATLAS: A Foundation Action Model for Generalist GUI Agents (2410.23218v1)

Published 30 Oct 2024 in cs.CL, cs.CV, and cs.HC

Abstract: Existing efforts in building GUI agents heavily rely on the availability of robust commercial Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are often reluctant to use open-source VLMs due to their significant performance lag compared to their closed-source counterparts, particularly in GUI grounding and Out-Of-Distribution (OOD) scenarios. To facilitate future research in this area, we developed OS-Atlas - a foundational GUI action model that excels at GUI grounding and OOD agentic tasks through innovations in both data and modeling. We have invested significant engineering effort in developing an open-source toolkit for synthesizing GUI grounding data across multiple platforms, including Windows, Linux, MacOS, Android, and the web. Leveraging this toolkit, we are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements. This dataset, combined with innovations in model training, provides a solid foundation for OS-Atlas to understand GUI screenshots and generalize to unseen interfaces. Through extensive evaluation across six benchmarks spanning three different platforms (mobile, desktop, and web), OS-Atlas demonstrates significant performance improvements over previous state-of-the-art models. Our evaluation also uncovers valuable insights into continuously improving and scaling the agentic capabilities of open-source VLMs.

Insights into OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

The paper "OS-ATLAS: A Foundation Action Model for Generalist GUI Agents" presents a comprehensive paper on improving graphical user interface (GUI) agents through innovative approaches in data collection and modeling strategies. This research is particularly relevant given the current reliance on closed-source Vision-LLMs (VLMs) like GPT-4o and GeminiProVision in the field.

Summary of Key Contributions

The authors have developed OS-Atlas, an open-source foundational action model that addresses critical issues in GUI grounding and OOD generalization. The paper identifies two primary shortcomings of existing VLM-based GUI action models: a shortage of GUI-specific pre-training data, and conflicting action names across datasets from different platforms.

  1. Cross-platform GUI Grounding Data Synthesis: The authors have created a substantial open-source toolkit for synthesizing GUI grounding data spanning Windows, Linux, MacOS, Android, and the web. This toolkit underpins the largest open-source cross-platform GUI grounding corpus to date, containing over 13 million GUI elements. This diverse dataset enhances the model's capacity to generalize to unseen interfaces (a hypothetical record format is sketched after this list).
  2. Unified Action Space for Training: To address the heterogeneity of action datasets in both content and format, the authors define a unified action space that resolves conflicts in action naming and thereby improves generalization. This strategy aligns action representations across varied datasets, strengthening OS-Atlas's adaptability and performance in multi-platform environments (see the action-space sketch after this list).
  3. Comprehensive Evaluation: OS-Atlas’s performance was evaluated across six benchmarks on three platforms: mobile, desktop, and web. The results demonstrate significant improvements over state-of-the-art models, confirming the potential of OS-Atlas as a robust alternative to closed-source VLMs in GUI agent development.
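
To make the grounding-data contribution concrete, the following is a minimal sketch of what a single cross-platform grounding example might look like. The class name and field layout are illustrative assumptions for this summary, not the schema of the released corpus.

```python
# Hypothetical sketch of one GUI-grounding example; field names are
# assumptions for illustration, not the released corpus schema.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class GroundingExample:
    screenshot_path: str                     # rendered GUI screenshot
    platform: str                            # "windows", "linux", "macos", "android", or "web"
    instruction: str                         # referring expression for the target element
    bbox: Tuple[float, float, float, float]  # target element box, normalized (x1, y1, x2, y2)

# Example record pairing a natural-language reference with its on-screen location.
example = GroundingExample(
    screenshot_path="screens/settings_page.png",
    platform="web",
    instruction="open the Privacy settings tab",
    bbox=(0.12, 0.34, 0.27, 0.39),
)
```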

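The unified action space can likewise be pictured as a small shared vocabulary onto which dataset-specific action names are mapped before training. The sketch below is an assumption about the general shape of such a mapping; the action names are not taken from the paper's actual action set.

```python
# Hypothetical mapping from heterogeneous, platform-specific action names
# onto a small unified vocabulary; all names here are illustrative only.
UNIFIED_ACTIONS = {"CLICK", "TYPE", "SCROLL", "LONG_PRESS", "NAVIGATE_BACK"}

DATASET_TO_UNIFIED = {
    "tap": "CLICK",            # mobile-style datasets
    "left_click": "CLICK",     # desktop-style datasets
    "input_text": "TYPE",
    "type": "TYPE",
    "swipe": "SCROLL",
    "scroll_down": "SCROLL",
    "long_press": "LONG_PRESS",
    "go_back": "NAVIGATE_BACK",
}

# Sanity check: every dataset-specific name maps into the unified vocabulary.
assert set(DATASET_TO_UNIFIED.values()) <= UNIFIED_ACTIONS

def normalize_action(raw_name: str) -> str:
    """Map a dataset-specific action name onto the unified vocabulary."""
    try:
        return DATASET_TO_UNIFIED[raw_name.lower()]
    except KeyError:
        raise ValueError(f"no unified mapping for action {raw_name!r}")

# Two differently named actions from different platforms collapse onto the
# same unified action, removing the naming conflict at training time.
assert normalize_action("tap") == normalize_action("left_click") == "CLICK"
```
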
Numerical Performance and Bold Claims

The reported results show that OS-Atlas achieves state-of-the-art performance across multiple complex benchmarks. This advance suggests that OS-Atlas can feasibly replace commercial models such as GPT-4o in the context of GUI agents. The authors attribute this performance edge to the novel grounding data collection process and the refined training methodology.

Implications and Future Directions

The implications of this work are significant, both practically and theoretically. Practically, OS-Atlas offers an open-source path for developing generalist GUI agents, driving innovation and accessibility in the field by reducing dependence on proprietary VLMs. Theoretically, the paper sets a precedent for integrating broad cross-platform GUI data into the VLM training pipeline, which could propel future research on agent generalization across diverse digital environments.

Looking ahead, the emphasis on leveraging the breadth of the synthesized data and the unified action space is likely to continue. Further scaling of the grounding data, together with refined training techniques, could improve performance on more nuanced GUI tasks. Addressing advanced interaction tasks may also require adaptive learning mechanisms that can dynamically adjust to new environments and tasks.

In conclusion, the OS-Atlas framework represents a significant advance in the creation of GUI agents, driving open-source efforts to compete with and complement existing commercial models through innovative data and methodological strategies. This paper serves as a potential keystone for future research aimed at developing more efficient, generalizable, and open-access digital agents capable of cross-platform interactions.

Authors (11)
  1. Zhiyong Wu (171 papers)
  2. Zhenyu Wu (112 papers)
  3. Fangzhi Xu (22 papers)
  4. Yian Wang (26 papers)
  5. Qiushi Sun (26 papers)
  6. Chengyou Jia (17 papers)
  7. Kanzhi Cheng (14 papers)
  8. Zichen Ding (9 papers)
  9. Liheng Chen (13 papers)
  10. Paul Pu Liang (103 papers)
  11. Yu Qiao (563 papers)