OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web (2402.17553v3)

Published 27 Feb 2024 in cs.AI, cs.CL, cs.CV, and cs.HC

Abstract: For decades, human-computer interaction has fundamentally been manual. Even today, almost all productive work done on the computer necessitates human input at every step. Autonomous virtual agents represent an exciting step in automating many of these menial tasks. Virtual agents would empower users with limited technical proficiency to harness the full possibilities of computer systems. They could also enable the efficient streamlining of numerous computer tasks, ranging from calendar management to complex travel bookings, with minimal human intervention. In this paper, we introduce OmniACT, the first-of-a-kind dataset and benchmark for assessing an agent's capability to generate executable programs to accomplish computer tasks. Our scope extends beyond traditional web automation, covering a diverse range of desktop applications. The dataset consists of fundamental tasks such as "Play the next song", as well as longer horizon tasks such as "Send an email to John Doe mentioning the time and place to meet". Specifically, given a pair of screen image and a visually-grounded natural language task, the goal is to generate a script capable of fully executing the task. We run several strong baseline LLM agents on our benchmark. The strongest baseline, GPT-4, performs the best on our benchmark. However, its performance level still reaches only 15% of the human proficiency in generating executable scripts capable of completing the task, demonstrating the challenge of our task for conventional web agents. Our benchmark provides a platform to measure and evaluate the progress of LLM agents in automating computer tasks and motivates future work towards building multimodal models that bridge LLMs and the visual grounding of computer screens.

OmniACT: Setting New Benchmarks for Multimodal Autonomous Agents in Desktop and Web Environments

Overview

Recent advancements in AI have aimed to simplify human-computer interactions by developing autonomous virtual agents capable of executing tasks with minimal human input. These tasks, ranging from mundane activities like playing music to more complex sequences such as sending emails, significantly depend on the agent's ability to interpret natural language instructions and transform them into executable actions. Despite the proliferation of such intelligent systems, the gap between human proficiency and autonomous agents remains vast, particularly in multimodal contexts involving both desktop and web applications. To bridge this gap, the paper introduces OmniACT, a novel dataset and benchmark designed to assess the capabilities of autonomous agents in generating executable programs for comprehensive computer tasks based on visually-grounded natural language instructions.

OmniACT Dataset: A New Frontier

The OmniACT dataset is unprecedented in scope, encompassing a wide array of tasks across desktop and web applications. With over 9.8K task pairs, each combining a screenshot of a user interface (UI) with a corresponding natural language instruction, OmniACT extends well beyond conventional web automation. Its distinctive challenge lies in requiring agents to operate across different operating systems (macOS, Windows, Linux) as well as web domains, making it the first dataset to cover such a diverse range of applications for autonomous agents.
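
For concreteness, a single task pair in this style could be represented as in the sketch below. This is a minimal illustration, not the dataset's actual schema; the field names, paths, and coordinate values are hypothetical.

```python
# Hypothetical illustration of one OmniACT-style task pair.
# Field names and values are invented for this sketch and do not
# reflect the dataset's actual schema.
task_pair = {
    "screenshot": "screens/music_player.png",   # UI screenshot the agent observes
    "instruction": "Play the next song",        # visually grounded natural language task
    "platform": "macOS",                        # desktop OS or web domain the task targets
    "gold_script": "\n".join([                  # executable script the agent should produce
        "import pyautogui",
        "pyautogui.click(1432, 871)",           # coordinates of the 'next track' button
    ]),
}

print(task_pair["instruction"])
```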

Methodological Insights

The paper lays out an exhaustive methodology for dataset preparation, focusing on compiling tasks that span multiple domains across both desktop and web applications. By carefully annotating UI elements and collecting tasks through human annotation, the researchers ensured the dataset's relevance and complexity. Key to this process was pairing each task with an executable PyAutoGUI-based script, offering a pragmatic way to automate user interactions across varied applications.
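
As a rough sketch of what such a script can look like, the snippet below strings together standard PyAutoGUI calls (click, write, press, hotkey) for a hypothetical "send an email" task; the coordinates, field order, and keyboard shortcut are invented for illustration rather than taken from the dataset.

```python
# Sketch of a PyAutoGUI-style action script for a longer-horizon task such as
# "Send an email to John Doe mentioning the time and place to meet".
# Coordinates and keystrokes are hypothetical; real OmniACT scripts are grounded
# in the specific screenshot that accompanies each task.
import pyautogui

pyautogui.click(220, 140)                        # click the "Compose" button
pyautogui.write("john.doe@example.com")          # fill in the recipient field
pyautogui.press("tab")                           # move to the subject field
pyautogui.write("Meeting")                       # type a subject line
pyautogui.press("tab")                           # move to the message body
pyautogui.write("Let's meet at 3 pm at the cafe.")
pyautogui.hotkey("ctrl", "enter")                # send the email
```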

Performance Benchmarking

The paper evaluates several state-of-the-art LLM-based agents, including GPT-4, on the OmniACT benchmark. Although GPT-4 outperforms the other baselines, it reaches only 15% of human proficiency, underscoring the significant challenge the OmniACT tasks pose to current AI models. This finding not only illustrates the dataset's complexity but also highlights the need for multimodal models that can better integrate visual and textual information.

Implications and Future Directions

The implications of this research are twofold. Practically, improving autonomous agents' performance on OmniACT tasks could revolutionize how we interact with computers, making technology more accessible to users with limited technical skills and streamlining routine tasks. Theoretically, the research underscores the importance of developing more sophisticated multimodal models that integrate visual cues with natural language processing. As such models evolve, we can anticipate significant breakthroughs in AI's ability to understand and navigate complex, multimodal environments.

Concluding Thoughts

In conclusion, OmniACT represents a substantial step forward in the quest to develop generalist autonomous agents capable of executing a broad spectrum of computer tasks. By providing a challenging benchmark, the dataset not only facilitates the evaluation of current AI models but also sets a clear direction for future research. Enhancing the capabilities of autonomous agents in this domain will undoubtedly have far-reaching implications, from the democratization of technology to the automation of laborious tasks, heralding a new era in human-computer interaction.

Authors (7)
  1. Raghav Kapoor
  2. Yash Parag Butala
  3. Melisa Russak
  4. Jing Yu Koh
  5. Kiran Kamble
  6. Ruslan Salakhutdinov
  7. Waseem AlShikh