OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web (2402.17553v3)

Published 27 Feb 2024 in cs.AI, cs.CL, cs.CV, and cs.HC

Abstract: For decades, human-computer interaction has fundamentally been manual. Even today, almost all productive work done on the computer necessitates human input at every step. Autonomous virtual agents represent an exciting step in automating many of these menial tasks. Virtual agents would empower users with limited technical proficiency to harness the full possibilities of computer systems. They could also enable the efficient streamlining of numerous computer tasks, ranging from calendar management to complex travel bookings, with minimal human intervention. In this paper, we introduce OmniACT, the first-of-a-kind dataset and benchmark for assessing an agent's capability to generate executable programs to accomplish computer tasks. Our scope extends beyond traditional web automation, covering a diverse range of desktop applications. The dataset consists of fundamental tasks such as "Play the next song", as well as longer horizon tasks such as "Send an email to John Doe mentioning the time and place to meet". Specifically, given a pair of screen image and a visually-grounded natural language task, the goal is to generate a script capable of fully executing the task. We run several strong baseline LLM agents on our benchmark. The strongest baseline, GPT-4, performs the best on our benchmark. However, its performance level still reaches only 15% of the human proficiency in generating executable scripts capable of completing the task, demonstrating the challenge of our task for conventional web agents. Our benchmark provides a platform to measure and evaluate the progress of LLM agents in automating computer tasks and motivates future work towards building multimodal models that bridge LLMs and the visual grounding of computer screens.

OmniACT: Setting New Benchmarks for Multimodal Autonomous Agents in Desktop and Web Environments

Overview

Recent advancements in AI have aimed to simplify human-computer interactions by developing autonomous virtual agents capable of executing tasks with minimal human input. These tasks, ranging from mundane activities like playing music to more complex sequences such as sending emails, significantly depend on the agent's ability to interpret natural language instructions and transform them into executable actions. Despite the proliferation of such intelligent systems, the gap between human proficiency and autonomous agents remains vast, particularly in multimodal contexts involving both desktop and web applications. To bridge this gap, the paper introduces OmniACT, a novel dataset and benchmark designed to assess the capabilities of autonomous agents in generating executable programs for comprehensive computer tasks based on visually-grounded natural language instructions.

OmniACT Dataset: A New Frontier

The OmniACT dataset is unprecedented in scope, encompassing a wide array of tasks across desktop and web applications. With over 9.8K task pairs, each combining a screenshot of a user interface (UI) with a corresponding natural language instruction, OmniACT extends well beyond conventional web automation. Its distinctive challenge lies in requiring agents to operate across different operating systems (macOS, Windows, Linux) as well as web domains, making it the first dataset to cover such a diverse range of applications for autonomous agents.
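
For concreteness, a single task pair in this style could be represented as in the sketch below. This is a minimal illustration, not the dataset's actual schema; the field names, paths, and coordinate values are hypothetical.

```python
# Hypothetical illustration of one OmniACT-style task pair.
# Field names and values are invented for this sketch and do not
# reflect the dataset's actual schema.
task_pair = {
    "screenshot": "screens/music_player.png",   # UI screenshot the agent observes
    "instruction": "Play the next song",        # visually grounded natural language task
    "platform": "macOS",                        # desktop OS or web domain the task targets
    "gold_script": "\n".join([                  # executable script the agent should produce
        "import pyautogui",
        "pyautogui.click(1432, 871)",           # coordinates of the 'next track' button
    ]),
}

print(task_pair["instruction"])
```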

Methodological Insights

The paper lays out an exhaustive methodology for dataset preparation, focusing on compiling tasks that span multiple domains across both desktop and web applications. By carefully annotating UI elements and collecting tasks through human annotation, the researchers ensured the dataset's relevance and complexity. Key to this process was pairing each task with an executable PyAutoGUI-based script, offering a pragmatic way to automate user interactions across varied applications.
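
As a rough sketch of what such a script can look like, the snippet below strings together standard PyAutoGUI calls (click, write, press, hotkey) for a hypothetical "send an email" task; the coordinates, field order, and keyboard shortcut are invented for illustration rather than taken from the dataset.

```python
# Sketch of a PyAutoGUI-style action script for a longer-horizon task such as
# "Send an email to John Doe mentioning the time and place to meet".
# Coordinates and keystrokes are hypothetical; real OmniACT scripts are grounded
# in the specific screenshot that accompanies each task.
import pyautogui

pyautogui.click(220, 140)                        # click the "Compose" button
pyautogui.write("john.doe@example.com")          # fill in the recipient field
pyautogui.press("tab")                           # move to the subject field
pyautogui.write("Meeting")                       # type a subject line
pyautogui.press("tab")                           # move to the message body
pyautogui.write("Let's meet at 3 pm at the cafe.")
pyautogui.hotkey("ctrl", "enter")                # send the email
```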

Performance Benchmarking

The paper evaluates several state-of-the-art LLM-based agents, including GPT-4, on the OmniACT benchmark. Although GPT-4 outperforms the other baselines, it reaches only 15% of human proficiency, underscoring the significant challenge the OmniACT tasks pose to current AI models. This finding not only illustrates the dataset's complexity but also highlights the need for multimodal models that can better integrate visual and textual information.

Implications and Future Directions

The implications of this research are twofold. Practically, improving autonomous agents' performance on OmniACT tasks could revolutionize how we interact with computers, making technology more accessible to users with limited technical skills and streamlining routine tasks. Theoretically, the research underscores the importance of developing more sophisticated multimodal models that integrate visual cues with natural language processing. As such models evolve, we can anticipate significant breakthroughs in AI's ability to understand and navigate complex, multimodal environments.

Concluding Thoughts

In conclusion, OmniACT represents a substantial step forward in the quest to develop generalist autonomous agents capable of executing a broad spectrum of computer tasks. By providing a challenging benchmark, the dataset not only facilitates the evaluation of current AI models but also sets a clear direction for future research. Enhancing the capabilities of autonomous agents in this domain will undoubtedly have far-reaching implications, from the democratization of technology to the automation of laborious tasks, heralding a new era in human-computer interaction.

Authors (7)
  1. Raghav Kapoor
  2. Yash Parag Butala
  3. Melisa Russak
  4. Jing Yu Koh
  5. Kiran Kamble
  6. Ruslan Salakhutdinov
  7. Waseem AlShikh