Cradle: Empowering Foundation Agents Towards General Computer Control (2403.03186v3)
Abstract: Despite their success in specific scenarios, existing foundation agents still struggle to generalize across virtual scenarios, largely because environments are encapsulated in dramatically different ways, with manually designed observation and action spaces. To address this issue, we propose the General Computer Control (GCC) setting, which restricts foundation agents to interacting with software through the most unified and standardized interface: screenshots as input and keyboard and mouse actions as output. We introduce Cradle, a modular and flexible LMM-powered framework, as a preliminary attempt towards GCC. Built around six key modules, Cradle understands input screenshots and, after high-level planning, outputs executable code for low-level keyboard and mouse control, so it can interact with any software and complete long-horizon complex tasks without relying on any built-in APIs. Experimental results show that Cradle exhibits remarkable generalizability and impressive performance across four previously unexplored commercial video games, five software applications, and a comprehensive benchmark, OSWorld. Cradle is the first framework to enable foundation agents to follow the main storyline and complete 40-minute-long real missions in the complex AAA game Red Dead Redemption 2 (RDR2). Cradle can also create a city of a thousand people in Cities: Skylines, farm and harvest parsnips in Stardew Valley, and trade and bargain in Dealer's Life 2 with a maximal weekly total profit of 87%. Beyond operating everyday software such as Chrome, Outlook, and Feishu, Cradle can also edit images and videos with Meitu and CapCut. By making it easy to convert any software, especially complex games, into a benchmark for evaluating agents' abilities and collecting further data, Cradle greatly extends the reach of foundation agents, paving the way for generalist agents.
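The GCC interface the abstract describes reduces to an observe-plan-act loop: capture raw pixels, ask an LMM to plan, and execute the code it emits against keyboard/mouse primitives. Below is a minimal, illustrative sketch of that loop in Python, assuming the widely used `pyautogui` library for screenshots and input control; `query_lmm` is a hypothetical placeholder for the multimodal model call, not Cradle's actual API, and the real framework interposes its six key modules between observation and action.

```python
# Minimal sketch of the GCC loop: screenshots in, executable keyboard/mouse
# code out. `query_lmm` is a hypothetical placeholder, not part of Cradle;
# pyautogui is assumed here for OS-level screenshot and input primitives.
import time

import pyautogui  # pip install pyautogui


def query_lmm(screenshot_path: str, task: str) -> str:
    """Placeholder for the LMM call: given the latest screenshot and the
    task description, return a snippet of Python driving pyautogui."""
    raise NotImplementedError("Plug in a multimodal model API here.")


def gcc_step(task: str, step: int) -> None:
    path = f"screenshot_{step:04d}.png"
    pyautogui.screenshot(path)            # unified observation: raw pixels
    action_code = query_lmm(path, task)   # high-level planning by the LMM
    # The returned skill code uses only low-level primitives, e.g.:
    #   pyautogui.moveTo(640, 360); pyautogui.click()
    #   pyautogui.press("w")
    exec(action_code, {"pyautogui": pyautogui})


def run(task: str, max_steps: int = 10) -> None:
    for step in range(max_steps):
        gcc_step(task, step)
        time.sleep(1.0)  # let the UI settle before the next observation


if __name__ == "__main__":
    run("Open Chrome and search for 'Cradle GCC'")
```

The key design choice is that the action space is executable code over keyboard/mouse primitives rather than a fixed, environment-specific action set, which is what lets the same loop drive a AAA game and an office application alike.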