Cradle: Empowering Foundation Agents Towards General Computer Control (2403.03186v3)

Published 5 Mar 2024 in cs.AI

Abstract: Despite the success in specific scenarios, existing foundation agents still struggle to generalize across various virtual scenarios, mainly due to the dramatically different encapsulations of environments with manually designed observation and action spaces. To handle this issue, we propose the General Computer Control (GCC) setting to restrict foundation agents to interact with software through the most unified and standardized interface, i.e., using screenshots as input and keyboard and mouse actions as output. We introduce Cradle, a modular and flexible LMM-powered framework, as a preliminary attempt towards GCC. Enhanced by six key modules, Cradle can understand input screenshots and output executable code for low-level keyboard and mouse control after high-level planning, so that Cradle can interact with any software and complete long-horizon complex tasks without relying on any built-in APIs. Experimental results show that Cradle exhibits remarkable generalizability and impressive performance across four previously unexplored commercial video games, five software applications, and a comprehensive benchmark, OSWorld. Cradle is the first to enable foundation agents to follow the main storyline and complete 40-minute-long real missions in the complex AAA game Red Dead Redemption 2 (RDR2). Cradle can also create a city of a thousand people in Cities: Skylines, farm and harvest parsnips in Stardew Valley, and trade and bargain with a maximal weekly total profit of 87% in Dealer's Life 2. Cradle can not only operate daily software, like Chrome, Outlook, and Feishu, but also edit images and videos using Meitu and CapCut. Cradle greatly extends the reach of foundation agents by enabling the easy conversion of any software, especially complex games, into benchmarks to evaluate agents' various abilities and facilitate further data collection, thus paving the way for generalist agents.

Summary

  • The paper introduces a novel framework for General Computer Control by deploying a multimodal agent in RDR2, emphasizing self-reflection, task inference, and skill curation.
  • It demonstrates the agent’s robust reasoning capabilities across diverse tasks such as navigation and combat within a highly interactive gaming environment.
  • Findings highlight limitations in GPT-4V’s spatial perception and long-term context handling, outlining clear directions for future advances in large multimodal models.

Exploring General Computer Control Through a Multimodal Agent in Red Dead Redemption 2

Introduction to General Computer Control (GCC)

The concept of General Computer Control (GCC) introduces a compelling paradigm for building foundation agents capable of mastering any computer task. The agent interacts with the computer only through standard interfaces: screen images (and possibly audio) as inputs, and keyboard and mouse operations as outputs. The main challenges in realizing GCC are interpreting multimodal observations for informed decision-making, executing precise keyboard and mouse control, maintaining long-term memory to support sophisticated reasoning, and enabling efficient exploration and self-improvement by the agent.
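The interface itself can be stated concretely. Below is a minimal sketch of a GCC-style control loop, assuming the pyautogui library for screen capture and keyboard/mouse control (the paper does not mandate a specific library); the agent callable is a hypothetical stand-in for whatever LMM-backed policy maps a screenshot to an action.

  import time
  import pyautogui  # cross-platform screen capture and keyboard/mouse control

  def observe():
      """Capture the current screen as the agent's only observation."""
      return pyautogui.screenshot()  # returns a PIL Image

  def act(action):
      """Execute one low-level keyboard/mouse action described as a dict."""
      if action["type"] == "click":
          pyautogui.click(action["x"], action["y"])
      elif action["type"] == "press":
          pyautogui.press(action["key"])

  def run(agent, steps=10, delay=1.0):
      """Illustrative loop: screenshots in, keyboard/mouse actions out."""
      for _ in range(steps):
          frame = observe()
          action = agent(frame)   # the policy decides from pixels alone
          act(action)
          time.sleep(delay)       # let the software render the effect

Nothing in this loop depends on built-in APIs of the target software, which is the defining property of the GCC setting.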

The Framework

In response to the complex requirements of GCC, Cradle presents a novel, modular architecture that prioritizes strong reasoning abilities. The framework includes self-reflection, task inference, and skill curation capabilities to ensure the agent's adaptability across tasks and its capacity for self-improvement. To validate its potential, Cradle is deployed in the challenging environment of Red Dead Redemption 2 (RDR2), a complex AAA game known for its dense and interactive gameplay. This deployment represents a significant step toward GCC, demonstrating the framework's ability to navigate, understand, and interact within such a demanding setting.
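For illustration only, one way a Cradle-style decision step could be organized is sketched below. This is a hedged sketch: the lmm callable, the prompt strings, and the Memory container are hypothetical stand-ins rather than the paper's actual implementation; only the module names (self-reflection, task inference, skill curation) come from the summary above.

  from dataclasses import dataclass, field

  @dataclass
  class Memory:
      # Hypothetical long-term store of past steps and curated skills.
      episodes: list = field(default_factory=list)   # (subtask, reflection) pairs
      skills: dict = field(default_factory=dict)     # subtask -> executable code string

  def step(lmm, memory, screenshot, goal):
      """One illustrative decision step; lmm(prompt, image) is a stand-in LMM call."""
      recent = "; ".join(s + ": " + r for s, r in memory.episodes[-3:])

      # Self-reflection: did the previous action achieve what was planned?
      reflection = lmm("Recent steps: " + recent +
                       ". Reflect on whether the last action succeeded.", screenshot)

      # Task inference: pick the next subtask toward the overall goal.
      subtask = lmm("Goal: " + goal + ". Reflection: " + reflection +
                    ". State the next subtask.", screenshot)

      # Skill curation: reuse a stored skill or ask the LMM to write a new one.
      skill = memory.skills.get(subtask) or lmm(
          "Write Python keyboard/mouse code to perform: " + subtask, screenshot)

      memory.skills[subtask] = skill
      memory.episodes.append((subtask, reflection))
      return skill   # executable code for low-level control, per the abstract

In practice the returned skill code would be executed through the same keyboard/mouse interface sketched earlier, and its outcome would feed back into self-reflection on the next step.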

Empirical Studies and Challenges

The deployment of Cradle in RDR2 targets two primary missions from the game's storyline, covering tasks that range from basic navigation to combat. The agent's performance on these tasks highlights the effectiveness of the framework's reasoning abilities and its potential to generalize across different computer tasks. However, this exploration also uncovers limitations of GPT-4V, particularly in spatial perception, icon understanding, history processing, and world understanding, suggesting areas for future research and development.

Limitations of GPT-4V

While GPT-4V offers powerful multimodal capabilities, its current iteration exhibits limitations that affect the agent's performance in complex environments like RDR2. Issues include difficulty in spatial reasoning, identifying game-specific icons, handling longer contexts without hallucination, and understanding the game's world model. These limitations necessitate external tools and modifications to improve agent interaction and decision-making, pointing to the necessity of advancements in large multimodal models (LMMs) to better support GCC in highly interactive and visually dense environments.
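As one concrete example of such an external tool, classic template matching can compensate for weak icon grounding by locating a known UI icon directly in pixels. The sketch below assumes OpenCV; the function name and threshold are illustrative choices, not taken from the paper.

  import cv2

  def locate_icon(screenshot_bgr, icon_bgr, threshold=0.8):
      """Find a known icon on screen via template matching; returns a click point or None."""
      result = cv2.matchTemplate(screenshot_bgr, icon_bgr, cv2.TM_CCOEFF_NORMED)
      _, max_val, _, max_loc = cv2.minMaxLoc(result)
      if max_val < threshold:
          return None                       # icon not confidently present
      h, w = icon_bgr.shape[:2]
      x, y = max_loc
      return (x + w // 2, y + h // 2)       # centre of the match, suitable for a click

A helper like this hands the agent reliable pixel coordinates for icons the LMM misidentifies, leaving higher-level reasoning to the model itself.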

Conclusion and Future Directions

The development and deployment of Cradle within the context of GCC, and its application to RDR2, mark an important advance toward agents capable of general computer control. By addressing the challenges identified in the framework's current implementation, particularly those stemming from the limitations of GPT-4V, future work can enhance the agent's reasoning capabilities and generalizability. This progress paves the way for foundation agents that can competently navigate and interact with a vast range of computer-based tasks, bringing us closer to artificial general intelligence in the digital domain. Future work will also explore extending Cradle to more games and software applications, incorporating audio inputs, and developing interactive benchmarks to comprehensively measure the capabilities of foundation agents.
