Cradle: Empowering Foundation Agents Towards General Computer Control (2403.03186v3)
Abstract: Despite their success in specific scenarios, existing foundation agents still struggle to generalize across virtual scenarios, largely because environments are encapsulated in dramatically different ways, with manually designed observation and action spaces. To address this issue, we propose the General Computer Control (GCC) setting, which restricts foundation agents to interacting with software through the most unified and standardized interface: screenshots as input and keyboard and mouse actions as output. We introduce Cradle, a modular and flexible LMM-powered framework, as a preliminary attempt towards GCC. Built around six key modules, Cradle understands input screenshots and, after high-level planning, outputs executable code for low-level keyboard and mouse control, so it can interact with any software and complete long-horizon complex tasks without relying on any built-in APIs. Experimental results show that Cradle exhibits remarkable generalizability and impressive performance across four previously unexplored commercial video games, five software applications, and a comprehensive benchmark, OSWorld. Cradle is the first framework to enable foundation agents to follow the main storyline and complete 40-minute-long real missions in the complex AAA game Red Dead Redemption 2 (RDR2). Cradle can also create a city of a thousand people in Cities: Skylines, farm and harvest parsnips in Stardew Valley, and trade and bargain in Dealer's Life 2 with a maximal weekly total profit of 87%. Beyond operating everyday software such as Chrome, Outlook, and Feishu, Cradle can also edit images and videos with Meitu and CapCut. By making it easy to convert any software, especially complex games, into a benchmark for evaluating agents' abilities and collecting further data, Cradle greatly extends the reach of foundation agents, paving the way for generalist agents.
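The GCC interface the abstract describes reduces to an observe-plan-act loop: capture raw pixels, ask an LMM to plan, and execute the code it emits against keyboard/mouse primitives. Below is a minimal, illustrative sketch of that loop in Python, assuming the widely used `pyautogui` library for screenshots and input control; `query_lmm` is a hypothetical placeholder for the multimodal model call, not Cradle's actual API, and the real framework interposes its six key modules between observation and action.

```python
# Minimal sketch of the GCC loop: screenshots in, executable keyboard/mouse
# code out. `query_lmm` is a hypothetical placeholder, not part of Cradle;
# pyautogui is assumed here for OS-level screenshot and input primitives.
import time

import pyautogui  # pip install pyautogui


def query_lmm(screenshot_path: str, task: str) -> str:
    """Placeholder for the LMM call: given the latest screenshot and the
    task description, return a snippet of Python driving pyautogui."""
    raise NotImplementedError("Plug in a multimodal model API here.")


def gcc_step(task: str, step: int) -> None:
    path = f"screenshot_{step:04d}.png"
    pyautogui.screenshot(path)            # unified observation: raw pixels
    action_code = query_lmm(path, task)   # high-level planning by the LMM
    # The returned skill code uses only low-level primitives, e.g.:
    #   pyautogui.moveTo(640, 360); pyautogui.click()
    #   pyautogui.press("w")
    exec(action_code, {"pyautogui": pyautogui})


def run(task: str, max_steps: int = 10) -> None:
    for step in range(max_steps):
        gcc_step(task, step)
        time.sleep(1.0)  # let the UI settle before the next observation


if __name__ == "__main__":
    run("Open Chrome and search for 'Cradle GCC'")
```

The key design choice is that the action space is executable code over keyboard/mouse primitives rather than a fixed, environment-specific action set, which is what lets the same loop drive a AAA game and an office application alike.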