OS-Copilot: Towards Generalist Computer Agents with Self-Improvement (2402.07456v2)

Published 12 Feb 2024 in cs.AI

Abstract: Autonomous interaction with the computer has been a longstanding challenge with great potential, and the recent proliferation of LLMs has markedly accelerated progress in building digital agents. However, most of these agents are designed to interact with a narrow domain, such as a specific software or website. This narrow focus constrains their applicability for general computer tasks. To this end, we introduce OS-Copilot, a framework to build generalist agents capable of interfacing with comprehensive elements in an operating system (OS), including the web, code terminals, files, multimedia, and various third-party applications. We use OS-Copilot to create FRIDAY, a self-improving embodied agent for automating general computer tasks. On GAIA, a general AI assistants benchmark, FRIDAY outperforms previous methods by 35%, showcasing strong generalization to unseen applications via accumulated skills from previous tasks. We also present numerical and qualitative evidence that FRIDAY learns to control and self-improve on Excel and PowerPoint with minimal supervision. Our OS-Copilot framework and empirical findings provide infrastructure and insights for future research toward more capable and general-purpose computer agents.


OS-Copilot: Enabling Generalist Computer Agents through Self-Improvement

Introduction to OS-Copilot and FRIDAY

OS-Copilot is a framework for building generalist computer agents on Linux and macOS. It provides a unified interface to the main ways of interacting with an operating system, including Python code interpretation, the bash terminal, mouse and keyboard control, and API calls, which lowers the barrier to building capable computer agents. The paper introduces FRIDAY, a self-improving embodied agent built on top of OS-Copilot to automate a wide range of computer tasks. FRIDAY performs well on automation tasks and can also learn to control unfamiliar applications with minimal external guidance.
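The unified-interface idea can be illustrated with a small dispatcher. This is a minimal sketch under stated assumptions, not OS-Copilot's actual API: the names `Result`, `run_python`, `run_shell`, `ENVIRONMENTS`, and `execute` are hypothetical, the system default shell stands in for the bash terminal, and mouse/keyboard control and API calls are left out.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Result:
    """Uniform result type returned by every environment (hypothetical)."""
    ok: bool
    output: str


def run_python(code: str) -> Result:
    # Run the snippet in a fresh interpreter so it cannot touch the caller's state.
    proc = subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True)
    return Result(proc.returncode == 0, (proc.stdout + proc.stderr).strip())


def run_shell(command: str) -> Result:
    # shell=True hands the command to the system default shell,
    # used here as a stand-in for a bash terminal.
    proc = subprocess.run(command, shell=True,
                          capture_output=True, text=True)
    return Result(proc.returncode == 0, (proc.stdout + proc.stderr).strip())


# One registry, one call signature: the agent picks an environment by name.
ENVIRONMENTS: Dict[str, Callable[[str], Result]] = {
    "python": run_python,
    "shell": run_shell,
}


def execute(kind: str, payload: str) -> Result:
    """Dispatch a payload to the named environment and return a uniform Result."""
    if kind not in ENVIRONMENTS:
        return Result(False, f"unsupported environment: {kind}")
    return ENVIRONMENTS[kind](payload)
```

Because every environment returns the same `Result` shape, an agent loop written against `execute` does not need to know which interaction method handled a given step.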

FRIDAY’s Architectural Overview

FRIDAY’s architecture combines three components: a planner, a configurator, and an actor. The planner decomposes a complex task into manageable subtasks organized as a directed acyclic graph, so that independent subtasks can be processed in parallel. The configurator, inspired by the memory systems of the human brain, consists of declarative memory, which stores user preferences and semantic knowledge, and procedural memory, which holds a tool repository. This design supports FRIDAY’s learning and adaptation by giving it a skill set that grows over time. The actor works in two stages, execution and self-criticism: it carries out each subtask in the operating system, using the universal runtime environment provided by OS-Copilot to operate across a wide range of applications.
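The graph-based planning step can be sketched with Python's standard `graphlib` module. This is only an illustration of scheduling subtasks from a directed acyclic graph, not FRIDAY's implementation: the function `plan_waves` and the subtask names are hypothetical, and in FRIDAY the graph would be produced by the planner rather than written by hand.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def plan_waves(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group subtasks into waves.

    `dependencies` maps each subtask to the set of subtasks it depends on.
    All subtasks in one wave have their dependencies satisfied, so they
    could be run in parallel.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock successors
    return waves


# Hypothetical decomposition of a reporting task.
subtasks = {
    "download_report": set(),
    "read_spreadsheet": set(),
    "compute_totals": {"read_spreadsheet"},
    "write_summary": {"download_report", "compute_totals"},
}

for wave in plan_waves(subtasks):
    print(wave)
```

Here the first wave contains the two subtasks with no dependencies, and each later wave becomes available only once the subtasks it depends on are marked done.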

Empirical Evaluation and Findings

FRIDAY was evaluated on GAIA, a benchmark for general AI assistants. It achieved a 40.86% success rate on level-1 tasks, a 35% relative improvement over previous methods. FRIDAY also demonstrated self-directed learning: after practice, it improved markedly on spreadsheet manipulation tasks that it previously could not solve. These results support the effectiveness of FRIDAY’s self-improvement mechanism and suggest it can go beyond what existing digital agents handle.
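To make the "35% relative improvement" concrete, the arithmetic can be worked through as follows. The baseline printed here is only the value implied by the two figures quoted above, assuming the 35% is measured against the best previous level-1 score and is not heavily rounded; it is not a number taken from the paper. The function names are hypothetical.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


def implied_baseline(new: float, relative_gain: float) -> float:
    """Baseline implied by a new score and a stated relative gain."""
    return new / (1.0 + relative_gain)


friday_level1 = 40.86  # success rate (%) quoted in the text
stated_gain = 0.35     # 35% relative improvement quoted in the text

baseline = implied_baseline(friday_level1, stated_gain)
print(f"implied baseline: {baseline:.2f}%")
print(f"absolute gap: {friday_level1 - baseline:.2f} percentage points")
```

The distinction matters when reading the result: a 35% relative improvement corresponds to a gap of roughly ten percentage points, not thirty-five.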

Implications and Future Directions

OS-Copilot and FRIDAY mark a meaningful advance for generalist computer agents. Beyond the reported results, the framework provides a foundation for future research on personalized digital assistants, multimodal agents, and agent learning in situated environments. FRIDAY’s adaptability and capacity for self-improvement point toward more autonomous computer agents that can handle an increasingly broad range of tasks. Looking forward, integrating visual input and action generation, along with richer multimodal interaction, is a promising way to extend OS-Copilot’s utility and FRIDAY’s versatility. Addressing open challenges in agent evaluation, safety, and interpretability is also critical for the practical deployment and acceptance of such assistants.

Conclusion

The OS-Copilot framework, augmented by the FRIDAY agent, represents a significant stride towards realizing highly capable and general-purpose computer agents. This development not only expands the horizon for digital assistance technology but also propels forward the discussion on the potential and implications of autonomous agents in our daily computing environments.