OS-Copilot: Autonomous OS Agent Framework
- OS-Copilot is a modular framework for autonomous OS agents that employs a refined sense–plan–act paradigm with integrated multi-timescale memory modules.
- Its architecture segments the workflow into a planner that creates executable subtasks, a configurator that consolidates diverse OS memories, and an actor that refines actions with critic feedback.
- The flagship instantiation, FRIDAY, demonstrates up to a 35% relative improvement in task success rate over GPT-4 Plugins on GAIA Level 1 by leveraging self-directed tool generation and memory-augmented learning.
OS-Copilot is a modular framework for constructing generalist computer agents capable of autonomous interaction with diverse elements of modern operating systems. Its architecture enables agents not only to perform a broad array of tasks—ranging from manipulating files and code terminals to operating across web, multimedia, and third-party applications—but also to self-improve over time via memory-augmented learning and tool discovery. The flagship instantiation, FRIDAY, demonstrates superior generalization and self-directed skill acquisition, outperforming prior domain-specialized and open-ended agents on challenging OS orchestration benchmarks (Wu et al., 12 Feb 2024).
1. System Architecture and Components
OS-Copilot operationalizes the “sense–plan–act” paradigm, extended with multi-timescale memory modules. Its high-level workflow decomposes as follows:
- Planner: Transforms a user’s natural-language request into a directed acyclic graph (DAG) or linear chain of executable subtasks. Nodes correspond to granular OS operations; edges encode explicit precedence constraints. Topological scheduling supports parallel dispatch of independent subtasks.
- Configurator: Mimicking human memory, it aggregates information from three knowledge pools:
  - Declarative memory: User profile (preferences, directories), semantic OS knowledge, and past execution trajectories.
  - Procedural memory: Repository of tools, initialized with a base set and extended dynamically with agent-generated, critic-scored Python classes.
  - Working memory: Contextualizes each subtask with relevant facts and tools, constructing the prompt for execution.
- Actor: Implements an executor–critic–refiner loop. The executor emits low-level actions (Python, Bash, HTTP API, mouse/keyboard automation); the critic evaluates post-action states, providing binary success/failure judgments, error explanations, and repair strategies; the refiner invokes up to three remediation cycles (code or parameter rewrite) when failure is detected.
The architectural relationships among planner, configurator, and actor are mediated through long-term (declarative, procedural) and short-term (working memory) stores.
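The sketch below illustrates one plausible way to wire these components together in Python; the class names, fields, and callables (Subtask, Critique, Configurator.retrieve, the executor/execute/critic/refiner arguments) are illustrative assumptions rather than the framework's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Subtask:
    """One node in the planner's DAG of executable OS operations."""
    description: str
    depends_on: list[str] = field(default_factory=list)

@dataclass
class Critique:
    """Critic output: binary judgment, error explanation, repair hint."""
    success: bool
    explanation: str = ""
    repair_hint: str = ""

class Configurator:
    """Assembles working memory: relevant facts plus candidate tools for a subtask."""
    def __init__(self, declarative: dict, procedural: dict):
        self.declarative, self.procedural = declarative, procedural

    def retrieve(self, subtask: Subtask) -> str:
        facts = self.declarative.get("user_profile", "")
        tools = ", ".join(self.procedural.keys())
        return f"Task: {subtask.description}\nFacts: {facts}\nTools: {tools}"

class Actor:
    """Executor-critic-refiner loop with at most three remediation cycles."""
    MAX_RETRIES = 3

    def run(self, prompt: str, executor, execute, critic, refiner) -> bool:
        for _ in range(self.MAX_RETRIES):
            action = executor(prompt)            # emit Python/Bash/API/GUI code for this subtask
            result = execute(action)             # apply it to the OS
            critique: Critique = critic(result)  # judge the post-action state
            if critique.success:
                return True
            prompt = refiner(prompt, critique)   # rewrite code or parameters and retry
        return False
```

The key design point mirrored here is that the Actor never loops more than three times: a failure after the final remediation cycle is surfaced rather than retried indefinitely.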
2. Formalism and Self-Improvement Objective
OS-Copilot is framed as learning to solve operating system tasks modeled by a Markov Decision Process (MDP):
- State $s_t$: Symbolic OS snapshot (file tree, open windows, environment, CWD) at timestep $t$.
- Action $a_t$: Single tool invocation, raw command, or GUI event.
- Transition $T(s_{t+1}\mid s_t, a_t)$: Deterministic or stochastic evolution of the OS state after an action.
- Reward $r_t$: Defined by success/failure, with positive reward for completion, negative reward for failure, and a "tool-generalization" bonus for high-quality tool generation.
The learning objective combines cumulative reward and skill acquisition:

$$\max_{\pi,\;\mathcal{T}}\;\; \mathbb{E}_{\pi}\!\left[\sum_{t} r_t\right] \;+\; \lambda \sum_{\tau \in \mathcal{T}} g(\tau),$$

where $\pi$ is the LLM–retrieval–based policy, $\mathcal{T}$ the evolving procedural toolset, $g(\tau)$ a critic-assigned generality score, and $\lambda$ a regularization hyperparameter. Tools with $g(\tau) \ge 8$ remain in memory.
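As a concrete reading of this objective, the snippet below sums step rewards and adds a λ-weighted bonus for retained tools; the threshold of 8 mirrors the critic score used in the paper's pseudocode, while the function names and the λ value are illustrative assumptions.

```python
RETENTION_THRESHOLD = 8  # critic generality score required to keep a tool (per the paper's pseudocode)

def objective(step_rewards: list[float],
              tool_scores: dict[str, float],
              lam: float = 0.1) -> float:
    """Cumulative reward plus a lambda-weighted skill-acquisition bonus (illustrative)."""
    retained = {name: g for name, g in tool_scores.items() if g >= RETENTION_THRESHOLD}
    return sum(step_rewards) + lam * sum(retained.values())

# Example: two successful steps plus one generated tool the critic scored 9/10.
print(objective([1.0, 1.0], {"audio_to_text": 9.0}))  # -> 2.9
```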
3. FRIDAY: Agent Realization and Self-Directed Learning
FRIDAY ("Fully Responsive Intelligence Devoted to Assisting You") exemplifies OS-Copilot’s design atop GPT-4-Turbo:
- Bootstrapping: Begins with 4 basic tools (web search, page loader, audio-to-text, image caption).
- Developmental Learning: On the GAIA dev-set (100 tasks), FRIDAY autonomously generates 9 new atomic tools that generalize across applications. All tool additions are verified by the internal critic using post hoc analysis.
- Continuous Skill Discovery: When a novel subtask arises for which no existing tool is adequate, an LLM-driven “Tool Generator” synthesizes a new Python class. If subsequent use achieves a critic score ≥ 8, the tool is persisted in procedural memory, closing a self-improvement loop (a minimal sketch follows this list).
- Memory and Reuse: Execution trajectories, including actions, outcomes, and critic feedback, are archived as semantic memory; procedural memory indexes tool metadata for retrieval and prioritizes high-generality tools for future planning.
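A minimal sketch of this skill-discovery and reuse loop, assuming hypothetical generate_tool_code and run_and_score callables in place of the actual LLM-driven Tool Generator and critic:

```python
class ProceduralMemory:
    """Stores generated tool classes keyed by name with critic-assigned generality scores."""
    def __init__(self) -> None:
        self.tools: dict[str, dict] = {}

    def add(self, name: str, source: str, generality: float) -> None:
        self.tools[name] = {"source": source, "generality": generality}

    def retrieve(self, top_k: int = 5) -> list[str]:
        # Prioritize high-generality tools when configuring future subtasks.
        ranked = sorted(self.tools.items(), key=lambda kv: kv[1]["generality"], reverse=True)
        return [name for name, _ in ranked[:top_k]]

def discover_tool(subtask: str, memory: ProceduralMemory, generate_tool_code, run_and_score) -> bool:
    """Generate a candidate tool, try it on the subtask, and persist it only if the critic score is >= 8."""
    name, source = generate_tool_code(subtask)  # hypothetical LLM call -> (tool_name, python_source)
    score = run_and_score(source, subtask)      # hypothetical: execute the tool, critic rates outcome 0-10
    if score >= 8:
        memory.add(name, source, score)
        return True
    return False
```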
4. OS Element Interaction and Control APIs
OS-Copilot unifies interaction across four primary OS interfaces:
| Interface | Example Use Case | Implementation |
|---|---|---|
| Python Interpreter | openpyxl for Excel | Bash subprocess or direct import |
| Bash Shell | File operations, installs | OS shell subprocess |
| HTTP REST APIs | Online services, LLM calls | Requests, web APIs |
| Mouse/Keyboard Sim. | GUI automation (SeeClick) | Scripted events, visual grounding |
When automating GUI-exclusive applications, e.g., native Excel or PowerPoint, GUI elements are manipulated by visually grounded tools (e.g., SeeClick). For programmatic interfaces, native Python libraries (openpyxl, python-pptx) are leveraged. Each new tool created for these applications is subject to the critic’s success criteria.
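One way to picture the dispatch across these four interfaces is the hedged sketch below; the action schema, field names, and the use of pyautogui for GUI events are assumptions for illustration, not the framework's actual tool implementations.

```python
import subprocess
import requests

def dispatch(action: dict) -> str:
    """Route a single agent action to one of the four OS interfaces (illustrative schema)."""
    kind = action["interface"]
    if kind == "python":
        # Generated Python (e.g., openpyxl manipulation) run in a fresh interpreter process.
        out = subprocess.run(["python", "-c", action["code"]], capture_output=True, text=True)
        return out.stdout + out.stderr
    if kind == "bash":
        # File operations, package installs, etc.
        out = subprocess.run(action["command"], shell=True, capture_output=True, text=True)
        return out.stdout + out.stderr
    if kind == "http":
        # Online services and LLM calls via REST.
        resp = requests.request(action.get("method", "GET"), action["url"], json=action.get("body"))
        return resp.text
    if kind == "gui":
        # Visually grounded clicks (e.g., SeeClick-predicted coordinates) via a GUI automation library.
        import pyautogui  # assumed dependency for mouse/keyboard simulation
        pyautogui.click(*action["coords"])
        return "clicked"
    raise ValueError(f"Unknown interface: {kind}")
```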
5. Benchmark Evaluation and Empirical Analysis
Evaluation on the GAIA benchmark (466 tasks spanning difficulty Levels 1–3) yields the following task success rates:
| System | Level 1 | Level 2 | Level 3 |
|---|---|---|---|
| GPT-4 Plugins | 30.30% | 9.70% | 0% |
| AutoGPT-4 | 15.05% | 0.63% | 0% |
| FRIDAY (no tool learning) | 36.56% | 17.61% | 6.12% |
| FRIDAY | 40.86% | 20.13% | 6.12% |
On Level 1, FRIDAY's 40.86% represents a roughly 35% relative improvement over GPT-4 Plugins' 30.30%. Ablating tool learning costs about 4 points on Level 1 (40.86% → 36.56%) and about 2.5 points on Level 2, quantifying the impact of continual self-directed tool acquisition (Wu et al., 12 Feb 2024).
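The 35% figure is a relative rather than absolute gain; the quick check below reproduces it from the Level 1 numbers in the table.

```python
gpt4_plugins, friday = 30.30, 40.86                      # GAIA Level 1 success rates (%)
print(f"{(friday - gpt4_plugins) / gpt4_plugins:.1%}")   # ~34.9%, i.e., the reported ~35% relative gain
```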
On spreadsheet automation (SheetCopilot-20), off-the-shelf GPT-4 attains 55% task success; FRIDAY, after generating 8 new domain-specific tools, surpasses this with 60%. Qualitative analysis on PowerPoint tasks demonstrates the agent’s ability to synthesize tools for font, spacing, and image modification, successfully composing slides to user specification.
6. Algorithmic Workflow and Pseudocode
Subtask execution and self-directed learning are formalized as follows:
Subtask Execution Loop:
```
for each user_request:
    G ← Planner(user_request)                     # DAG of subtasks
    for node in topo_sort(G):
        cfg ← Configurator.retrieve(node)
        if no_tool_similar(cfg, node):
            new_tool ← ToolGenerator(node, cfg)
            save new_tool temporarily
        repeat up to 3 times:
            action ← Executor(cfg, node)
            result ← execute(action)
            critique ← Critic(result, node, cfg)
            if critique.judge == True:
                if new_tool exists and critique.score ≥ 8:
                    ProceduralMemory.add(new_tool)
                break
            cfg ← Refiner(cfg, critique)          # refine code or parameters
```
Self-Directed Learning Loop:
```
given learning_goal:
    curriculum ← LLM_propose_tasks(learning_goal)
    for subgoal in curriculum:
        success ← FRIDAY.solve(subgoal)
        # as above, new tools with score ≥ 8 added to memory
```
7. Limitations and Research Challenges
Key unresolved challenges include:
- Prompting vs. Fine-Tuning: Current performance is sensitive to prompt engineering; large-scale reinforcement learning (RL) remains aspirational due to the scarcity of trajectory data. An OpenAI Gym–style API is implemented to enable prospective RL/fine-tuning (a minimal sketch of such an interface follows this list).
- Coverage for Multimodal and Closed-Source GUIs: Proprietary GUI automation demands robust screenshot-to-action models and visual grounding—areas addressed in contemporary frameworks such as CogAgent and SeeClick.
- Evaluation Robustness: Subtask success is inferred via OS state deltas (pre/post-snapshot comparison); lacking ground-truth, this inference is susceptible to brittleness, necessitating heuristic or LLM-augmented validation.
- Safety and Interpretability: Requirements for action-level justification and side-effect defense remain partially realized; integrating critic-based natural-language rationales and safety constraints is an ongoing direction.
- Personalization and Lifelong Learning: Scaling user profile adaptation and workflow personalization to millions of unique users poses challenges in memory architecture and retrieval.
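To make the Gym-style hook and the state-delta evaluation heuristic concrete, here is a deliberately minimal sketch under stated assumptions: the snapshot contents, single-step episodes, and reward rule are illustrative choices, not the framework's released environment API.

```python
import os
import subprocess

def snapshot(root: str = ".") -> dict:
    """Symbolic OS snapshot for pre/post comparison: file-tree mtimes plus CWD (illustrative)."""
    files = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            files[path] = os.path.getmtime(path)
    return {"cwd": os.getcwd(), "files": files}

class OSTaskEnv:
    """Gym-style wrapper: reset() captures a state snapshot; step() runs a shell action
    and infers reward from the pre/post state delta (the brittle heuristic noted above)."""

    def reset(self) -> dict:
        self.state = snapshot()
        return self.state

    def step(self, action: str):
        before = self.state
        result = subprocess.run(action, shell=True, capture_output=True, text=True)
        after = snapshot()
        delta = set(after["files"].items()) ^ set(before["files"].items())
        # Heuristic: positive reward only if the command succeeded and the OS state actually changed.
        reward = 1.0 if (result.returncode == 0 and delta) else -1.0
        self.state = after
        done = True  # single-step episodes for illustration
        info = {"stdout": result.stdout, "stderr": result.stderr, "changed_files": len(delta)}
        return after, reward, done, info
```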
This suggests that advances in memory consolidation/retrieval, multimodal control APIs, and scalable RL will be central to future general-purpose OS agents. The collaborative, memory-augmented architecture demonstrated by OS-Copilot provides foundational infrastructure for these trajectories (Wu et al., 12 Feb 2024).