OS-Copilot: Autonomous OS Agent Framework

Updated 11 December 2025
  • OS-Copilot is a modular framework for autonomous OS agents that employs a refined sense–plan–act paradigm with integrated multi-timescale memory modules.
  • Its architecture segments the workflow into a planner that creates executable subtasks, a configurator that consolidates diverse OS memories, and an actor that refines actions with critic feedback.
  • The flagship instantiation, FRIDAY, demonstrates up to a 35% improvement in task success rates by leveraging self-directed tool generation and memory-augmented learning.

OS-Copilot is a modular framework for constructing generalist computer agents capable of autonomous interaction with diverse elements of modern operating systems. Its architecture enables agents not only to perform a broad array of tasks—ranging from manipulating files and code terminals to operating across web, multimedia, and third-party applications—but also to self-improve over time via memory-augmented learning and tool discovery. The flagship instantiation, FRIDAY, demonstrates superior generalization and self-directed skill acquisition, outperforming prior domain-specialized and open-ended agents on challenging OS orchestration benchmarks (Wu et al., 12 Feb 2024).

1. System Architecture and Components

OS-Copilot operationalizes the “sense–plan–act” paradigm, extended with multi-timescale memory modules. Its high-level workflow decomposes as follows:

  • Planner: Transforms a user’s natural-language request into a directed acyclic graph (DAG) or linear chain of executable subtasks. Nodes correspond to granular OS operations; edges encode explicit precedence constraints. Topological scheduling supports parallel dispatch of independent subtasks.
  • Configurator: Mimicking human memory, it aggregates information from three knowledge pools:
    • Declarative memory: User profile (preferences, directories), semantic OS knowledge, past trajectories.
    • Procedural memory: Repository of tools, initiated with a base set and extended dynamically by agent-generated, critic-scored Python classes.
    • Working memory: Contextualizes each subtask with relevant facts and tools, constructing the prompt for execution.
  • Actor: Implements an executor–critic–refiner loop. The executor emits low-level actions (Python, Bash, HTTP API, mouse/keyboard automation); the critic evaluates post-action states, providing binary success/failure judgments, error explanations, and repair strategies; the refiner invokes up to three remediation cycles (code or parameter rewrite) when failure is detected.

The architectural relationships among planner, configurator, and actor are mediated through long-term (declarative, procedural) and short-term (working memory) stores.
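
As a concrete illustration of the planner output described above, the sketch below builds a small subtask DAG and dispatches independent nodes in topologically ordered batches. The node names and the adjacency-list representation are illustrative assumptions; the paper specifies only that nodes are granular OS operations and edges encode precedence constraints.

# Minimal sketch of DAG-based subtask scheduling; node names are illustrative.
from collections import deque

def topological_batches(nodes, edges):
    """Group subtasks into batches with no mutual precedence constraints,
    so each batch can be dispatched in parallel."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for before, after in edges:          # edge = (prerequisite, dependent)
        children[before].append(after)
        indegree[after] += 1

    ready = deque(n for n in nodes if indegree[n] == 0)
    batches = []
    while ready:
        batch = list(ready)
        ready.clear()
        batches.append(batch)
        for n in batch:
            for child in children[n]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
    return batches

# Example: two independent subtasks can run in parallel; the third waits on both.
nodes = ["download_sheet", "convert_audio", "write_summary"]
edges = [("download_sheet", "write_summary"), ("convert_audio", "write_summary")]
print(topological_batches(nodes, edges))
# [['download_sheet', 'convert_audio'], ['write_summary']]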

2. Formalism and Self-Improvement Objective

OS-Copilot is framed as learning to solve operating system tasks modeled by a Markov Decision Process (MDP):

  • State $s_t$: Symbolic OS snapshot (file tree, open windows, environment, CWD) at timestep $t$.
  • Action $a_t$: Single tool invocation, raw command, or GUI event.
  • Transition $P(s_{t+1} \mid s_t, a_t)$: Deterministic or stochastic evolution after the action.
  • Reward $r(s_t, a_t)$: Defined by success/failure, with positive reward for completion, negative for failure, and a “tool-generalization” bonus for high-quality tool generation.

The learning objective combines cumulative reward and skill acquisition:

$$\max_{\pi,\,\mathcal{T}} \;\; \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{T} r(s_t, a_t)\right] + \lambda \sum_{t \in \mathcal{T}} g(t)$$

where $\pi$ is the LLM-retrieval-based policy, $\mathcal{T}$ the evolving procedural toolset, $g(t) \in [0,10]$ a critic-assigned generality score, and $\lambda$ a regularization hyperparameter. Tools with $g(t) \geq 8$ remain in memory.
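
For concreteness, the minimal sketch below scores this objective for one logged episode. The reward values, the λ setting, and the tool names are made-up illustrations, not values from the paper; only the retention threshold $g(t) \geq 8$ comes from the text above.

# Illustrative scoring of the self-improvement objective for one episode.
LAMBDA = 0.5          # regularization weight (assumed, not from the paper)
KEEP_THRESHOLD = 8    # tools with g(t) >= 8 are retained in procedural memory

step_rewards = [1.0, -1.0, 1.0, 1.0]                       # r(s_t, a_t) per executed action
tool_generality = {"pptx_set_font": 9, "tmp_parser": 5}    # critic scores g(t) in [0, 10]

objective = sum(step_rewards) + LAMBDA * sum(tool_generality.values())
retained = [name for name, g in tool_generality.items() if g >= KEEP_THRESHOLD]

print(objective)   # 2.0 + 0.5 * 14 = 9.0
print(retained)    # ['pptx_set_font']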

3. FRIDAY: Agent Realization and Self-Directed Learning

FRIDAY ("Fully Responsive Intelligence Devoted to Assisting You") exemplifies OS-Copilot’s design atop GPT-4-Turbo:

  • Bootstrapping: Begins with 4 basic tools (web search, page loader, audio-to-text, image caption).
  • Developmental Learning: On the GAIA dev-set (~100 tasks), FRIDAY autonomously generates ~9 new atomic tools that generalize across applications. All tool additions are verified by the internal critic using post hoc analysis.
  • Continuous Skill Discovery: When FRIDAY encounters a novel subtask for which no existing tool is adequate, an LLM-driven “Tool Generator” writes a new Python class. If subsequent use achieves a critic score $g(t) \geq 8$, the tool is persisted in procedural memory, closing the self-improvement loop (a minimal sketch of this persistence rule follows the list).
  • Memory and Reuse: Execution trajectories, including actions, outcomes, and critic feedback, are archived as semantic memory; procedural memory indexes tool metadata for retrieval and prioritizes high-generality tools for future planning.
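
The sketch below illustrates the persistence and retrieval behavior just described: a generated tool is kept only if the critic's generality score reaches the threshold, and later retrieval prefers high-generality tools. The class and field names are assumptions for illustration, and the keyword-match retrieval stands in for whatever retrieval mechanism the framework actually uses.

# Sketch of procedural-memory bookkeeping; names and fields are illustrative.
from dataclasses import dataclass, field

@dataclass
class ToolRecord:
    name: str
    description: str     # metadata used for retrieval
    source_code: str     # agent-generated Python class
    generality: int      # critic-assigned score in [0, 10]

@dataclass
class ProceduralMemory:
    tools: list = field(default_factory=list)

    def maybe_add(self, tool: ToolRecord, threshold: int = 8) -> bool:
        """Persist a generated tool only if the critic deems it general enough."""
        if tool.generality >= threshold:
            self.tools.append(tool)
            return True
        return False

    def retrieve(self, query: str, k: int = 3):
        """Toy retrieval: keyword match on metadata, highest generality first."""
        hits = [t for t in self.tools if query.lower() in t.description.lower()]
        return sorted(hits, key=lambda t: t.generality, reverse=True)[:k]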

4. OS Element Interaction and Control APIs

OS-Copilot unifies interaction across four primary OS interfaces:

| Interface           | Example Use Case           | Implementation                     |
|---------------------|----------------------------|------------------------------------|
| Python Interpreter  | openpyxl for Excel         | Bash subprocess or direct import   |
| Bash Shell          | File operations, installs  | OS shell subprocess                |
| HTTP REST APIs      | Online services, LLM calls | Requests, web APIs                 |
| Mouse/Keyboard Sim. | GUI automation (SeeClick)  | Scripted events, visual grounding  |

When automating GUI-exclusive applications, e.g., native Excel or PowerPoint, GUI elements are manipulated by visually grounded tools (e.g., SeeClick). For programmatic interfaces, native Python libraries (openpyxl, python-pptx) are leveraged. Each new tool created for these applications is subject to the critic’s success criteria.
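
To illustrate the programmatic path (as opposed to visually grounded GUI control), the snippet below shows the kind of self-contained tool the agent might generate for a spreadsheet edit via openpyxl. The class name, method signature, and file path are hypothetical; only the openpyxl calls themselves are standard library usage.

# Hypothetical example of a generated spreadsheet tool using openpyxl
# (programmatic interface); GUI-only applications would instead be driven
# through visually grounded mouse/keyboard events.
from openpyxl import load_workbook

class SetCellValue:
    """Write a value into a named cell of an existing workbook."""

    def run(self, path: str, sheet: str, cell: str, value) -> str:
        wb = load_workbook(path)
        ws = wb[sheet]
        ws[cell] = value
        wb.save(path)
        return f"Set {sheet}!{cell} to {value!r} in {path}"

# Example invocation (path is illustrative):
# SetCellValue().run("report.xlsx", "Sheet1", "B2", 42)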

5. Benchmark Evaluation and Empirical Analysis

Evaluation on the GAIA benchmark (466 tasks spanning difficulty Levels 1–3) yields the following task success rates:

| System                    | Level 1 | Level 2 | Level 3 |
|---------------------------|---------|---------|---------|
| GPT-4 Plugins             | 30.30%  | 9.70%   | 0%      |
| AutoGPT-4                 | 15.05%  | 0.63%   | 0%      |
| FRIDAY (no tool learning) | 36.56%  | 17.61%  | 6.12%   |
| FRIDAY                    | 40.86%  | 20.13%  | 6.12%   |

FRIDAY records a roughly 35% relative improvement over GPT-4 Plugins on Level 1 (40.86% vs. 30.30%). Ablating tool learning reduces Level 1 performance by about 4 percentage points, quantifying the impact of continual self-directed tool acquisition (Wu et al., 12 Feb 2024).

On spreadsheet automation (SheetCopilot-20), off-the-shelf GPT-4 attains 55% task success; FRIDAY, after generating 8 new domain-specific tools, surpasses this with 60%. Qualitative analysis on PowerPoint tasks demonstrates the agent’s ability to synthesize tools for font, spacing, and image modification, successfully composing slides to user specification.

6. Algorithmic Workflow and Pseudocode

Subtask execution and self-directed learning are formalized as follows:

Subtask Execution Loop:

for each user_request:
    G ← Planner(user_request)      # DAG of subtasks
    for node in topo_sort(G):
        cfg ← Configurator.retrieve(node)
        if no_tool_similar(cfg, node):
            new_tool ← ToolGenerator(node, cfg)
            save new_tool temporarily
        repeat up to 3 times:
            action ← Executor(cfg, node)
            result ← execute(action)
            critique ← Critic(result, node, cfg)
            if critique.judge == True:
                if new_tool exists and critique.score ≥ 8:
                    ProceduralMemory.add(new_tool)
                break
            cfg ← Refiner(cfg, critique)  # refine code or parameters

Self-Directed Learning:

given learning_goal:
    curriculum ← LLM_propose_tasks(learning_goal)
    for subgoal in curriculum:
        success ← FRIDAY.solve(subgoal)
        # as above, new tools with score ≥ 8 are added to memory

This structure supports the iterative refinement and accumulation of reusable skills, with critic-guided tool validation embedded in the workflow.
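
The execution loop above assumes the critic returns a structured verdict (it reads critique.judge and critique.score). A minimal sketch of what such a record might contain is shown below; the field names are assumptions consistent with the loop, not the framework's actual schema.

# Assumed shape of the critic's verdict, matching the fields referenced in the
# execution loop above; error_report and repair_hint correspond to the error
# explanations and repair strategies consumed by the actor's refiner.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Critique:
    judge: bool                    # did the subtask succeed?
    score: int                     # tool generality score g(t) in [0, 10]
    error_report: Optional[str]    # natural-language explanation on failure
    repair_hint: Optional[str]     # suggested code/parameter fix for the refiner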

7. Limitations and Research Challenges

Key unresolved challenges include:

  • Prompting vs. Fine-Tuning: Current performance is sensitive to prompt engineering; large-scale reinforcement learning (RL) remains aspirational because of the scarcity of trajectory data. An OpenAI Gym-style API is implemented to enable prospective RL or fine-tuning (a minimal interface sketch follows this list).
  • Coverage for Multimodal and Closed-Source GUIs: Proprietary GUI automation demands robust screenshot-to-action models and visual grounding—areas addressed in contemporary frameworks such as CogAgent and SeeClick.
  • Evaluation Robustness: Subtask success is inferred from OS state deltas (pre/post-snapshot comparison); without ground-truth labels this inference can be brittle, necessitating heuristic or LLM-augmented validation.
  • Safety and Interpretability: Requirements for action-level justification and side-effect defense remain partially realized; integrating critic-based natural-language rationales and safety constraints is an ongoing direction.
  • Personalization and Lifelong Learning: Scaling user profile adaptation and workflow personalization to millions of unique users poses challenges in memory architecture and retrieval.
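
The skeleton below sketches what a Gym-style wrapper over the OS environment could look like, assuming symbolic state snapshots as observations and tool invocations or commands as actions. The class name, method bodies, and reward values are placeholders, not the framework's actual API.

# Sketch of a Gym-style interface for prospective RL or fine-tuning.
# Observation/action encodings and rewards are assumed placeholders.
class OSCopilotEnv:
    def reset(self, task: str) -> dict:
        """Start a task and return a symbolic OS snapshot
        (file tree, open windows, environment, cwd)."""
        self.task = task
        return self._snapshot()

    def step(self, action: str):
        """Execute one tool invocation or command and return
        (observation, reward, done, info), Gym-style."""
        result = self._execute(action)          # run code, shell, API, or GUI event
        obs = self._snapshot()
        done = self._task_complete(obs)
        reward = 1.0 if done else (-1.0 if result.get("error") else 0.0)
        return obs, reward, done, {"result": result}

    # The helpers below would wrap the actor's executor and the critic's
    # state-delta check; they are left abstract in this sketch.
    def _snapshot(self) -> dict:
        raise NotImplementedError

    def _execute(self, action: str) -> dict:
        raise NotImplementedError

    def _task_complete(self, obs: dict) -> bool:
        raise NotImplementedError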

This suggests that advances in memory consolidation/retrieval, multimodal control APIs, and scalable RL will be central to future general-purpose OS agents. The collaborative, memory-augmented architecture demonstrated by OS-Copilot provides foundational infrastructure for these trajectories (Wu et al., 12 Feb 2024).
