OS-Copilot: Autonomous OS Agent Framework
- OS-Copilot is a modular framework for autonomous OS agents that employs a refined sense–plan–act paradigm with integrated multi-timescale memory modules.
- Its architecture segments the workflow into a planner that creates executable subtasks, a configurator that consolidates diverse OS memories, and an actor that refines actions with critic feedback.
- The flagship instantiation, FRIDAY, demonstrates up to a 35% relative improvement in task success rate over GPT-4 Plugins on GAIA Level 1 by leveraging self-directed tool generation and memory-augmented learning.
OS-Copilot is a modular framework for constructing generalist computer agents capable of autonomous interaction with diverse elements of modern operating systems. Its architecture enables agents not only to perform a broad array of tasks—ranging from manipulating files and code terminals to operating across web, multimedia, and third-party applications—but also to self-improve over time via memory-augmented learning and tool discovery. The flagship instantiation, FRIDAY, demonstrates superior generalization and self-directed skill acquisition, outperforming prior domain-specialized and open-ended agents on challenging OS orchestration benchmarks (Wu et al., 12 Feb 2024).
1. System Architecture and Components
OS-Copilot operationalizes the “sense–plan–act” paradigm, extended with multi-timescale memory modules. Its high-level workflow decomposes as follows:
- Planner: Transforms a user’s natural-language request into a directed acyclic graph (DAG) or linear chain of executable subtasks. Nodes correspond to granular OS operations; edges encode explicit precedence constraints. Topological scheduling supports parallel dispatch of independent subtasks.
- Configurator: Mimicking human memory, it aggregates information from three knowledge pools:
  - Declarative memory: User profile (preferences, directories), semantic OS knowledge, and past execution trajectories.
  - Procedural memory: Repository of tools, initialized with a base set and extended dynamically with agent-generated, critic-scored Python classes.
  - Working memory: Contextualizes each subtask with relevant facts and tools, constructing the prompt for execution.
- Actor: Implements an executor–critic–refiner loop. The executor emits low-level actions (Python, Bash, HTTP API, mouse/keyboard automation); the critic evaluates post-action states, providing binary success/failure judgments, error explanations, and repair strategies; the refiner invokes up to three remediation cycles (code or parameter rewrite) when failure is detected.
The architectural relationships among planner, configurator, and actor are mediated through long-term (declarative, procedural) and short-term (working memory) stores.
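The sketch below illustrates one plausible way to wire these components together in Python; the class names, fields, and callables (Subtask, Critique, Configurator.retrieve, the executor/execute/critic/refiner arguments) are illustrative assumptions rather than the framework's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Subtask:
    """One node in the planner's DAG of executable OS operations."""
    description: str
    depends_on: list[str] = field(default_factory=list)

@dataclass
class Critique:
    """Critic output: binary judgment, error explanation, repair hint."""
    success: bool
    explanation: str = ""
    repair_hint: str = ""

class Configurator:
    """Assembles working memory: relevant facts plus candidate tools for a subtask."""
    def __init__(self, declarative: dict, procedural: dict):
        self.declarative, self.procedural = declarative, procedural

    def retrieve(self, subtask: Subtask) -> str:
        facts = self.declarative.get("user_profile", "")
        tools = ", ".join(self.procedural.keys())
        return f"Task: {subtask.description}\nFacts: {facts}\nTools: {tools}"

class Actor:
    """Executor-critic-refiner loop with at most three remediation cycles."""
    MAX_RETRIES = 3

    def run(self, prompt: str, executor, execute, critic, refiner) -> bool:
        for _ in range(self.MAX_RETRIES):
            action = executor(prompt)            # emit Python/Bash/API/GUI code for this subtask
            result = execute(action)             # apply it to the OS
            critique: Critique = critic(result)  # judge the post-action state
            if critique.success:
                return True
            prompt = refiner(prompt, critique)   # rewrite code or parameters and retry
        return False
```

The key design point mirrored here is that the Actor never loops more than three times: a failure after the final remediation cycle is surfaced rather than retried indefinitely.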
2. Formalism and Self-Improvement Objective
OS-Copilot is framed as learning to solve operating system tasks modeled by a Markov Decision Process (MDP):
- State $s_t$: Symbolic OS snapshot (file tree, open windows, environment, CWD) at timestep $t$.
- Action $a_t$: Single tool invocation, raw command, or GUI event.
- Transition $T(s_{t+1}\mid s_t, a_t)$: Deterministic or stochastic evolution of the OS state after an action.
- Reward $r_t$: Defined by success/failure, with positive reward for completion, negative reward for failure, and a "tool-generalization" bonus for high-quality tool generation.
The learning objective combines cumulative reward and skill acquisition:

$$\max_{\pi,\;\mathcal{T}}\;\; \mathbb{E}_{\pi}\!\left[\sum_{t} r_t\right] \;+\; \lambda \sum_{\tau \in \mathcal{T}} g(\tau),$$

where $\pi$ is the LLM–retrieval–based policy, $\mathcal{T}$ the evolving procedural toolset, $g(\tau)$ a critic-assigned generality score, and $\lambda$ a regularization hyperparameter. Tools with $g(\tau) \ge 8$ remain in memory.
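As a concrete reading of this objective, the snippet below sums step rewards and adds a λ-weighted bonus for retained tools; the threshold of 8 mirrors the critic score used in the paper's pseudocode, while the function names and the λ value are illustrative assumptions.

```python
RETENTION_THRESHOLD = 8  # critic generality score required to keep a tool (per the paper's pseudocode)

def objective(step_rewards: list[float],
              tool_scores: dict[str, float],
              lam: float = 0.1) -> float:
    """Cumulative reward plus a lambda-weighted skill-acquisition bonus (illustrative)."""
    retained = {name: g for name, g in tool_scores.items() if g >= RETENTION_THRESHOLD}
    return sum(step_rewards) + lam * sum(retained.values())

# Example: two successful steps plus one generated tool the critic scored 9/10.
print(objective([1.0, 1.0], {"audio_to_text": 9.0}))  # -> 2.9
```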
3. FRIDAY: Agent Realization and Self-Directed Learning
FRIDAY ("Fully Responsive Intelligence Devoted to Assisting You") exemplifies OS-Copilot’s design atop GPT-4-Turbo:
- Bootstrapping: Begins with 4 basic tools (web search, page loader, audio-to-text, image caption).
- Developmental Learning: On the GAIA dev-set (100 tasks), FRIDAY autonomously generates 9 new atomic tools that generalize across applications. All tool additions are verified by the internal critic using post hoc analysis.
- Continuous Skill Discovery: When a novel subtask arises for which no existing tool is adequate, an LLM-driven “Tool Generator” synthesizes a new Python class. If subsequent use achieves a critic score ≥ 8, the tool is persisted in procedural memory, closing a self-improvement loop (a minimal sketch follows this list).
- Memory and Reuse: Execution trajectories, including actions, outcomes, and critic feedback, are archived as semantic memory; procedural memory indexes tool metadata for retrieval and prioritizes high-generality tools for future planning.
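A minimal sketch of this skill-discovery and reuse loop, assuming hypothetical generate_tool_code and run_and_score callables in place of the actual LLM-driven Tool Generator and critic:

```python
class ProceduralMemory:
    """Stores generated tool classes keyed by name with critic-assigned generality scores."""
    def __init__(self) -> None:
        self.tools: dict[str, dict] = {}

    def add(self, name: str, source: str, generality: float) -> None:
        self.tools[name] = {"source": source, "generality": generality}

    def retrieve(self, top_k: int = 5) -> list[str]:
        # Prioritize high-generality tools when configuring future subtasks.
        ranked = sorted(self.tools.items(), key=lambda kv: kv[1]["generality"], reverse=True)
        return [name for name, _ in ranked[:top_k]]

def discover_tool(subtask: str, memory: ProceduralMemory, generate_tool_code, run_and_score) -> bool:
    """Generate a candidate tool, try it on the subtask, and persist it only if the critic score is >= 8."""
    name, source = generate_tool_code(subtask)  # hypothetical LLM call -> (tool_name, python_source)
    score = run_and_score(source, subtask)      # hypothetical: execute the tool, critic rates outcome 0-10
    if score >= 8:
        memory.add(name, source, score)
        return True
    return False
```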
4. OS Element Interaction and Control APIs
OS-Copilot unifies interaction across four primary OS interfaces:
| Interface | Example Use Case | Implementation |
|---|---|---|
| Python Interpreter | openpyxl for Excel | Bash subprocess or direct import |
| Bash Shell | File operations, installs | OS shell subprocess |
| HTTP REST APIs | Online services, LLM calls | Requests, web APIs |
| Mouse/Keyboard Sim. | GUI automation (SeeClick) | Scripted events, visual grounding |
When automating GUI-exclusive applications, e.g., native Excel or PowerPoint, GUI elements are manipulated by visually grounded tools (e.g., SeeClick). For programmatic interfaces, native Python libraries (openpyxl, python-pptx) are leveraged. Each new tool created for these applications is subject to the critic’s success criteria.
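One way to picture the dispatch across these four interfaces is the hedged sketch below; the action schema, field names, and the use of pyautogui for GUI events are assumptions for illustration, not the framework's actual tool implementations.

```python
import subprocess
import requests

def dispatch(action: dict) -> str:
    """Route a single agent action to one of the four OS interfaces (illustrative schema)."""
    kind = action["interface"]
    if kind == "python":
        # Generated Python (e.g., openpyxl manipulation) run in a fresh interpreter process.
        out = subprocess.run(["python", "-c", action["code"]], capture_output=True, text=True)
        return out.stdout + out.stderr
    if kind == "bash":
        # File operations, package installs, etc.
        out = subprocess.run(action["command"], shell=True, capture_output=True, text=True)
        return out.stdout + out.stderr
    if kind == "http":
        # Online services and LLM calls via REST.
        resp = requests.request(action.get("method", "GET"), action["url"], json=action.get("body"))
        return resp.text
    if kind == "gui":
        # Visually grounded clicks (e.g., SeeClick-predicted coordinates) via a GUI automation library.
        import pyautogui  # assumed dependency for mouse/keyboard simulation
        pyautogui.click(*action["coords"])
        return "clicked"
    raise ValueError(f"Unknown interface: {kind}")
```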
5. Benchmark Evaluation and Empirical Analysis
Evaluation on the GAIA benchmark (466 tasks spanning difficulty Levels 1–3) yields the following task success rates:
| System | Level 1 | Level 2 | Level 3 |
|---|---|---|---|
| GPT-4 Plugins | 30.30% | 9.70% | 0% |
| AutoGPT-4 | 15.05% | 0.63% | 0% |
| FRIDAY (no tool learning) | 36.56% | 17.61% | 6.12% |
| FRIDAY | 40.86% | 20.13% | 6.12% |
On Level 1, FRIDAY's 40.86% represents a roughly 35% relative improvement over GPT-4 Plugins' 30.30%. Ablating tool learning costs about 4 points on Level 1 (40.86% → 36.56%) and about 2.5 points on Level 2, quantifying the impact of continual self-directed tool acquisition (Wu et al., 12 Feb 2024).
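The 35% figure is a relative rather than absolute gain; the quick check below reproduces it from the Level 1 numbers in the table.

```python
gpt4_plugins, friday = 30.30, 40.86                      # GAIA Level 1 success rates (%)
print(f"{(friday - gpt4_plugins) / gpt4_plugins:.1%}")   # ~34.9%, i.e., the reported ~35% relative gain
```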
On spreadsheet automation (SheetCopilot-20), off-the-shelf GPT-4 attains 55% task success; FRIDAY, after generating 8 new domain-specific tools, surpasses this with 60%. Qualitative analysis on PowerPoint tasks demonstrates the agent’s ability to synthesize tools for font, spacing, and image modification, successfully composing slides to user specification.
6. Algorithmic Workflow and Pseudocode
Subtask execution and self-directed learning are formalized as follows:
Subtask Execution Loop:
```
for each user_request:
    G ← Planner(user_request)                     # DAG of subtasks
    for node in topo_sort(G):
        cfg ← Configurator.retrieve(node)
        if no_tool_similar(cfg, node):
            new_tool ← ToolGenerator(node, cfg)
            save new_tool temporarily
        repeat up to 3 times:
            action ← Executor(cfg, node)
            result ← execute(action)
            critique ← Critic(result, node, cfg)
            if critique.judge == True:
                if new_tool exists and critique.score ≥ 8:
                    ProceduralMemory.add(new_tool)
                break
            cfg ← Refiner(cfg, critique)          # refine code or parameters
```
Self-Directed Learning Loop:
```
given learning_goal:
    curriculum ← LLM_propose_tasks(learning_goal)
    for subgoal in curriculum:
        success ← FRIDAY.solve(subgoal)
        # as above, new tools with score ≥ 8 added to memory
```
7. Limitations and Research Challenges
Key unresolved challenges include:
- Prompting vs. Fine-Tuning: Current performance is sensitive to prompt engineering; large-scale reinforcement learning (RL) remains aspirational due to the scarcity of trajectory data. An OpenAI Gym–style API is implemented to enable prospective RL/fine-tuning (a minimal sketch of such an interface follows this list).
- Coverage for Multimodal and Closed-Source GUIs: Proprietary GUI automation demands robust screenshot-to-action models and visual grounding—areas addressed in contemporary frameworks such as CogAgent and SeeClick.
- Evaluation Robustness: Subtask success is inferred via OS state deltas (pre/post-snapshot comparison); lacking ground-truth, this inference is susceptible to brittleness, necessitating heuristic or LLM-augmented validation.
- Safety and Interpretability: Requirements for action-level justification and side-effect defense remain partially realized; integrating critic-based natural-language rationales and safety constraints is an ongoing direction.
- Personalization and Lifelong Learning: Scaling user profile adaptation and workflow personalization to millions of unique users poses challenges in memory architecture and retrieval.
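To make the Gym-style hook and the state-delta evaluation heuristic concrete, here is a deliberately minimal sketch under stated assumptions: the snapshot contents, single-step episodes, and reward rule are illustrative choices, not the framework's released environment API.

```python
import os
import subprocess

def snapshot(root: str = ".") -> dict:
    """Symbolic OS snapshot for pre/post comparison: file-tree mtimes plus CWD (illustrative)."""
    files = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            files[path] = os.path.getmtime(path)
    return {"cwd": os.getcwd(), "files": files}

class OSTaskEnv:
    """Gym-style wrapper: reset() captures a state snapshot; step() runs a shell action
    and infers reward from the pre/post state delta (the brittle heuristic noted above)."""

    def reset(self) -> dict:
        self.state = snapshot()
        return self.state

    def step(self, action: str):
        before = self.state
        result = subprocess.run(action, shell=True, capture_output=True, text=True)
        after = snapshot()
        delta = set(after["files"].items()) ^ set(before["files"].items())
        # Heuristic: positive reward only if the command succeeded and the OS state actually changed.
        reward = 1.0 if (result.returncode == 0 and delta) else -1.0
        self.state = after
        done = True  # single-step episodes for illustration
        info = {"stdout": result.stdout, "stderr": result.stderr, "changed_files": len(delta)}
        return after, reward, done, info
```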
This suggests that advances in memory consolidation/retrieval, multimodal control APIs, and scalable RL will be central to future general-purpose OS agents. The collaborative, memory-augmented architecture demonstrated by OS-Copilot provides foundational infrastructure for these trajectories (Wu et al., 12 Feb 2024).