The paper "LiteCUA: Computer as MCP Server for Computer-Use Agent on AIOS" (Mei et al., 24 May 2025) introduces AIOS 1.0, a platform designed to improve the capabilities of computer-use agents (CUAs). The core problem addressed is the semantic gap between how LLMs understand the world and how computer interfaces are structured. Instead of solely focusing on more powerful agent models or frameworks, AIOS 1.0 proposes transforming computers into contextual environments that LLMs can natively comprehend. This is achieved by implementing a Model Context Protocol (MCP) server architecture, which abstracts computer states and actions into LLM-friendly formats, primarily JSON schemas. This approach aims to decouple the complexity of the interface from the complexity of the agent's decision-making process.
To demonstrate the platform's effectiveness, the authors developed LiteCUA, a lightweight CUA built on AIOS 1.0. Despite its simple architecture, LiteCUA achieved a 14.66% success rate on the OSWorld benchmark, outperforming several specialized agent frameworks.
Architecture and System Design of AIOS 1.0
AIOS 1.0 extends the previous AIOS architecture, focusing on contextualizing the computer environment for LLM-based agents.
- Application Layer: Provides SDK APIs for agents to interact with key computer components like the Terminal, Code Editor, Browser, and Document applications. It abstracts common interaction patterns, offering a consistent surface for agents. This layer also includes an MCP Client and an HTTP Client to map the agent's semantic understanding into computer manipulation commands.
- Kernel Layer: The AIOS Kernel is enhanced with a redesigned Tool Manager that incorporates a Virtual Machine (VM) Controller and an MCP Server. This creates a sandboxed environment for safe agent interaction and maintains a consistent semantic mapping between agent intentions and computer operations.
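To make the "LLM-friendly JSON" idea concrete, below is a minimal sketch of how a computer operation could be advertised as a tool by the MCP server. The field names follow the general MCP tool-definition format; the specific tool name, description, and parameters are illustrative, not taken from the paper.

```json
{
  "name": "click",
  "description": "Click a UI element identified by the perception framework",
  "inputSchema": {
    "type": "object",
    "properties": {
      "element_id": {
        "type": "string",
        "description": "Element ID from the A11y-tree representation"
      }
    },
    "required": ["element_id"]
  }
}
```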
Contextualizing Computers as MCP Servers
The central innovation of AIOS 1.0 is making the computer act as an MCP server. This involves:
- Environment Perception Framework: A multi-modal system that captures the computer's state using:
  - Screenshots: for visual information.
  - Accessibility tree (A11y tree): for structural information about on-screen elements.
  - Mechanisms to inspect non-visible state, such as software versions.

  Together, these inputs provide a comprehensive semantic representation of the computing environment.
- Action Space Semantics: The agent's action space is defined by atomic computer operations such as `CLICK`, `SCROLL`, `TYPE`, and `DRAG`. These conceptual actions are translated into specific GUI control signals, primarily via the `pyautogui` library.
- VM Controller: Provides a sandboxed environment for the agent to operate in, preventing irreversible or harmful outcomes. It uses an HTTP interface for standardized communication with the VM, abstracting away low-level operations.
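As an illustration of that HTTP abstraction, a client-side call might look like the sketch below. The host, port, route, and payload fields are assumptions; the paper specifies an HTTP interface but not its wire format.

```python
import requests

def send_action(vm_host: str, action_type: str, params: dict) -> dict:
    """Send one atomic action to the VM controller over HTTP (hypothetical route)."""
    response = requests.post(
        f"http://{vm_host}:8000/action",  # assumed endpoint, not from the paper
        json={"type": action_type, "params": params},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()  # e.g., the resulting observation

# Example: ask the sandboxed VM to click a perceived element
observation = send_action("localhost", "CLICK", {"element_id": "button_01"})
```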
LiteCUA: A Computer-Use Agent on AIOS 1.0
LiteCUA is a demonstration agent built to showcase AIOS 1.0's capabilities.
- Architecture: It employs an orchestrator-worker architecture.
  - Orchestrator: Manages central planning, task decomposition, and progress tracking.
  - Worker Modules: Perform specialized functions:
    - Perceptor: Integrates screenshots and A11y tree data to build a structured, semantic representation of the environment. For example, it might identify a button and its function (e.g., "Type: Button, Usage: Open the chrome").
    - Reasoner: Acts as the cognitive core. It processes the perceived environmental information (textual descriptions and visual data from screenshots) in relation to task objectives, generating both an interpretation of the current state (a "thought") and a plan for the next step (an "action").
    - Actor: Translates the Reasoner's high-level action intentions (e.g., Click, Type) into specific GUI commands executed via `pyautogui` (e.g., `pyautogui.click(100, 100)`).
- Cognitive Cycle: LiteCUA follows a "perceive-reason-act" cycle, systematically processing environmental information to execute tasks.
Implementation and Practical Application
To implement a system like LiteCUA on AIOS 1.0, developers would:
- Set up the AIOS 1.0 Environment: This involves deploying the AIOS kernel with its MCP server capabilities, likely within a VM for sandboxing.
- Develop Agent Modules (Orchestrator, Perceptor, Reasoner, Actor):
- Perceptor Implementation:
  - Capture screenshots (e.g., using OS-level tools or libraries like `mss`).
  - Extract A11y tree information (e.g., using platform-specific accessibility APIs such as UI Automation on Windows, AT-SPI on Linux, or browser extensions for web content).
  - Process these inputs, potentially using a VLM or rule-based systems, to create a structured JSON representation of UI elements and their properties, as envisioned by the MCP:
```json
// Example MCP-like representation for a UI element
{
  "element_id": "button_01",
  "type": "Button",
  "label": "Submit",
  "coordinates": [100, 200, 50, 30],        // x, y, width, height
  "usage_hint": "Submits the current form", // Derived by LLM or heuristics
  "actions_available": ["CLICK"]
}
```
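Upstream of that representation, the raw capture step can be a plain screenshot grab. A minimal sketch using `mss` (one of the libraries mentioned above); the A11y-tree extraction is platform-specific and omitted here:

```python
import mss
import mss.tools

def capture_screenshot(path: str = "screen.png") -> str:
    """Grab the primary monitor and save it as a PNG for the Perceptor."""
    with mss.mss() as sct:
        monitor = sct.monitors[1]  # index 0 is the combined virtual screen
        shot = sct.grab(monitor)
        mss.tools.to_png(shot.rgb, shot.size, output=path)
    return path
```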
- Reasoner Implementation:
  - This module would typically be an LLM (such as GPT-4o, as used in the paper).
  - Input to the LLM:
    - The current task goal.
    - A history of previous actions and observations.
    - The structured environment representation from the Perceptor.
    - The raw screenshot (if the LLM is multimodal).
  - Output from the LLM:
    - A "thought" process explaining its understanding and plan.
    - A specific action to take, chosen from the defined action space (e.g., `CLICK("button_01")`, `TYPE("text_input_01", "hello world")`).
  - Prompting is crucial here. An example prompt structure:
```text
You are a helpful AI assistant trying to accomplish a task on a computer.

Current Goal: {task_description}

Available Actions: CLICK(element_id), TYPE(element_id, text_to_type),
SCROLL(direction, element_id_optional), DRAG(start_element_id, end_element_id), WAIT()

Previous Actions & Observations:
{history}

Current Screen Description (from A11y tree and perception):
{json_environment_representation}

Visual Screenshot: [Attached Screenshot]

Based on the goal and current screen, provide your thought process and the next action in JSON format:
{
  "thought": "I need to click the 'Submit' button to proceed.",
  "action": "CLICK(\"button_01\")"
}
```
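A minimal Reasoner call matching this prompt might look as follows, assuming the OpenAI Python client and GPT-4o (the model used in the paper's experiments); the JSON parsing of the reply is illustrative:

```python
import base64
import json
from openai import OpenAI

client = OpenAI()

def reason(prompt_text: str, screenshot_path: str) -> tuple[str, str]:
    """Send the assembled prompt plus screenshot to GPT-4o; return (thought, action)."""
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # ask for parseable JSON
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt_text},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    decision = json.loads(response.choices[0].message.content)
    return decision["thought"], decision["action"]
```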
- Actor Implementation:
  - Takes the action decided by the Reasoner.
  - Uses a library like `pyautogui` to execute the action on the VM (e.g., mapping `CLICK("button_01")` to `pyautogui.click(x, y)` based on the coordinates from the Perceptor's output):
```python
import pyautogui

def get_element_coordinates(element_id, environment_representation):
    """Return the center point of an element, assuming the Perceptor's output
    is a dict with an "elements" list of entries like the JSON example above."""
    for element in environment_representation["elements"]:
        if element["element_id"] == element_id:
            x, y, width, height = element["coordinates"]
            return x + width // 2, y + height // 2
    raise KeyError(f"Unknown element: {element_id}")

def execute_action(action_command, environment_representation):
    """Parse an action string such as CLICK("button_01") and execute it via pyautogui."""
    action_type = action_command.split("(")[0]
    params_str = action_command[len(action_type) + 1:-1]
    # Naive split: breaks if the typed text itself contains commas
    params = [p.strip().strip('"') for p in params_str.split(",") if p.strip()]

    if action_type == "CLICK":
        x, y = get_element_coordinates(params[0], environment_representation)
        pyautogui.click(x, y)
    elif action_type == "TYPE":
        element_id, text_to_type = params[0], params[1]
        # Click the element first so it has keyboard focus
        x, y = get_element_coordinates(element_id, environment_representation)
        pyautogui.click(x, y)
        pyautogui.typewrite(text_to_type)
    # ... implement SCROLL, DRAG, etc. analogously
```
- Orchestrator Implementation: Manages the loop of perception, reasoning, and acting, keeps track of sub-tasks, and determines if the overall goal is met or if the agent is stuck.
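Tying the pieces together, a skeletal control loop might look like the sketch below; the module callables and the `DONE()` termination convention are assumptions standing in for the components above, not the paper's API:

```python
def run_task(task_description, perceive, reason_fn, execute, max_steps=15):
    """Skeletal perceive-reason-act loop over hypothetical module callables."""
    history = []
    for _ in range(max_steps):
        env = perceive()                         # Perceptor: structured state
        thought, action = reason_fn(task_description, history, env)
        if action == "DONE()":                   # hypothetical completion signal
            return True
        execute(action, env)                     # Actor: GUI execution
        history.append({"thought": thought, "action": action})
    return False                                 # step budget exhausted
```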
Evaluation and Results
- LiteCUA was evaluated on the OSWorld benchmark, achieving a 14.66% success rate.
- This performance was modestly better than standalone LLMs (GPT-4o: 11.21%) and other agent frameworks like Friday (11.11%) and AgentStore (13.55%).
- The results suggest that contextualizing the computer environment via AIOS 1.0 can improve CUA performance even with a simple agent architecture.
- However, the overall success rates (all below 15%) highlight that complex computer tasks remain very challenging for current AI systems.
- Performance breakdown showed LiteCUA performed better on OS-level tasks and VSCode tasks, but struggled significantly with applications like LibreOffice Calc and Thunderbird, likely due to more complex UIs and interaction patterns.
Conclusion and Future Vision
The paper concludes that AIOS 1.0's approach of environmental contextualization (treating the computer as an MCP server) is a promising direction for CUA development. It allows for simpler agent architectures while achieving competitive results. Future work includes:
- Enhancing the perception framework to better understand temporal and causal relationships.
- Developing more sophisticated action space semantics, possibly incorporating probabilistic reasoning.
- Extending the approach to more specialized computing domains.
The overarching vision is to shift from adapting agents to human-designed interfaces towards adapting computer interfaces into environments that AI can more naturally comprehend, as a step towards more general AI systems interacting with the digital world.