The paper "LiteCUA: Computer as MCP Server for Computer-Use Agent on AIOS" (Mei et al., 24 May 2025) introduces AIOS 1.0, a platform designed to improve the capabilities of computer-use agents (CUAs). The core problem addressed is the semantic gap between how LLMs understand the world and how computer interfaces are structured. Instead of solely focusing on more powerful agent models or frameworks, AIOS 1.0 proposes transforming computers into contextual environments that LLMs can natively comprehend. This is achieved by implementing a Model Context Protocol (MCP) server architecture, which abstracts computer states and actions into LLM-friendly formats, primarily JSON schemas. This approach aims to decouple the complexity of the interface from the complexity of the agent's decision-making process.
To demonstrate the platform's effectiveness, the authors developed LiteCUA, a lightweight CUA built on AIOS 1.0. Despite its simple architecture, LiteCUA achieved a 14.66% success rate on the OSWorld benchmark, outperforming several specialized agent frameworks.
Architecture and System Design of AIOS 1.0
AIOS 1.0 extends the previous AIOS architecture, focusing on contextualizing the computer environment for LLM-based agents.
- Application Layer: Provides SDK APIs for agents to interact with key computer components like the Terminal, Code Editor, Browser, and Document applications. It abstracts common interaction patterns, offering a consistent surface for agents. This layer also includes an MCP Client and an HTTP Client to map the agent's semantic understanding into computer manipulation commands.
- Kernel Layer: The AIOS Kernel is enhanced with a redesigned Tool Manager that incorporates a Virtual Machine (VM) Controller and an MCP Server. This creates a sandboxed environment for safe agent interaction and maintains a consistent semantic mapping between agent intentions and computer operations.
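To make the "LLM-friendly JSON" idea concrete, below is a minimal sketch of how a computer operation could be advertised as a tool by the MCP server. The field names follow the general MCP tool-definition format; the specific tool name, description, and parameters are illustrative, not taken from the paper.

```json
{
  "name": "click",
  "description": "Click a UI element identified by the perception framework",
  "inputSchema": {
    "type": "object",
    "properties": {
      "element_id": {
        "type": "string",
        "description": "Element ID from the A11y-tree representation"
      }
    },
    "required": ["element_id"]
  }
}
```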
Contextualizing Computers as MCP Servers
The central innovation of AIOS 1.0 is making the computer act as an MCP server. This involves:
- Environment Perception Framework: A multi-modal system that captures the computer's state using:
  - Screenshots: for visual information.
  - Accessibility tree (A11y tree): for structural information about on-screen elements.
  - Mechanisms to inspect non-visible state, such as software versions.

  Together, these inputs provide a comprehensive semantic representation of the computing environment.
- Action Space Semantics: The agent's action space is defined by atomic computer operations such as `CLICK`, `SCROLL`, `TYPE`, and `DRAG`. These conceptual actions are translated into specific GUI control signals, primarily via the `pyautogui` library.
- VM Controller: Provides a sandboxed environment for the agent to operate in, preventing irreversible or harmful outcomes. It uses an HTTP interface for standardized communication with the VM, abstracting away low-level operations.
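As an illustration of that HTTP abstraction, a client-side call might look like the sketch below. The host, port, route, and payload fields are assumptions; the paper specifies an HTTP interface but not its wire format.

```python
import requests

def send_action(vm_host: str, action_type: str, params: dict) -> dict:
    """Send one atomic action to the VM controller over HTTP (hypothetical route)."""
    response = requests.post(
        f"http://{vm_host}:8000/action",  # assumed endpoint, not from the paper
        json={"type": action_type, "params": params},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()  # e.g., the resulting observation

# Example: ask the sandboxed VM to click a perceived element
observation = send_action("localhost", "CLICK", {"element_id": "button_01"})
```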
LiteCUA: A Computer-Use Agent on AIOS 1.0
LiteCUA is a demonstration agent built to showcase AIOS 1.0's capabilities.
- Architecture: It employs an orchestrator-worker architecture.
  - Orchestrator: Manages central planning, task decomposition, and progress tracking.
  - Worker Modules: Perform specialized functions:
    - Perceptor: Integrates screenshots and A11y tree data to build a structured, semantic representation of the environment. For example, it might identify a button and its function (e.g., "Type: Button, Usage: Open the chrome").
    - Reasoner: Acts as the cognitive core. It processes the perceived environmental information (textual descriptions and visual data from screenshots) in relation to task objectives, generating both an interpretation of the current state (a "thought") and a plan for the next step (an "action").
    - Actor: Translates the Reasoner's high-level action intentions (e.g., Click, Type) into specific GUI commands executed via `pyautogui` (e.g., `pyautogui.click(100, 100)`).
- Cognitive Cycle: LiteCUA follows a "perceive-reason-act" cycle, systematically processing environmental information to execute tasks.
Implementation and Practical Application
To implement a system like LiteCUA on AIOS 1.0, developers would:
- Set up the AIOS 1.0 Environment: This involves deploying the AIOS kernel with its MCP server capabilities, likely within a VM for sandboxing.
- Develop Agent Modules (Orchestrator, Perceptor, Reasoner, Actor):
- Perceptor Implementation:
  - Capture screenshots (e.g., using OS-level tools or libraries like `mss`).
  - Extract A11y tree information (e.g., using platform-specific accessibility APIs such as UI Automation on Windows, AT-SPI on Linux, or browser extensions for web content).
  - Process these inputs, potentially using a VLM or rule-based systems, to create a structured JSON representation of UI elements and their properties, as envisioned by the MCP:
```json
// Example MCP-like representation for a UI element
{
  "element_id": "button_01",
  "type": "Button",
  "label": "Submit",
  "coordinates": [100, 200, 50, 30],        // x, y, width, height
  "usage_hint": "Submits the current form", // Derived by LLM or heuristics
  "actions_available": ["CLICK"]
}
```
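Upstream of that representation, the raw capture step can be a plain screenshot grab. A minimal sketch using `mss` (one of the libraries mentioned above); the A11y-tree extraction is platform-specific and omitted here:

```python
import mss
import mss.tools

def capture_screenshot(path: str = "screen.png") -> str:
    """Grab the primary monitor and save it as a PNG for the Perceptor."""
    with mss.mss() as sct:
        monitor = sct.monitors[1]  # index 0 is the combined virtual screen
        shot = sct.grab(monitor)
        mss.tools.to_png(shot.rgb, shot.size, output=path)
    return path
```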
- Reasoner Implementation:
  - This module would typically be an LLM (such as GPT-4o, as used in the paper).
  - Input to the LLM:
    - The current task goal.
    - A history of previous actions and observations.
    - The structured environment representation from the Perceptor.
    - The raw screenshot (if the LLM is multimodal).
  - Output from the LLM:
    - A "thought" process explaining its understanding and plan.
    - A specific action to take, chosen from the defined action space (e.g., `CLICK("button_01")`, `TYPE("text_input_01", "hello world")`).
  - Prompting is crucial here. An example prompt structure:
```text
You are a helpful AI assistant trying to accomplish a task on a computer.

Current Goal: {task_description}

Available Actions: CLICK(element_id), TYPE(element_id, text_to_type),
SCROLL(direction, element_id_optional), DRAG(start_element_id, end_element_id), WAIT()

Previous Actions & Observations:
{history}

Current Screen Description (from A11y tree and perception):
{json_environment_representation}

Visual Screenshot: [Attached Screenshot]

Based on the goal and current screen, provide your thought process and the next action in JSON format:
{
  "thought": "I need to click the 'Submit' button to proceed.",
  "action": "CLICK(\"button_01\")"
}
```
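A minimal Reasoner call matching this prompt might look as follows, assuming the OpenAI Python client and GPT-4o (the model used in the paper's experiments); the JSON parsing of the reply is illustrative:

```python
import base64
import json
from openai import OpenAI

client = OpenAI()

def reason(prompt_text: str, screenshot_path: str) -> tuple[str, str]:
    """Send the assembled prompt plus screenshot to GPT-4o; return (thought, action)."""
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # ask for parseable JSON
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt_text},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    decision = json.loads(response.choices[0].message.content)
    return decision["thought"], decision["action"]
```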
- Actor Implementation:
  - Takes the action decided by the Reasoner.
  - Uses a library like `pyautogui` to execute the action on the VM (e.g., mapping `CLICK("button_01")` to `pyautogui.click(x, y)` based on the coordinates from the Perceptor's output):
```python
import pyautogui

def get_element_coordinates(element_id, environment_representation):
    """Return the center point of an element, assuming the Perceptor's output
    is a dict with an "elements" list of entries like the JSON example above."""
    for element in environment_representation["elements"]:
        if element["element_id"] == element_id:
            x, y, width, height = element["coordinates"]
            return x + width // 2, y + height // 2
    raise KeyError(f"Unknown element: {element_id}")

def execute_action(action_command, environment_representation):
    """Parse an action string such as CLICK("button_01") and execute it via pyautogui."""
    action_type = action_command.split("(")[0]
    params_str = action_command[len(action_type) + 1:-1]
    # Naive split: breaks if the typed text itself contains commas
    params = [p.strip().strip('"') for p in params_str.split(",") if p.strip()]

    if action_type == "CLICK":
        x, y = get_element_coordinates(params[0], environment_representation)
        pyautogui.click(x, y)
    elif action_type == "TYPE":
        element_id, text_to_type = params[0], params[1]
        # Click the element first so it has keyboard focus
        x, y = get_element_coordinates(element_id, environment_representation)
        pyautogui.click(x, y)
        pyautogui.typewrite(text_to_type)
    # ... implement SCROLL, DRAG, etc. analogously
```
- Orchestrator Implementation: Manages the loop of perception, reasoning, and acting, keeps track of sub-tasks, and determines if the overall goal is met or if the agent is stuck.
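Tying the pieces together, a skeletal control loop might look like the sketch below; the module callables and the `DONE()` termination convention are assumptions standing in for the components above, not the paper's API:

```python
def run_task(task_description, perceive, reason_fn, execute, max_steps=15):
    """Skeletal perceive-reason-act loop over hypothetical module callables."""
    history = []
    for _ in range(max_steps):
        env = perceive()                         # Perceptor: structured state
        thought, action = reason_fn(task_description, history, env)
        if action == "DONE()":                   # hypothetical completion signal
            return True
        execute(action, env)                     # Actor: GUI execution
        history.append({"thought": thought, "action": action})
    return False                                 # step budget exhausted
```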
Evaluation and Results
- LiteCUA was evaluated on the OSWorld benchmark, achieving a 14.66% success rate.
- This performance was modestly better than standalone LLMs (GPT-4o: 11.21%) and other agent frameworks like Friday (11.11%) and AgentStore (13.55%).
- The results suggest that contextualizing the computer environment via AIOS 1.0 can improve CUA performance even with a simple agent architecture.
- However, the overall success rates (all below 15%) highlight that complex computer tasks remain very challenging for current AI systems.
- Performance breakdown showed LiteCUA performed better on OS-level tasks and VSCode tasks, but struggled significantly with applications like LibreOffice Calc and Thunderbird, likely due to more complex UIs and interaction patterns.
Conclusion and Future Vision
The paper concludes that AIOS 1.0's approach of environmental contextualization (treating the computer as an MCP server) is a promising direction for CUA development. It allows for simpler agent architectures while achieving competitive results. Future work includes:
- Enhancing the perception framework to better understand temporal and causal relationships.
- Developing more sophisticated action space semantics, possibly incorporating probabilistic reasoning.
- Extending the approach to more specialized computing domains.
The overarching vision is to shift from adapting agents to human-designed interfaces towards adapting computer interfaces into environments that AI can more naturally comprehend, as a step towards more general AI systems interacting with the digital world.