OpenCUA Framework Overview
- OpenCUA Framework is an open-source system for developing and evaluating computer-use agents that automate tasks via graphical interfaces across multiple operating systems.
- It employs advanced techniques such as chain-of-thought enrichment, state–action pair encoding, and multi-modal supervision to achieve state-of-the-art performance.
- The framework supports scalable data processing, robust cross-domain generalization, and integration with personalized computer-using agents for dynamic automation.
OpenCUA is an open-source, extensible framework for developing and evaluating Computer-Use Agents (CUAs), vision-language models designed to automate diverse computer tasks by interacting with graphical user interfaces across operating systems and application domains. It provides the research community with a comprehensive toolchain for capturing, processing, training, and benchmarking CUAs, aiming to foster transparent, durable, and replicable advances in agent modeling. OpenCUA leverages large-scale demonstration data, chain-of-thought enrichment, and foundation models to establish state-of-the-art performance among open-source CUA systems, and it can be integrated with Computer-Using Personal Agents (CUPAs) for personalized automation and policy-controlled data access.
1. Framework Architecture and Annotation Infrastructure
The core architecture of OpenCUA consists of three principal components: an annotation infrastructure, the AgentNet dataset, and a scalable pipeline for data processing and model training.
- Annotation Infrastructure: The annotation component centers on the AgentNet Tool, which records natural human-computer interactions unobtrusively on Windows, macOS, and Ubuntu platforms. The tool captures screen recordings, keyboard and mouse actions, and accessibility tree (Axtree) data, ensuring comprehensive contextual coverage. This infrastructure supports both high-fidelity demonstration capture and minimal workflow disruption, serving as a foundation for curating high-quality agent learning trajectories.
- Data Trajectory Generation: Raw signals comprising thousands of UI events are compressed into semantically meaningful “state–action pairs” through an action reduction pipeline. Each trajectory is further augmented with reflective long chain-of-thought (CoT) reasoning, structuring the learning process as a hierarchy from L3 (Observation) → L2 (Reflective Reasoning) → L1 (Concise Action); a minimal sketch of this reduction appears below.
This structured annotation enables robust contextual understanding, explicit internal planning, and error detection, resulting in cohesive executable agent instructions.
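The paper's exact reduction and annotation interfaces are not reproduced here; the sketch below is a minimal illustration, assuming hypothetical `RawEvent` and `StateActionPair` types, of how a burst of low-level mouse events could be collapsed into a single state-action pair whose L3/L2/L1 fields are later filled in by CoT annotation.

```python
from dataclasses import dataclass

# Hypothetical types illustrating the action-reduction and L3/L2/L1 annotation
# described above; field names are assumptions, not the released AgentNet schema.

@dataclass
class RawEvent:
    kind: str           # e.g. "mouse_move", "mouse_down", "mouse_up", "key"
    x: int = 0
    y: int = 0
    key: str = ""

@dataclass
class StateActionPair:
    screenshot: str           # screenshot captured before the action
    l3_observation: str = ""  # L3: description of the current GUI state
    l2_reflection: str = ""   # L2: reflective reasoning and planning
    l1_action: str = ""       # L1: concise executable action, e.g. click(x, y)

def reduce_click(events: list[RawEvent], screenshot: str) -> StateActionPair:
    """Collapse a burst of low-level events (moves plus down/up) into one click action."""
    down = next(e for e in events if e.kind == "mouse_down")
    return StateActionPair(screenshot=screenshot, l1_action=f"click({down.x}, {down.y})")

# Example: many raw UI signals become a single semantic state-action pair,
# whose L3/L2/L1 fields are then filled in by the CoT annotation stages.
raw = [RawEvent("mouse_move", 100, 200), RawEvent("mouse_move", 312, 418),
       RawEvent("mouse_down", 315, 420), RawEvent("mouse_up", 315, 420)]
print(reduce_click(raw, "step_001.png").l1_action)  # click(315, 420)
```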
2. AgentNet Dataset Construction and Multi-Modal Enrichment
The AgentNet dataset is derived from annotated demonstrations and consists of over 22,600 curated computer-use task trajectories spanning more than 200 applications and websites across three operating systems.
- State–Action Pair Encoding: Each entry comprises compact state–action pairs reflecting reduced and semantically organized user interface operations.
- Chain-of-Thought Enrichment: Long CoT annotations are applied, providing explicit reflection on GUI contexts, reasoning steps, and decision points. Specialized modules (reflector, generator, summarizer) capture planning and error correction, enabling richer supervision signals during training (a sketch of these stages follows this list).
- Multi-Modal Supervision: The pipeline incorporates both domain-specific GUI data and general-domain text and vision data to maximize generalization and robustness during supervised fine-tuning of agent models.
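The reflector, generator, and summarizer modules are LLM-backed in the actual pipeline; the stubs below are a minimal sketch of the data flow between them, and every function signature and field name here is an assumption rather than the released interface.

```python
# Stub versions of the CoT-enrichment stages named above (reflector, generator,
# summarizer). In the real pipeline these are LLM-backed; the signatures and
# outputs here are illustrative assumptions only.

def reflector(prev_action: str, screenshot: str) -> str:
    """Judge whether the previous action had the intended effect."""
    return f"The previous action '{prev_action}' appears to have succeeded in {screenshot}."

def generator(task: str, reflection: str, screenshot: str) -> str:
    """Produce the reflective reasoning (L2) that motivates the next action."""
    return (f"Task: '{task}'. {reflection} The export dialog is now visible, "
            f"so the next step is to confirm it.")

def summarizer(actions: list[str]) -> str:
    """Condense the trajectory so far into a short running summary."""
    return f"{len(actions)} step(s) completed: " + "; ".join(actions)

# Enrich one step of a demonstration trajectory with long-CoT supervision fields.
reflection = reflector("click(315, 420)", "step_002.png")
enriched_step = {
    "l3_observation": "An export dialog is open with a 'Save as PDF' option.",
    "l2_reflection": generator("Export the report as PDF", reflection, "step_002.png"),
    "l1_action": "click(88, 42)",
    "summary": summarizer(["click(315, 420)"]),
}
print(enriched_step["l2_reflection"])
```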
3. Benchmarking and Performance Evaluation
Model performance is reported following the evaluation protocols of the OSWorld-Verified benchmark:
- Success Rates: OpenCUA-32B attains an average success rate of 34.8% on OSWorld-Verified under a 100-step test-time budget.
- Comparative Analysis: Relative to proprietary models such as OpenAI CUA (GPT-4o, 31.4% at 100 steps) and Claude 4 Sonnet, OpenCUA-32B establishes a new state-of-the-art for open-source systems and even surpasses proprietary agents in several key evaluations.
- Pass@n Scores: Success rates rise further under Pass@n evaluation, in which a task counts as solved if any of n sampled candidate trajectories succeeds (see the sketch after the table below).
The following table summarizes agent success rates across key benchmarks:
| Model | OSWorld-Verified (100 steps) | Pass@n Improvements |
|---|---|---|
| OpenCUA-32B | 34.8% | Substantial |
| OpenAI CUA | 31.4% | Substantial |
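As a concrete reading of the Pass@n protocol referenced above, the sketch below counts a task as solved if at least one of its n sampled trajectories succeeds; the outcome values are illustrative, not reported results.

```python
# Minimal Pass@n sketch: a task counts as solved if at least one of its n sampled
# trajectories succeeds. The outcomes below are illustrative, not reported results.

def pass_at_n(outcomes: list[list[bool]]) -> float:
    """outcomes[t][i] is True if the i-th sampled trajectory solved task t."""
    solved = sum(1 for task_outcomes in outcomes if any(task_outcomes))
    return solved / len(outcomes)

outcomes = [
    [False, True, False],   # solved on the second attempt
    [False, False, False],  # never solved within n attempts
    [True, True, False],    # solved on the first attempt
]
print(f"Pass@3 = {pass_at_n(outcomes):.2f}")  # Pass@3 = 0.67
```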
4. Generalization, Scalability, and Cross-Domain Robustness
OpenCUA agents demonstrate robust cross-domain generalization and scalability across diverse environments.
- Cross-domain Generalization: Including training data from domains outside the target evaluation environment (e.g., Windows and macOS demonstrations when evaluating on Ubuntu) leads to measurable improvements in downstream agent performance, indicating effective transfer and adaptability across GUI types and workflows.
- Scalability with Data and Computation: Performance scales positively with dataset size and increased test-time computation (step budget), permitting longer action planning horizons and agent self-correction, though marginal utility diminishes at higher budgets. Ablation studies further show the benefits of non-deterministic (nonzero temperature) decoding during inference.
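To make the step-budget and decoding-temperature discussion concrete, here is a minimal sketch of a budgeted rollout loop with non-deterministic sampling; the `DummyEnv` and `DummyPolicy` classes stand in for the benchmark environment and the fine-tuned model, and their interfaces are assumptions rather than OpenCUA's released inference code.

```python
import random

# Stand-ins for the benchmark environment and the fine-tuned agent model; their
# interfaces are assumptions for illustration only.

class DummyEnv:
    def reset(self) -> str:
        return "initial screenshot"
    def step(self, action: str):
        done = random.random() < 0.05         # pretend the task sometimes finishes
        return "next screenshot", done, done  # observation, done, success

class DummyPolicy:
    def generate(self, observation: str, temperature: float = 0.7) -> str:
        # Nonzero temperature makes repeated rollouts explore different action
        # sequences, which is what Pass@n evaluation exploits.
        return f"click({random.randint(0, 1920)}, {random.randint(0, 1080)})"

def rollout(policy, env, max_steps: int = 100, temperature: float = 0.7) -> bool:
    """Run one episode; return True if the task completes within the step budget."""
    observation = env.reset()
    for _ in range(max_steps):
        action = policy.generate(observation, temperature=temperature)
        observation, done, success = env.step(action)
        if done:
            return success
    return False  # budget exhausted without finishing the task

print(rollout(DummyPolicy(), DummyEnv(), max_steps=100))
```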
5. Open Source Contributions and Integration with CUPA Agents
OpenCUA is designed to support open research and methodological advancement in CUA modeling.
- Released Resources: The framework includes public releases of the AgentNet Tool, the full AgentNet dataset, pipeline code (action reduction, CoT synthesis modules, fine-tuning recipes), and pretrained models (e.g., OpenCUA-7B, OpenCUA-32B).
- Interoperability with CUPAs: The OpenCUA framework can be extended to interface with Computer-Using Personal Agents (CUPAs), which allow agents to access user Personal Knowledge Graphs (PKGs) via policy-controlled APIs (Bonatti et al., 31 Jan 2025). This integration enables personalized recommendations, enriched automation (e.g., travel booking using the PKG for frequent flyer numbers and seating preferences), and collaborative task-solving through multi-agent negotiation and risk/reward simulation. Policy enforcement (such as access control based on ODRL semantics) is handled by embedded microservices.
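The CUPA policy layer is described only at a high level above; the following schematic sketch gates access to PKG attributes in an ODRL-like spirit, where the policy structure, attribute names, and the `pkg_read` helper are simplified assumptions rather than the interface described by Bonatti et al.

```python
# Schematic sketch of policy-gated PKG access in an ODRL-like spirit. The policy
# structure, attribute names, and helper below are simplified assumptions, not
# the CUPA API.

POLICY = {
    "permissions": [
        {"assignee": "travel_agent", "action": "read", "target": "pkg:frequent_flyer_number"},
        {"assignee": "travel_agent", "action": "read", "target": "pkg:seating_preference"},
    ]
}

PKG = {
    "pkg:frequent_flyer_number": "AB123456",
    "pkg:seating_preference": "aisle",
    "pkg:home_address": "withheld unless explicitly permitted",
}

def pkg_read(agent: str, target: str) -> str:
    """Return a PKG attribute only if an explicit read permission exists for the agent."""
    allowed = any(
        p["assignee"] == agent and p["action"] == "read" and p["target"] == target
        for p in POLICY["permissions"]
    )
    if not allowed:
        raise PermissionError(f"{agent} is not permitted to read {target}")
    return PKG[target]

print(pkg_read("travel_agent", "pkg:seating_preference"))   # aisle
# pkg_read("travel_agent", "pkg:home_address") would raise PermissionError.
```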
6. Methodological Advances and Identified Limitations
OpenCUA introduces methodological innovations and also highlights current limitations for future work.
- Algorithmic Enhancements: The pipeline employs multi-stage CoT reflection, entity extraction, semantic matching, and supervised fine-tuning on mixed-modality data (a small data-mixing sketch follows this list). For CUPA integration, self-improvement, negotiation, and formal policy-enforcement algorithms are embedded as modular services.
- Limitations: The scalability of AgentNet is constrained by manual annotation requirements. Error propagation remains a challenge in long-horizon tasks, and agent robustness is limited under environmental perturbations, as illustrated by variation in Pass@n success rates. Safety and ethical risks must be addressed, given agent autonomy in executing consequential actions on users’ behalf.
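As a small illustration of the mixed-modality supervised fine-tuning mentioned above, the sketch below composes a training mix from GUI trajectories and general text/vision samples; the mixing ratio and record format are illustrative assumptions, not the released fine-tuning recipe.

```python
import random

def mix_sft_data(gui_samples, general_samples, gui_ratio=0.7, total=1000, seed=0):
    """Draw a fine-tuning mix dominated by GUI trajectories but retaining general data."""
    rng = random.Random(seed)
    n_gui = int(total * gui_ratio)
    mix = rng.choices(gui_samples, k=n_gui) + rng.choices(general_samples, k=total - n_gui)
    rng.shuffle(mix)
    return mix

# Illustrative samples only; a real mix would hold tokenized multi-modal records.
gui = [{"type": "gui_trajectory", "id": i} for i in range(5)]
general = [{"type": "text_vision", "id": i} for i in range(5)]
batch = mix_sft_data(gui, general, gui_ratio=0.7, total=10)
print(sum(s["type"] == "gui_trajectory" for s in batch), "GUI samples out of", len(batch))
```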
7. Future Directions and Impact
OpenCUA establishes an open baseline for CUA development and reproducible evaluation.
- It encourages exploration into advanced error recovery, dynamic multi-modal reasoning, and improved long-horizon planning.
- The open resources provided lower the barriers for experimentation, comparative analysis, and integration with policy-aware personal agent systems.
- This suggests that OpenCUA will catalyze advances in agent robustness, personalized data governance, and collaborative digital task solving, thereby expanding the practical and theoretical scope of computer-use agents in both academic and commercial domains.