ScaleCUA: Open-Source Cross-Platform GUI Agents
- ScaleCUA is an open-source initiative that enables autonomous GUI agents through a curated, cross-platform dataset spanning six major operating systems.
- It employs a closed-loop data collection pipeline combining automated agent exploration with expert human annotations to ensure diverse, high-quality data.
- The integrated vision-language models achieve state-of-the-art results on GUI understanding, grounding, and task planning benchmarks, setting new performance standards.
ScaleCUA is an open-source initiative to enable robust, autonomous computer use agents (CUAs) by scaling training over a newly curated, cross-platform GUI dataset. The system brings together a comprehensive corpus across six operating systems and three principal task domains, paired with a closed-loop data collection pipeline integrating automated agents and human experts. The resulting models, trained with vision-language architectures and dataset-specific recipes, deliver high performance on GUI understanding, grounding, and task planning benchmarks. ScaleCUA sets new empirical standards in cross-device generalization and offers its data, models, and code to the public, thus facilitating further research in general-purpose computer use agents (Liu et al., 18 Sep 2025).
1. Dataset Construction and Corpus Statistics
ScaleCUA aggregates a large-scale dataset across six major operating systems: Windows, macOS, Linux, Android, iOS, and Web. The dataset is stratified across three domains: GUI Understanding, GUI Grounding, and Task Planning.
- GUI Understanding: Approximately 471,000 examples are drawn from over 355,000 unique screenshots, targeting element appearance captioning, OCR referring, layout comprehension, functionality descriptions, interface overview, and screen transition analysis.
- GUI Grounding: The corpus contains 17.1 million annotated instances, integrating element-level, action-level, and bounding-box groundings, backed by metadata from A11y Trees, the DOM (Web), XML view data (mobile), and vision-based parsing tools (e.g., OmniParser for Apple platforms).
- Task Planning: This section comprises 19,000 trajectories—15,000 weak semantic traces from automated agent-driven random walks with heuristic pruning, and 4,000 human-curated goal-oriented demonstrations recorded in a unified cross-platform format.
The collection pipeline is hybrid: automated exploration captures diverse interaction states, while expert annotation improves label quality and task diversity. All data are subsequently processed and annotated by advanced vision-language models (notably GPT-4o and Claude-3.7-Sonnet); a hypothetical record layout for the grounding corpus is sketched below.
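To make the corpus structure concrete, a single element-level grounding sample assembled from a screenshot plus accessibility metadata might look like the following; the field names and values are hypothetical illustrations, not the released ScaleCUA schema.

```python
# Hypothetical element-level grounding record; field names are illustrative,
# not the actual ScaleCUA release format.
grounding_sample = {
    "platform": "Web",                    # Windows / macOS / Linux / Android / iOS / Web
    "screenshot": "screens/0001.png",     # raw screenshot the instruction refers to
    "instruction": "Click the blue 'Submit' button below the form",
    "element": {
        "role": "button",                 # taken from the A11y tree / DOM node
        "text": "Submit",
        "bbox": [412, 630, 548, 672],     # pixel-space [x1, y1, x2, y2]
    },
    "metadata_source": "DOM",             # A11y Tree, DOM, XML, or vision parser
}
```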
2. Closed-Loop Data Collection Pipeline
The ScaleCUA pipeline is dual-loop (termed “closed-loop”):
- Agent-Environment Interaction Loop: Automated agents (both rule-driven and VLM-based) navigate the GUI environments to generate diverse trajectories and screenshot collections. Common strategies include random-walk exploration with segmenting and pruning via screenshot similarity metrics.
- Agent-Human Hybrid Acquisition Loop: Human operators conduct platform-agnostic task demonstrations using a unified recording system, producing high-quality reference data to complement agent-generated traces.
This architecture ensures broad state-space coverage and high annotation fidelity, which is especially critical for cross-system generalization; a minimal sketch of the pruned random-walk loop follows.
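The exploration code itself is not reproduced here, but the idea of random-walk collection with screenshot-similarity pruning can be sketched as follows; the `env` interface (screenshot/actions/execute) and the `similarity` metric are assumptions of this illustration, not the actual ScaleCUA tooling.

```python
import random
from typing import Callable, List

def random_walk(env, similarity: Callable, max_steps: int = 50,
                sim_threshold: float = 0.95) -> List[dict]:
    """Random-walk GUI exploration with screenshot-similarity pruning (sketch).

    `env` is assumed to expose screenshot(), actions(), and execute();
    `similarity` is any screenshot-comparison metric in [0, 1], e.g. a
    perceptual-hash match. Neither is the real ScaleCUA interface.
    """
    trajectory = []
    prev = env.screenshot()
    for _ in range(max_steps):
        action = random.choice(env.actions())   # uniform random exploration
        env.execute(action)
        cur = env.screenshot()
        # Keep the step only if the screen changed meaningfully, pruning
        # near-duplicate states from the collected trace.
        if similarity(prev, cur) < sim_threshold:
            trajectory.append({"observation": prev, "action": action})
        prev = cur
    return trajectory
```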
3. Model Architecture, Training, and Agent Paradigms
ScaleCUA models are built on current vision-language architectures, with Qwen2.5-VL as the backbone. Training recipes are tailored to three operating modes:
- Grounding Mode: Screenshot and textual instruction input, predicting and localizing UI elements (coordinates or bounding boxes).
- Direct Action Mode: Generation of immediate executable actions (tap, click, etc.), foregoing explicit reasoning.
- Reasoned Action Mode: Stepwise chain-of-thought reasoning enclosed in <think> tags, followed by structured <operation> and <action> outputs. The agent transition at step $t$ is modeled as
$$a_t = \pi(o_t, h_t), \qquad o_{t+1} = \mathcal{T}(o_t, a_t),$$
where $o_t$ is the raw screenshot or parsed metadata, $h_t$ is the interaction history, and executing $a_t$ yields the environment's response $o_{t+1}$ (a minimal rollout sketch follows this list).
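As a concrete reading of this formulation, the rollout below alternates policy calls and environment responses; `policy` and the `env.reset()/env.step()` interface are placeholders for illustration, not the published ScaleCUA APIs.

```python
def run_episode(env, policy, max_steps: int = 30):
    """Minimal rollout of the step-wise transition above (illustrative only).

    `policy(observation, history)` stands in for the ScaleCUA model; `env` is
    assumed to return a new observation and a done flag after each action.
    """
    history = []              # h_t: prior (observation, action) pairs
    obs = env.reset()         # o_0: initial screenshot or parsed metadata
    for _ in range(max_steps):
        action = policy(obs, history)   # a_t = pi(o_t, h_t)
        history.append((obs, action))
        obs, done = env.step(action)    # o_{t+1} = T(o_t, a_t)
        if done:
            break
    return history
```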
Training uses a maximum token length of 40,960, with the proportion of general multimodal data scaled (25–75%) for larger model variants (3B–32B). The data mixture ratio is tuned to balance generic cross-modal knowledge with GUI-specific expertise.
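The exact mixture schedule is not spelled out beyond the 25–75% range; a toy sampler that interleaves GUI-specific and general multimodal examples at a tunable ratio is sketched below, with dataset handles and the default ratio chosen purely for illustration.

```python
import random

def mixed_samples(gui_data, general_data, general_ratio: float = 0.5):
    """Yield training samples with a tunable share of general multimodal data.

    Illustrative only: gui_data / general_data are any non-empty sequences;
    general_ratio mirrors the 25-75% scaling described for larger variants.
    """
    while True:  # endless stream; the caller decides how many samples to draw
        if random.random() < general_ratio:
            yield random.choice(general_data)   # generic cross-modal knowledge
        else:
            yield random.choice(gui_data)       # GUI-specific supervision
```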
4. Benchmark Results and Performance Evaluation
ScaleCUA models are evaluated on several standardized datasets and challenge benchmarks:
| Benchmark | Task | ScaleCUA Top Score | Relative to Baseline |
| --- | --- | --- | --- |
| MMBench-GUI L1-Hard | GUI Understanding | 94.4% | State-of-the-art |
| OSWorld-G | GUI Grounding | 60.6% | Sets new standard |
| WebArena-Lite-v2 | Task Completion | 47.4% | +26.6 over prior baseline |
| ScreenSpot-Pro | GUI Grounding | +10.7 pts | Above native agents |

Low-parameter models achieve >83% on easier tasks (MMBench-GUI L1-Easy), scaling up to >94% for hard tasks on the largest (32B) backbone. Improvements in grounding accuracy are consistent, with significant gains on cross-platform and cross-device evaluation, indicating strong representation learning and robust generalization.
5. Agent Action Modes and Task Modeling
ScaleCUA’s agent module supports three operational paradigms: grounding, direct action, and reasoned action. In reasoned action, the agent first generates an internal chain-of-thought trace wrapped in <think>...</think> tags, then outputs <operation> and <action> tags describing the intended operation and the executable system command, respectively. This design not only enriches supervision but also enhances interpretability for multi-step and compositional tasks.
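A downstream executor has to recover these fields from the model's text output; a minimal parser for the tag format described above (the exact grammar is an assumption of this sketch) could look like:

```python
import re

def parse_reasoned_action(model_output: str) -> dict:
    """Split a reasoned-action response into its tagged fields (sketch).

    Assumes reasoning is wrapped in <think>...</think> and the structured
    outputs in <operation>...</operation> and <action>...</action>.
    """
    def grab(tag: str) -> str:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", model_output, re.DOTALL)
        return match.group(1).strip() if match else ""

    return {
        "think": grab("think"),          # internal chain-of-thought trace
        "operation": grab("operation"),  # natural-language step description
        "action": grab("action"),        # executable command, e.g. click(x=412, y=630)
    }

# Example: parse a hypothetical response string.
print(parse_reasoned_action(
    "<think>The Submit button sits below the form.</think>"
    "<operation>Click the Submit button</operation>"
    "<action>click(x=412, y=630)</action>"
))
```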
All models are evaluated for both single-step UI manipulation and long-horizon multi-step plans in heterogeneous environments, showing state-of-the-art coverage and accuracy.
6. Implications, Applications, and Future Research
ScaleCUA demonstrates the effectiveness of large-scale, domain-specialized dataset construction in training cross-platform GUI agents. Key implications include:
- Generalization: Cross-system corpora reduce overfitting to native environments, promoting robust, transferable agents for practical software automation.
- Research Foundation: Data, checkpoints, and code are publicly released, establishing a benchmark and resource for subsequent work in agent-based GUI operation, multimodal planning, and hierarchical task reasoning.
- Future Directions: Identified avenues include long-horizon planning, hierarchical memory structures, error recovery for perceptually invariant UI states, and integration of reinforcement learning for policy optimization. Adaptive tuning of data mixtures (GUI-specific vs. general multimodal) is a highlighted area for further scaling as model capacity grows.
7. Open Source and Community Contributions
ScaleCUA is fully open-source; researchers may access the dataset, trained checkpoints, and codebase at https://github.com/OpenGVLab/ScaleCUA. This ensures reproducibility and accelerates collective progress in CUA research, bridging the prior gap between general vision-language modeling and specialized interactive software control.
In conclusion, ScaleCUA advances the field by systematically scaling both data and model architectures for general-purpose computer use agents, achieving state-of-the-art results while establishing a transparent and extensible platform for ongoing research and development (Liu et al., 18 Sep 2025).