ScaleCUA: Open-Source Cross-Platform GUI Agents
- ScaleCUA is an open-source initiative that enables autonomous GUI agents through a curated, cross-platform dataset spanning six major operating systems.
- It employs a closed-loop data collection pipeline combining automated agent exploration with expert human annotations to ensure diverse, high-quality data.
- The integrated vision-language models achieve state-of-the-art results on GUI understanding, grounding, and task planning benchmarks, setting new performance standards.
ScaleCUA is an open-source initiative to enable robust, autonomous computer use agents (CUAs) by scaling training over a newly curated, cross-platform GUI dataset. The system brings together a comprehensive corpus across six operating systems and three principal task domains, paired with a closed-loop data collection pipeline integrating automated agents and human experts. The resulting models, trained with vision-language architectures and dataset-specific recipes, deliver high performance on GUI understanding, grounding, and task planning benchmarks. ScaleCUA sets new empirical standards in cross-device generalization and offers its data, models, and code to the public, thus facilitating further research in general-purpose computer use agents (Liu et al., 18 Sep 2025).
1. Dataset Construction and Corpus Statistics
ScaleCUA aggregates a large-scale dataset across six major operating systems: Windows, macOS, Linux, Android, iOS, and Web. The dataset is stratified across three domains: GUI Understanding, GUI Grounding, and Task Planning.
- GUI Understanding: Approximately 471,000 examples are drawn from over 355,000 unique screenshots, targeting element appearance captioning, OCR referring, layout comprehension, functionality descriptions, interface overview, and screen transition analysis.
- GUI Grounding: The corpus contains 17.1 million annotated instances, integrating element-level, action-level, and bounding-box groundings, backed by metadata from A11y Trees, the DOM (Web), XML view data (mobile), and vision-based parsing tools (e.g., OmniParser for Apple platforms).
- Task Planning: This section comprises 19,000 trajectories—15,000 weak semantic traces from automated agent-driven random walks with heuristic pruning, and 4,000 human-curated goal-oriented demonstrations recorded in a unified cross-platform format.
The collection pipeline is hybrid: automated exploration captures diverse interaction states, while expert annotation improves label quality and task diversity. All data are subsequently processed and annotated by advanced vision-language models (notably GPT-4o and Claude-3.7-Sonnet); a hypothetical record layout for the grounding corpus is sketched below.
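To make the corpus structure concrete, a single element-level grounding sample assembled from a screenshot plus accessibility metadata might look like the following; the field names and values are hypothetical illustrations, not the released ScaleCUA schema.

```python
# Hypothetical element-level grounding record; field names are illustrative,
# not the actual ScaleCUA release format.
grounding_sample = {
    "platform": "Web",                    # Windows / macOS / Linux / Android / iOS / Web
    "screenshot": "screens/0001.png",     # raw screenshot the instruction refers to
    "instruction": "Click the blue 'Submit' button below the form",
    "element": {
        "role": "button",                 # taken from the A11y tree / DOM node
        "text": "Submit",
        "bbox": [412, 630, 548, 672],     # pixel-space [x1, y1, x2, y2]
    },
    "metadata_source": "DOM",             # A11y Tree, DOM, XML, or vision parser
}
```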
2. Closed-Loop Data Collection Pipeline
The ScaleCUA pipeline is dual-loop (termed “closed-loop”):
- Agent-Environment Interaction Loop: Automated agents (both rule-driven and VLM-based) navigate the GUI environments to generate diverse trajectories and screenshot collections. Common strategies include random-walk exploration with segmenting and pruning via screenshot similarity metrics.
- Agent-Human Hybrid Acquisition Loop: Human operators conduct platform-agnostic task demonstrations using a unified recording system, producing high-quality reference data to complement agent-generated traces.
This architecture ensures broad state-space coverage and high annotation fidelity, which is especially critical for cross-system generalization; a minimal sketch of the pruned random-walk loop follows.
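The exploration code itself is not reproduced here, but the idea of random-walk collection with screenshot-similarity pruning can be sketched as follows; the `env` interface (screenshot/actions/execute) and the `similarity` metric are assumptions of this illustration, not the actual ScaleCUA tooling.

```python
import random
from typing import Callable, List

def random_walk(env, similarity: Callable, max_steps: int = 50,
                sim_threshold: float = 0.95) -> List[dict]:
    """Random-walk GUI exploration with screenshot-similarity pruning (sketch).

    `env` is assumed to expose screenshot(), actions(), and execute();
    `similarity` is any screenshot-comparison metric in [0, 1], e.g. a
    perceptual-hash match. Neither is the real ScaleCUA interface.
    """
    trajectory = []
    prev = env.screenshot()
    for _ in range(max_steps):
        action = random.choice(env.actions())   # uniform random exploration
        env.execute(action)
        cur = env.screenshot()
        # Keep the step only if the screen changed meaningfully, pruning
        # near-duplicate states from the collected trace.
        if similarity(prev, cur) < sim_threshold:
            trajectory.append({"observation": prev, "action": action})
        prev = cur
    return trajectory
```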
3. Model Architecture, Training, and Agent Paradigms
ScaleCUA models are built on current vision-language architectures, with Qwen2.5-VL as the backbone. Training recipes are tailored to three operating modes:
- Grounding Mode: Screenshot and textual instruction input, predicting and localizing UI elements (coordinates or bounding boxes).
- Direct Action Mode: Generation of immediate executable actions (tap, click, etc.), foregoing explicit reasoning.
- Reasoned Action Mode: Stepwise chain-of-thought reasoning enclosed in <think> tags, followed by structured <operation> and <action> outputs. The agent transition at step $t$ is modeled as
$$a_t = \pi(o_t, h_t), \qquad o_{t+1} = \mathcal{T}(o_t, a_t),$$
where $o_t$ is the raw screenshot or parsed metadata, $h_t$ is the interaction history, and executing $a_t$ yields the environment's response $o_{t+1}$ (a minimal rollout sketch follows this list).
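As a concrete reading of this formulation, the rollout below alternates policy calls and environment responses; `policy` and the `env.reset()/env.step()` interface are placeholders for illustration, not the published ScaleCUA APIs.

```python
def run_episode(env, policy, max_steps: int = 30):
    """Minimal rollout of the step-wise transition above (illustrative only).

    `policy(observation, history)` stands in for the ScaleCUA model; `env` is
    assumed to return a new observation and a done flag after each action.
    """
    history = []              # h_t: prior (observation, action) pairs
    obs = env.reset()         # o_0: initial screenshot or parsed metadata
    for _ in range(max_steps):
        action = policy(obs, history)   # a_t = pi(o_t, h_t)
        history.append((obs, action))
        obs, done = env.step(action)    # o_{t+1} = T(o_t, a_t)
        if done:
            break
    return history
```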
Training uses a maximum token length of 40,960, with the proportion of general multimodal data scaled (25–75%) for larger model variants (3B–32B). The data mixture ratio is tuned to balance generic cross-modal knowledge with GUI-specific expertise.
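The exact mixture schedule is not spelled out beyond the 25–75% range; a toy sampler that interleaves GUI-specific and general multimodal examples at a tunable ratio is sketched below, with dataset handles and the default ratio chosen purely for illustration.

```python
import random

def mixed_samples(gui_data, general_data, general_ratio: float = 0.5):
    """Yield training samples with a tunable share of general multimodal data.

    Illustrative only: gui_data / general_data are any non-empty sequences;
    general_ratio mirrors the 25-75% scaling described for larger variants.
    """
    while True:  # endless stream; the caller decides how many samples to draw
        if random.random() < general_ratio:
            yield random.choice(general_data)   # generic cross-modal knowledge
        else:
            yield random.choice(gui_data)       # GUI-specific supervision
```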
4. Benchmark Results and Performance Evaluation
ScaleCUA models are evaluated on several standardized datasets and challenge benchmarks:
| Benchmark | Task | ScaleCUA Top Score | Relative to Baseline |
| --- | --- | --- | --- |
| MMBench-GUI L1-Hard | GUI Understanding | 94.4% | State-of-the-art |
| OSWorld-G | GUI Grounding | 60.6% | Sets new standard |
| WebArena-Lite-v2 | Task Completion | 47.4% | +26.6 over prior baseline |
| ScreenSpot-Pro | GUI Grounding | +10.7 pts | Above native agents |

Low-parameter models achieve >83% on easier tasks (MMBench-GUI L1-Easy), scaling up to >94% for hard tasks on the largest (32B) backbone. Improvements in grounding accuracy are consistent, with significant gains on cross-platform and cross-device evaluation, indicating strong representation learning and robust generalization.
5. Agent Action Modes and Task Modeling
ScaleCUA’s agent module supports three operational paradigms: grounding, direct action, and reasoned action. In reasoned action, the agent first generates an internal chain-of-thought trace wrapped in <think>...</think> tags, then outputs <operation> and <action> tags describing the intended operation and the executable system command, respectively. This design not only enriches supervision but also enhances interpretability for multi-step and compositional tasks.
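A downstream executor has to recover these fields from the model's text output; a minimal parser for the tag format described above (the exact grammar is an assumption of this sketch) could look like:

```python
import re

def parse_reasoned_action(model_output: str) -> dict:
    """Split a reasoned-action response into its tagged fields (sketch).

    Assumes reasoning is wrapped in <think>...</think> and the structured
    outputs in <operation>...</operation> and <action>...</action>.
    """
    def grab(tag: str) -> str:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", model_output, re.DOTALL)
        return match.group(1).strip() if match else ""

    return {
        "think": grab("think"),          # internal chain-of-thought trace
        "operation": grab("operation"),  # natural-language step description
        "action": grab("action"),        # executable command, e.g. click(x=412, y=630)
    }

# Example: parse a hypothetical response string.
print(parse_reasoned_action(
    "<think>The Submit button sits below the form.</think>"
    "<operation>Click the Submit button</operation>"
    "<action>click(x=412, y=630)</action>"
))
```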
All models are evaluated for both single-step UI manipulation and long-horizon multi-step plans in heterogeneous environments, showing state-of-the-art coverage and accuracy.
6. Implications, Applications, and Future Research
ScaleCUA demonstrates the effectiveness of large-scale, domain-specialized dataset construction in training cross-platform GUI agents. Key implications include:
- Generalization: Cross-system corpora reduce overfitting to native environments, promoting robust, transferable agents for practical software automation.
- Research Foundation: Data, checkpoints, and code are publicly released, establishing a benchmark and resource for subsequent work in agent-based GUI operation, multimodal planning, and hierarchical task reasoning.
- Future Directions: Identified avenues include long-horizon planning, hierarchical memory structures, error recovery for perceptually invariant UI states, and integration of reinforcement learning for policy optimization. Adaptive tuning of data mixtures (GUI-specific vs. general multimodal) is a highlighted area for further scaling as model capacity grows.
7. Open Source and Community Contributions
ScaleCUA is fully open-source; researchers may access the dataset, trained checkpoints, and codebase at https://github.com/OpenGVLab/ScaleCUA. This ensures reproducibility and accelerates collective progress in CUA research, bridging the prior gap between general vision-language modeling and specialized interactive software control.
In conclusion, ScaleCUA advances the field by systematically scaling both data and model architectures for general-purpose computer use agents, achieving state-of-the-art results while establishing a transparent and extensible platform for ongoing research and development (Liu et al., 18 Sep 2025).