
UltraCUA: Unified Hybrid Action Framework

Updated 21 October 2025
  • UltraCUA is a unified framework integrating low-level GUI primitives and high-level tool calls to overcome execution bottlenecks in multimodal agents.
  • It employs a dual-agent system, where a Planner selects action modalities and a Grounder ensures precise GUI operations, reducing error propagation.
  • The framework uses automated tool acquisition and synthetic data generation with supervised and reinforcement learning to enhance cross-domain performance.

UltraCUA denotes a family of approaches and unified model frameworks for high-performance, hybrid-action computer-use agents, specifically foundation models for Computer Use Agents (CUAs) that integrate both low-level GUI primitives and high-level programmatic tool calls. UltraCUA addresses a longstanding bottleneck in multimodal computer-use agents: the reliance on fragile sequences of primitive interactions (click, type, scroll), which historically leads to compounding errors and inefficient execution. UltraCUA achieves improved robustness and efficiency by strategically alternating between direct visual actions and abstracted programmatic interface usage, supported by scalable data generation and a two-stage training framework.

1. Hybrid Action Architecture

UltraCUA is constructed on a hybrid action paradigm, in which the agent dynamically chooses between two categories of actions:

  • Low-level GUI primitives: Atomic visual operations such as click, type, or scroll, actuated via visual grounding and standard multimodal control.
  • High-level programmatic tools: Parameterized, abstracted operations implemented as Python functions with descriptive docstrings, corresponding to multi-step workflows or domain-specific actions (e.g., vscode.set_theme(new_theme)).
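
As an illustration, such a tool is a plain Python function whose docstring doubles as its interface description for the planner. The function body below is a hypothetical sketch following the vscode.set_theme example, not the framework's actual implementation:

```python
def set_theme(new_theme: str) -> str:
    """Switch the VS Code color theme to `new_theme`.

    Replaces the fragile GUI sequence Ctrl+K, Ctrl+T followed by
    list navigation with a single parameterized call.
    """
    # Illustrative stub: a real tool would drive the editor's command
    # palette or settings file; here we only record the intent.
    return f"theme set to {new_theme}"
```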

The action selection is orchestrated by a dual-agent system:

  • The Planner (built on a ReAct-style decision framework) selects the next action modality, employing reasoning and environmental cues.
  • The Grounder specializes in precise object/element localization for GUI primitive execution.

A persistent working memory mechanism (delimited by <memory> tags) records current progress, intermediate values, and execution context, mediating coherent stateful behavior across low-level and high-level action invocations.
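
A minimal sketch of how such a memory block might be parsed out of the model's response text (the tag format follows the description above; the surrounding ACTION line and helper are illustrative assumptions):

```python
import re

MEMORY_RE = re.compile(r"<memory>(.*?)</memory>", re.DOTALL)

def extract_memory(model_output: str) -> str:
    """Pull the persistent working-memory block out of a model response."""
    match = MEMORY_RE.search(model_output)
    return match.group(1).strip() if match else ""

response = (
    "<memory>step 2/5: theme changed; next, update settings.json</memory>\n"
    "ACTION: click(812, 430)"
)
print(extract_memory(response))
```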

In online reinforcement learning (RL), the reward function for a trajectory $\tau$ is formalized as:

$$R(\tau) = R_{\text{env}}(\tau) + R_{\text{tool}}(\tau)$$

where $R_{\text{env}}$ is a sparse environment reward (success: 1; failure: –1), and $R_{\text{tool}}$ is a positive bonus (e.g., 0.3) for successfully incorporating tool calls into solutions. This schema biases training toward adaptive hybrid action strategies.
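
The reward can be written directly as code. The 0.3 bonus follows the example value in the text; representing a trajectory by two booleans is a simplifying assumption:

```python
def trajectory_reward(env_success: bool, used_tool_successfully: bool,
                      tool_bonus: float = 0.3) -> float:
    """R(tau) = R_env(tau) + R_tool(tau).

    R_env is sparse: +1 on task success, -1 on failure.
    R_tool adds a small positive bonus when a high-level tool call
    contributed to the solution, biasing policies toward hybrid actions.
    """
    r_env = 1.0 if env_success else -1.0
    r_tool = tool_bonus if used_tool_successfully else 0.0
    return r_env + r_tool

print(trajectory_reward(True, True))   # 1.3
```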

2. Automated Programmatic Tool Acquisition

A central feature is the automated and scalable pipeline for harvesting high-level tools from diverse programmatic sources:

  • Software Documentation Extraction: Parsing of official or community-provided documentation to enumerate expert workflows (such as keyboard shortcuts) and translate them into function-form tool APIs (e.g., mapping “Ctrl+K, Ctrl+T” to vscode.set_theme).
  • Open-Source Repository Integration: Direct inclusion of tool implementations from established multi-agent and workflow repositories (such as AgentS2 and AgentStore), substituting GUI choreographies with single tool calls when appropriate.
  • Automated Code Generation: Multi-agent coding paradigms (inspired by CoACT-1) are employed, with a coding agent executing and tracing scripts to mine new, reusable tools. Unit tests and reflection validate the correctness and generality of these artifacts.

Through these mechanisms, UltraCUA acquires an extensive toolbox encompassing hundreds of high-quality, domain-targeted operations, vastly extending the agent’s operational vocabulary beyond primitive GUI interactions.
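
A simplified sketch of how documentation-derived workflows could be collected into a callable toolbox; the registry layout, decorator, and entries are illustrative assumptions, not the paper's implementation:

```python
# Each documentation entry maps an expert workflow (e.g., a keyboard
# shortcut) to a callable tool API with a descriptive docstring.
TOOL_REGISTRY: dict = {}

def register_tool(name: str, shortcut: str, description: str):
    """Decorator that records a documentation-derived workflow as a named tool."""
    def wrap(fn):
        TOOL_REGISTRY[name] = {
            "fn": fn, "shortcut": shortcut, "description": description,
        }
        return fn
    return wrap

@register_tool("vscode.set_theme", shortcut="Ctrl+K Ctrl+T",
               description="Open the theme picker and select a theme.")
def set_theme(new_theme: str) -> str:
    return f"theme set to {new_theme}"

print(sorted(TOOL_REGISTRY))
```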

3. Synthetic Data Generation

UltraCUA’s training corpus is synthesized to reflect realistic and verifiable computer-use scenarios, encompassing over 17,000 tasks. The synthetic data engine leverages two complementary methods:

  • Evaluator-First: Starts from state evaluators (e.g., file existence, browser state) and prompts LLMs to devise instructions that realize those states, ensuring unambiguous success criteria.
  • Instruction-First: Initiates from open-ended UI exploration, with instructions generated based on encountered UI situations, yielding diverse and context-sensitive data reflective of practical workflows.

This dual strategy covers both deterministic tasks (with binary verifiability) and open-ended, exploratory multi-step routines, across domains such as office productivity, system utilities, and web browsing.
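
The evaluator-first method can be sketched as: define a binary verifier over the target end state first, then have an LLM generate an instruction whose completion satisfies it. The file-existence verifier below is one of the evaluator types named above; the helper and simulated task completion are illustrative:

```python
import os
import tempfile

def make_file_exists_evaluator(path: str):
    """Evaluator-first: an unambiguous, binary success check over end state."""
    def evaluate() -> bool:
        return os.path.exists(path)
    return evaluate

# An instruction realizing the target state would then be generated
# (e.g., by prompting an LLM); here we simulate the agent completing it.
target = os.path.join(tempfile.mkdtemp(), "report.txt")
evaluate = make_file_exists_evaluator(target)
assert not evaluate()          # target state not yet realized
open(target, "w").close()      # agent performs the task
assert evaluate()              # verifiable, binary success
```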

4. Training Paradigm: Supervised and Reinforcement Learning

UltraCUA is trained in a two-stage sequence:

  • Stage 1 – Supervised Fine-Tuning (SFT): The model is trained on 26.8K curated hybrid trajectories, each containing both low-level GUI and high-level tool actions. The regime gives balanced weight to modality-switch decision points, ensuring that the agent learns both when and how to switch between modalities.
  • Stage 2 – Online RL: A subsequent RL phase optimizes the agent's strategic selection behavior in the synthesized task environment. The reward structure, as formulated above, encourages correct and efficient use of high-level tools and discourages unnecessarily long GUI-only action chains. RL induces exploratory behaviors that reduce both redundant steps and the error accumulation from imprecise visual actions that is intrinsic to purely primitive-action agents.
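
Schematically, the two stages compose as below. ToyAgent, Trajectory, and the update methods are illustrative stand-ins for the real model and optimizers, and the reward terms follow the formulation in Section 1:

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    success: bool
    used_tool: bool

@dataclass
class ToyAgent:
    """Stand-in for the policy model; records what each stage did."""
    sft_steps: int = 0
    rewards: list = field(default_factory=list)

    def fit_step(self, traj): self.sft_steps += 1
    def rollout(self, task): return task   # tasks double as rollouts here
    def policy_update(self, traj, reward): self.rewards.append(reward)

def train(agent, sft_trajectories, rl_tasks, tool_bonus=0.3):
    """Two-stage schedule: SFT on hybrid trajectories, then online RL."""
    for traj in sft_trajectories:          # Stage 1: supervised fine-tuning
        agent.fit_step(traj)
    for task in rl_tasks:                  # Stage 2: online RL
        traj = agent.rollout(task)
        r_env = 1.0 if traj.success else -1.0
        r_tool = tool_bonus if traj.used_tool else 0.0
        agent.policy_update(traj, reward=r_env + r_tool)
    return agent

agent = train(ToyAgent(),
              sft_trajectories=[Trajectory(True, True)] * 3,
              rl_tasks=[Trajectory(True, True), Trajectory(False, False)])
print(agent.rewards)  # [1.3, -1.0]
```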

5. Empirical Performance and Generalization

UltraCUA demonstrates substantial empirical advances as measured on standard and out-of-domain environments:

| Benchmark | UltraCUA-7B/32B | Base Models | Remarks |
| --- | --- | --- | --- |
| OSWorld | +22% avg. improvement | Baseline multimodal | Also 11% fewer execution steps |
| WinAgentArena | 21.7% success rate | Lower | Trained on Ubuntu, tested on Windows |

Performance gains are attributable to both the efficiency (fewer steps) and reliability (higher overall task success rate) conferred by hybrid action. Out-of-domain evaluation—using models trained exclusively on Ubuntu data—exhibits robust cross-OS generalization, suggesting the hybrid action and tool abstraction strategies are not system-specific.

6. Error Propagation and Execution Robustness

UltraCUA’s hybrid paradigm critically mitigates error propagation, an endemic weakness of GUI-only agents:

  • Multi-step tool invocations (e.g., cell value updates, settings changes) encapsulate complex actions, minimizing the number of action-step opportunities for error and obviating fragile sequential visual manipulations.
  • Working memory consistently tracks and exposes intermediate state, reducing the chance of state-misalignment and redundant actions.
  • RL-derived policies reduce unnecessary or inappropriate tool calls by approximately 46%, concentrating tool use on tasks where high-level abstraction aids robustness.

This operational reliability is especially notable in tasks where error in a single action (such as a mis-click) would otherwise irreversibly derail the entire solution trajectory.
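
The compounding effect admits a back-of-the-envelope calculation: if each primitive action succeeds independently with probability p, a GUI-only chain of n steps succeeds with probability p^n, while encapsulating most of the chain in a single tool call shortens it sharply. The step counts and p = 0.95 below are illustrative, not measured values:

```python
def chain_success(p: float, n_steps: int) -> float:
    """Probability that n independent actions all succeed: p ** n."""
    return p ** n_steps

p = 0.95
print(round(chain_success(p, 20), 3))  # ~0.358 for 20 fragile GUI primitives
print(round(chain_success(p, 3), 3))   # ~0.857 for the same task as 3 hybrid steps
```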

7. Broader Impact and Future Directions

UltraCUA establishes a foundational design pattern for future CUAs by demonstrating the viability and empirical advantages of hybrid action selection, automated tool discovery, and unified training on diverse, realistic datasets. The approach facilitates agents that are not only faster and more robust, but also capable of more naturalistic and contextually appropriate computer use in complex real-world settings.

A plausible implication is that, by extending hybrid action frameworks and further refining automated tool mining (potentially leveraging LLM-powered code synthesis and domain adaptation), subsequent generations of computer-use agents may continue to close the gap between LLM multimodal interface control and advanced programmatic automation. Open directions include seamless adaptation to new or evolving software UIs, further error-reduction mechanisms, and the harmonization of hybrid action with additional agentic faculties such as long-horizon planning or real-time collaboration.
