Hybrid GUI+API Action Space
- Hybrid GUI+API action space is a dual-modality framework that combines low-level GUI interactions with high-level API commands to control software environments.
- The framework enables agents to dynamically select between direct GUI actions and robust API calls, leveraging both modalities’ strengths for improved task performance.
- Empirical benchmarks demonstrate significant improvements in success rates and step efficiency when integrating GUI primitives with API operations in agent architectures.
A hybrid GUI+API action space refers to an agent action paradigm in which both low-level, visually grounded GUI primitives (such as clicks, keypresses, and drags over screen elements) and high-level, programmatic API (or tool) calls are simultaneously available. This design allows intelligent agents—typically those orchestrated by LLMs or RL agents—to dynamically determine, at each step of interacting with a software environment, whether to issue a primitive direct GUI action or a native API command, leveraging the strengths of both modalities. Hybrid action spaces are increasingly central in computer-using agent frameworks and are recognized for combining the reliability and efficiency of APIs with the generality and universality of GUI-level control.
1. Formal Action Space Definition
The hybrid GUI+API space is characterized by the union of two disjoint action sets, $\mathcal{A} = \mathcal{A}_{\text{GUI}} \cup \mathcal{A}_{\text{API}}$:
- $\mathcal{A}_{\text{GUI}}$ consists of low-level interface actions such as
$$\mathcal{A}_{\text{GUI}} = \{\texttt{click}(c),\ \texttt{type}(c, s),\ \texttt{hotkey}(k) \mid c \in \mathcal{C},\ s \in \mathcal{S},\ k \in \mathcal{K}\},$$
where $\mathcal{C}$ is the set of detected UI controls, $\mathcal{S}$ is the set of strings, and $\mathcal{K}$ is the set of keyboard shortcuts.
- $\mathcal{A}_{\text{API}}$ is the set of registered, high-level application-native API operations:
$$\mathcal{A}_{\text{API}} = \{\texttt{api}_i(\text{args}_i)\}_{i=1}^{N}.$$
At each time $t$, the agent, observing system state $s_t$, selects $a_t \in \mathcal{A}$ to induce a transition $s_{t+1} = \delta(s_t, a_t)$ (or an error state on failure), as in (Zhang et al., 20 Apr 2025).
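As a concrete illustration, the union of the two action sets can be modeled as a tagged union of action records. This is a minimal sketch with hypothetical names, not the schema of any cited framework:

```python
from dataclasses import dataclass, field
from typing import Union

@dataclass(frozen=True)
class GuiAction:
    """Low-level GUI primitive over a detected control (an element of A_GUI)."""
    kind: str           # e.g. "click", "type", "hotkey"
    control_id: str     # identifier of a detected UI control (element of C)
    argument: str = ""  # text to type, shortcut string, etc.

@dataclass(frozen=True)
class ApiAction:
    """High-level, application-native API call (an element of A_API)."""
    name: str           # registered API function name
    args: tuple = ()    # call arguments

# The hybrid action space A is the union of the two variants.
Action = Union[GuiAction, ApiAction]

def is_gui(a: Action) -> bool:
    """True for GUI primitives, False for API calls."""
    return isinstance(a, GuiAction)
```

At each step the agent emits one `Action`; downstream execution dispatches on the variant.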
This structure generalizes across environments: GUI primitives enable general access (regardless of program coverage), while APIs—when present—serve as robust, efficient alternatives. For example, CoAct-1 extends the hybrid set with a third subset $\mathcal{A}_{\text{code}}$, denoting scripted Python/Bash execution (Song et al., 5 Aug 2025). Mobile agent work (MAS-Bench) incorporates “shortcuts” (API, deep-link, RPA) along with GUI “Tap,” “Swipe,” and navigation primitives (Zhao et al., 8 Sep 2025).
Quantitatively, in benchmarks such as GUI-360, the primitive pool is explicitly cataloged (e.g., 4 GUI primitives and 14 app-specific APIs, totaling 18 distinct function names with parameter schemas) (Mu et al., 6 Nov 2025).
2. System Architectures and Unified Execution Layer
Modern hybrid agents employ modular architectures to support dynamic interleaving of GUI and API actions. UFO2 exemplifies this pattern with its HostAgent–AppAgent–Puppeteer model (Zhang et al., 20 Apr 2025):
- HostAgent orchestrates multi-step task decomposition, agent spawning per application, and maintains a global blackboard.
- AppAgents are application-specific workers that:
- Fuse state perception via both Windows UI Automation (UIA) and vision-based parsing.
- Engage in iterative ReAct loops (LLM plan, execute by Puppeteer, update local FSM).
- Invoke Puppeteer, which exposes a single `Execute(a)` primitive:
  - If the requested semantic action matches a registered API, it issues the API call.
  - Otherwise, it falls back to a GUI action sequence.
Each action is represented by a JSON record specifying application, control identifier (if applicable), action type (API or GUI), and arguments.
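Records in this style might look like the following; the field names are illustrative, not UFO2's exact schema:

```python
import json

# Hypothetical examples of the two record variants described above:
# one native API call and one GUI fallback targeting a grounded control.
api_record = {
    "application": "Excel",
    "control": None,             # no control needed for a native API call
    "action_type": "API",
    "action": "set_cell_value",
    "arguments": {"cell": "B2", "value": 42},
}
gui_record = {
    "application": "Excel",
    "control": "cell_B2",        # grounded control identifier
    "action_type": "GUI",
    "action": "type",
    "arguments": {"text": "42"},
}

# Both variants serialize to plain JSON for the unified execution layer.
wire = json.dumps([api_record, gui_record])
```

Because both variants share one envelope, the executor can dispatch on `action_type` without separate code paths for planning output.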
Both levels maintain minimal finite-state machines (CONTINUE, PENDING, FINISH, FAIL) for safe preemption and recovery. These principles are mirrored in CoAct-1, where the Orchestrator agent delegates subtasks to GUI Operators or Programmer agents (for code execution), and in MCPWorld with a "unified tool-based space" where the LLM planner calls GUI or API actions at each step (Yan et al., 9 Jun 2025, Song et al., 5 Aug 2025).
Architectural Pattern Comparison
| Paradigm | Description | Example Papers |
|---|---|---|
| HostAgent–AppAgent split | System-level orchestrator, app-specialized executors | (Zhang et al., 20 Apr 2025) |
| Orchestrator and Toolset | Registry of both GUI/API “tools” available at planning | (Zhang et al., 14 Mar 2025) |
| ReAct or LLM loop | LLM alternates between GUI and API invocation by prompt | (Song et al., 21 Oct 2024, Yang et al., 20 Oct 2025) |
A central property is that the agent can, for each subtask, select between an API if present (favoring atomicity and efficiency) and a GUI fallback if not, enabling robust cross-application coverage.
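The API-first, GUI-fallback selection rule can be sketched as follows; the registry and fallback planner here are hypothetical stand-ins for the per-paper mechanisms:

```python
def select_action(subtask, api_registry, gui_planner):
    """Prefer a registered API for the subtask; otherwise fall back to GUI.

    api_registry maps subtask names to callables; gui_planner produces a
    list of GUI primitives for arbitrary subtasks.
    """
    api = api_registry.get(subtask)
    if api is not None:
        return ("API", api)                 # atomic, efficient path
    return ("GUI", gui_planner(subtask))    # universal fallback

# Usage: only "save" has a registered API; "resize_chart" falls back
# to a GUI action sequence produced by the planner.
registry = {"save": lambda: "saved"}
planner = lambda task: [("click", "menu"), ("click", task)]
mode, _ = select_action("save", registry, planner)
fallback_mode, steps = select_action("resize_chart", registry, planner)
```

The fallback path is what gives hybrid agents cross-application coverage: any subtask is executable via GUI even when no API exists.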
3. Control and Perception: Hybrid Detection and Grounding
Hybrid agents require cross-modal perception to ground GUI primitives and APIs. UFO2 integrates two detection streams for GUI controls (Zhang et al., 20 Apr 2025):
- UIA Layer: Enumerates standard Windows controls ($C_{\text{UIA}}$) via the accessibility API.
- Vision Layer: Detects visually-drawn (custom) controls ($C_{\text{vision}}$) from screenshots using object detectors (YOLOv8 + Florence-2).
Fusion is achieved by combining $C_{\text{UIA}}$ with non-overlapping vision detections, using IoU-based deduplication above a fixed threshold. The resulting control set is annotated and passed into the LLM for grounding, ensuring generalization across both standard and bespoke UI elements.
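The fusion step can be sketched as a simple IoU filter over bounding boxes; the threshold value below is illustrative, not the paper's exact setting:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def fuse_controls(uia_boxes, vision_boxes, thresh=0.5):
    """Keep every UIA control; add only vision detections that do not
    overlap an already-kept box above the IoU threshold (deduplication)."""
    fused = list(uia_boxes)
    for v in vision_boxes:
        if all(iou(v, u) < thresh for u in fused):
            fused.append(v)
    return fused
```

UIA boxes are kept unconditionally because they come with reliable identifiers; vision detections only fill the gaps left by custom-drawn controls.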
Similarly, MAS-Bench and GUI-360 jointly use screenshot, accessibility, and tree information for grounding (Zhao et al., 8 Sep 2025, Mu et al., 6 Nov 2025). In MCPWorld, hybrid state observations comprise both pixel-level screenshots/UI trees and API return values; the same applies to UltraCUA, which accepts screenshots plus tool signatures per step (Yang et al., 20 Oct 2025).
4. Planning, Scheduling, and Decision Criteria
Efficient utilization of the hybrid space requires task-level planning and robust policies for when to select API over GUI (or vice versa). Several strategies are empirically instantiated:
- Speculative Multi-Action Batching (UFO2): AppAgents predict a batch of actions per LLM call, executing them sequentially as long as GUI control preconditions are satisfied, exiting early if any control is stale. This reduces LLM calls by up to 50% at no significant loss in success rate (Zhang et al., 20 Apr 2025).
- Heuristic-driven Decision (API vs. GUI) (Zhang et al., 14 Mar 2025): The agent scores candidate actions on modeled efficiency, latency, reliability, security risk, and brittleness, favoring an API when its latency falls below a fixed threshold and its reliability exceeds a minimum, and falling back to GUI where API coverage is incomplete.
- Orchestrator Delegation (CoAct-1): An explicit delegation policy is learned, deciding whether a subgoal is best attempted by GUI or code.
- MAS-Bench Shortcut Generation: Agents may generate macros (new shortcut APIs) from prior successful action trajectories by mining frequent subsequences or semantically annotating them with selectors (Zhao et al., 8 Sep 2025).
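The speculative-batching strategy above can be sketched as follows; the precondition check is a hypothetical stand-in for UFO2's control-staleness test:

```python
def execute_batch(actions, precondition_ok, execute):
    """Run a predicted batch of actions sequentially, exiting early the
    first time a GUI precondition (e.g. the target control still exists
    and is visible) no longer holds."""
    done = []
    for a in actions:
        if not precondition_ok(a):
            break  # control went stale; return for a fresh LLM call
        execute(a)
        done.append(a)
    return done

# Usage: the third action's target control has gone stale, so execution
# stops after two actions instead of issuing a broken click.
executed = execute_batch(
    ["click:ok", "type:name", "click:stale", "click:submit"],
    precondition_ok=lambda a: a != "click:stale",
    execute=lambda a: None,
)
```

One LLM call can thus cover several steps, which is where the reported reduction in LLM invocations comes from.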
Decision logic is commonly implemented both in LLM prompts (structured tool schemas, explicit tagging of action types) and in the policy architecture (action-type softmax).
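A heuristic API-vs-GUI rule of the kind described above can be sketched as a scoring function; the weights and thresholds here are illustrative, not taken from the cited paper:

```python
def prefer_api(api_latency_ms, api_reliability, gui_reliability,
               max_latency_ms=500.0, min_reliability=0.9):
    """Hypothetical decision rule: choose the API path only when it is
    fast enough, sufficiently reliable, and at least as reliable as the
    GUI alternative; otherwise fall back to GUI."""
    if api_latency_ms > max_latency_ms:
        return False          # API too slow; GUI wins on latency
    if api_reliability < min_reliability:
        return False          # API below the reliability floor
    return api_reliability >= gui_reliability
```

In practice such thresholds would be tuned per application, and the same comparison can be folded into an LLM prompt as a tool-selection guideline.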
5. Empirical Effects and Quantitative Outcomes
Hybrid action spaces are consistently associated with substantial improvements in agent performance, as demonstrated across several domains:
- Desktop Automation (UFO2, UltraCUA, CoAct-1, ComputerRL)
- UFO2: GUI+API integration increases OSWorld-W office SR from 16.3% (GUI-only) to 24.5% (GUI+API), with average step count dropping from 13.8 to 6.6 (Zhang et al., 20 Apr 2025).
- UltraCUA (OSWorld): Relative gains over GUI-only base of +23% for 7B (27.0% vs 23.4%) and +22.9% for 32B (41.0% vs 33.3%); efficiency gains of 11% fewer steps (Yang et al., 20 Oct 2025).
- CoAct-1: SR improves from 53.1% (GTA-1 baseline) to 60.76% (hybrid), with 33% fewer steps per solved task (Song et al., 5 Aug 2025).
- ComputerRL: GUI-only agent achieves 11.2% SR, hybrid (API+GUI) achieves 26.2%; office domain increases from 6.2% to 27.9% (Lai et al., 19 Aug 2025).
- Mobile Agents (MAS-Bench):
- Single-app tasks: Hybrid agent increases SR from 0.511 (T3A) to 0.641 (MAS-MobileAgent), with mean step ratio dropping from 1.056 to 0.613 and token cost dropping from 346K to 99K (Zhao et al., 8 Sep 2025).
- Cross-app: Hybrid agent improves SR from 0.340 to 0.617.
- Web Automation (Beyond Browsing):
- Browsing-only agent: 14.8% SR; API-only: 29.2%; hybrid: 35.8% (+6.6pp vs API) (Song et al., 21 Oct 2024).
- Ablation results (GUI-360):
- With 19% API use rate, many long-horizon tasks are completed in fewer steps and with higher semantic precision (Mu et al., 6 Nov 2025).
Performance metrics used include Success Rate (SR), Average Steps (AS or MS), action and function accuracy, mean execution time (MET), and tool/call diversity.
6. Implementation, Training Objectives, and Policy Representation
Hybrid action spaces are typically backed by unified policy representations and model heads, which handle both discrete (action type) and continuous (arguments/coordinates/text) facets:
- Hierarchical/Unified Action Policy Heads: Most systems employ a softmax head over action types (GUI primitives and APIs/tools), with per-action argument heads—regression for coordinates, classifiers for element IDs, and text generation for API strings (Mu et al., 6 Nov 2025, Lai et al., 19 Aug 2025).
- Loss Functions: Training combines cross-entropy (action type and argument), status prediction (CONTINUE/FINISH), and, if RL is used, step-level or trajectory-level rewards with normalization and, in some cases, auxiliary rewards for tool invocation (Yang et al., 20 Oct 2025).
- Representation Learning for Hybrid Actions: HyAR introduces an embedding table for discrete actions (GUI events), along with a conditional VAE for continuous parameters (API args), enabling DRL over hybrid spaces with minor architectural adaptation (Li et al., 2021).
- Observational Inputs: Inputs to the policy network generally include screenshots, accessibility data/parsed control trees, prior action/observation tuples, and tool schema lists.
- Efficiency Enhancements: Speculative plan batching (Zhang et al., 20 Apr 2025), shortcut mining (Zhao et al., 8 Sep 2025), and reward shaping for tool use (Yang et al., 20 Oct 2025) are employed for improved efficiency and balanced exploration/exploitation of both action types.
This dual-head or hybrid-policy modeling has been empirically shown to scale to branching factors of roughly 800 candidate actions per step (GUI targets plus API argument combinations), requiring careful sampling and curriculum design.
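A dual-head decode step of this kind can be sketched in a few lines; this is a pure-Python stand-in for the learned heads, which in real systems are neural networks:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def decode_action(type_logits, action_types, argument_heads, features):
    """Pick the highest-probability action type, then decode its
    arguments with the head registered for that type."""
    probs = softmax(type_logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    a_type = action_types[idx]
    return a_type, argument_heads[a_type](features)

# Usage with toy heads: the GUI head regresses coordinates, while the
# API head emits a tool name (both heads are hypothetical).
types = ["gui_click", "api_call"]
heads = {
    "gui_click": lambda f: (f["x"], f["y"]),
    "api_call": lambda f: f["tool"],
}
action, args = decode_action([0.2, 1.5], types, heads,
                             {"x": 3, "y": 4, "tool": "save"})
```

The key structural point is the shared type head with per-type argument heads: one distribution chooses the modality, and a modality-specific decoder produces the arguments.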
7. Limitations, Open Problems, and Benchmarks
Despite the robust empirical gains, several limitations and challenges persist:
- Incomplete API Coverage: Many GUI tasks still lack robust programmatic APIs; brittle GUI actions remain unavoidable (Yang et al., 20 Oct 2025).
- Ambiguity and Context Drift: Hybrid agents can struggle with ambiguous spec, drifting GUI layouts, or outdated control maps, especially for vision-based GUI grounding (Zhang et al., 20 Apr 2025).
- Tool Discovery and Expansion: Automatic extraction of APIs/tools from documentation or codebases, plus dynamic mining of new macro-actions, remains an open research front (Yang et al., 20 Oct 2025, Zhao et al., 8 Sep 2025).
- Benchmarks and Evaluation: Standardized testbeds such as GUI-360, OSWorld, MAS-Bench, and MCPWorld explicitly support hybrid modalities and provide metrics for both per-action and end-to-end agent assessment (Mu et al., 6 Nov 2025, Zhao et al., 8 Sep 2025, Yan et al., 9 Jun 2025).
- Sample Efficiency and RL Training: As seen in UltraCUA and ComputerRL, RL over hybrid spaces remains challenging, with sample efficiency, entropy collapse, and reward signal design demanding further work (Yang et al., 20 Oct 2025, Lai et al., 19 Aug 2025).
- Security and Safety: Programmatic actions (especially code) may have safety implications; isolation and replay are common safeguards (Song et al., 5 Aug 2025).
Current benchmarks consistently demonstrate that providing both GUI and API primitives yields measurable improvements in SR and execution efficiency—affirming the hybrid paradigm as the prevailing foundation for general computer-use agents.