
Interoperable GUIs: Frameworks & Benchmarks

Updated 7 February 2026
  • Interoperable GUIs are systems that enable seamless cross-application control by decoupling semantic intent from platform-specific mechanisms through formal models and declarative primitives.
  • They leverage modular designs and standardized schemas to integrate diverse applications, allowing for automation and transferability across different platforms and versions.
  • Benchmarks like TransBench quantify key metrics such as grounding accuracy and mean point-to-center distance, ensuring reliable performance in agent-driven GUI interactions.

Interoperable graphical user interfaces (GUIs) are systems, abstractions, or toolkits designed to enable seamless operation, control, or automation of multiple existing GUI-based applications, devices, or platforms by humans and/or agents, without requiring application-specific modifications or bindings. They address the fundamental challenge that mainstream GUIs were predominantly engineered for direct human usage, rather than for programmatic, multi-platform, or cross-modal control, thereby impeding integration, automation, and transferability across the software, hardware, and agentic ecosystem.

1. Foundational Abstractions and Formal Models

Modern interoperability in GUIs is underpinned by formal models that abstract away platform-specific interaction details and emphasize semantic intent. The Goal-Oriented Interface (GOI) framework models a GUI application as a state machine $\mathcal{G} = (S, C, \delta)$, where $S$ is the set of UI states (such as accessibility trees), $C$ enumerates all controls (buttons, menus, fields), and $\delta$ is the transition map under atomic actions (click, keypress, drag) (Wang et al., 6 Oct 2025). GOI exposes three cross-applicable declarative primitives, sketched in code after the list:

  • Access$(c)$: effecting visibility and activation of a designated control $c$ (a unique identifier in $C$) via a deterministic navigation and click sequence.
  • State$(c, s_{\text{target}})$: coercing control $c$ into a prescribed end-state $s_{\text{target}}$ through application of platform-native control patterns indicated by UI metadata.
  • Observation$(c)$: extracting and returning structured data from control $c$ (e.g., text value, selections, or expanded content).
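A minimal sketch of how these primitives can be represented as declarative data types that a policy layer emits and a mechanism layer consumes; the class names, control-identifier format, and example plan are illustrative assumptions, not GOI's actual API.

```python
from dataclasses import dataclass
from typing import Any, Union

# Illustrative control identifier: in GOI, controls are unique IDs drawn from C.
ControlId = str

@dataclass(frozen=True)
class Access:
    """Make control `c` visible and activated via a deterministic
    navigation-and-click sequence (resolved by the mechanism layer)."""
    c: ControlId

@dataclass(frozen=True)
class State:
    """Coerce control `c` into the prescribed end-state `s_target`
    using platform-native control patterns."""
    c: ControlId
    s_target: Any  # e.g., "checked", "expanded", or a text value

@dataclass(frozen=True)
class Observation:
    """Extract structured data (text, selection, expanded content) from `c`."""
    c: ControlId

Primitive = Union[Access, State, Observation]

# A policy layer would emit a plan as a sequence of primitives, e.g.:
plan: list[Primitive] = [
    Access("menu.file.save_as"),
    State("dialog.save_as.filename", "report.docx"),
    Observation("dialog.save_as.status"),
]
```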

AutoGLM similarly adopts an intermediate interface: each GUI screen is linearized into a list of elements $e_i = (\texttt{id}_i, \texttt{text}_i, \texttt{bbox}_i, \texttt{attrs}_i)$, decoupling the agent's planning from low-level execution (Liu et al., 2024). GUI agents encode user intent as natural language goals, which are then mapped to high-level, abstract action/element references understood consistently across heterogeneous GUI backends.
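A sketch of what such a linearization step might look like; the record fields follow the tuple above, but the serialization format and example elements are assumptions rather than AutoGLM's actual intermediate schema.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class ScreenElement:
    # One linearized GUI element e_i = (id_i, text_i, bbox_i, attrs_i).
    id: int
    text: str
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) in screen pixels
    attrs: dict[str, str]                    # e.g., {"role": "button", "enabled": "true"}

def linearize(elements: list[ScreenElement]) -> str:
    """Serialize the screen into a platform-agnostic list the planner can reason over."""
    return json.dumps([asdict(e) for e in elements], ensure_ascii=False)

screen = [
    ScreenElement(0, "Compose", (24, 880, 160, 940), {"role": "button", "enabled": "true"}),
    ScreenElement(1, "Search mail", (24, 40, 700, 90), {"role": "textfield", "enabled": "true"}),
]
prompt_context = linearize(screen)  # fed to the planner alongside the natural-language goal
```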

These abstractions allow for the definition of interoperability as transferability—the ability for agents or toolkits to preserve intended behavior across versions, platforms, or applications, as formalized and quantified in the TransBench benchmark (Lu et al., 23 May 2025).

2. Architectures and Mechanism-Policy Separation

A primary engineering tenet is the strict separation of policy (semantic reasoning and planning) from mechanism (concrete application navigation and interaction):

  • The policy layer, typically embodied by an LLM or multimodal model, operates exclusively on semantic constructs: control/object identifiers, desired states, and user intent. It computes abstract plans or sequences of high-level, declarative commands (e.g., $\{\texttt{Access}(c_i), \texttt{State}(c_j, s_j), \texttt{Observation}(c_k)\}$ in GOI).
  • The mechanism layer is responsible for mapping abstract commands onto deterministic, platform-specific UI actions, using, for example, in-memory navigation forests built from offline GUI ripping (GOI), platform accessibility APIs, or device-specific drivers (WebDriver, Android AccessibilityService); a minimal resolver sketch follows this list.
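The following sketch shows how a mechanism layer might expand a declarative Access command into a concrete click sequence using a navigation forest; the forest structure and control identifiers are assumptions for illustration, not GOI's implementation.

```python
# Hypothetical navigation forest: each control maps to the parent control that
# reveals it and the atomic action that performs the reveal (assumed structure,
# not GOI's actual representation).
NAV_FOREST: dict[str, tuple[str | None, str]] = {
    "menu.file":               (None,                 "click"),
    "menu.file.save_as":       ("menu.file",          "click"),
    "dialog.save_as.filename": ("menu.file.save_as",  "click"),
}

def resolve_access(target: str) -> list[tuple[str, str]]:
    """Expand a declarative Access(target) into a deterministic click sequence
    by walking parent links from the target up to a root, then reversing."""
    steps: list[tuple[str, str]] = []
    node: str | None = target
    while node is not None:
        parent, action = NAV_FOREST[node]
        steps.append((node, action))
        node = parent
    return list(reversed(steps))  # root-first sequence of (control, action)

# resolve_access("dialog.save_as.filename")
# -> [("menu.file", "click"), ("menu.file.save_as", "click"),
#     ("dialog.save_as.filename", "click")]
```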

AutoGLM rigidly enforces this split through its intermediate interface, which defines a serialization protocol for actions and elements that is agnostic to both the agent policy and the underlying GUI substrate (Liu et al., 2024).

Mechanism Summaries

| Framework | Policy Layer | Mechanism Layer |
| --- | --- | --- |
| GOI | LLM plans over control IDs/states | Forest-based resolver, UIA executor |
| AutoGLM | LLM/MM planner outputs abstract commands over screen elements | Platform adapter with grounder module (matches references to pixel/DOM locations) |

This architecture both abstracts and insulates high-level reasoning from per-platform GUI idiosyncrasies, increasing resilience to layout or control changes.
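As a rough illustration of the grounder module listed in the table, the sketch below (reusing the ScreenElement record from Section 1) resolves an abstract element reference to a concrete click point by text matching; real grounders are learned models, and this heuristic is only a stand-in.

```python
def ground(reference_text: str, elements: list[ScreenElement]) -> tuple[float, float] | None:
    """Map an abstract element reference (here: its visible text) to a concrete
    click point, a simplified stand-in for a learned grounder module."""
    matches = [e for e in elements if reference_text.lower() in e.text.lower()]
    if not matches:
        return None
    # Prefer the top-most, left-most match as a crude tie-breaker.
    best = min(matches, key=lambda e: (e.bbox[1], e.bbox[0]))
    x0, y0, x1, y1 = best.bbox
    return ((x0 + x1) / 2, (y0 + y1) / 2)  # click the element's center
```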

3. Practical Toolkits and System Implementations

Interoperable GUI systems manifest as agentic platforms, extensible front-ends, web-based UIs, and code-GUI fusion environments.

  • GOI: Enables high-level, declarative LLM interaction with Windows applications via UI Automation (UIA), reporting a 67% increase in task success rate and a 43.5% reduction in interaction steps relative to prior baselines (Wang et al., 6 Oct 2025).
  • AutoGLM: Deploys on both browsers and Android with unmodified policy weights owing to its intermediate schema; demonstrates 55.2–96.2% end-to-end task completion rates on demanding multi-app web/mobile benchmarks (Liu et al., 2024).
  • ROS-based interface: A modular, web-based GUI leveraging ROSBridge and roslibjs, whose plugin architecture allows any ROS topic/service/action to be exposed with declarative manifests; new widgets or system features are added absent code changes in the core (Fresnillo et al., 2024).
  • GuiLiner: Java/Swing launcher driven by XML descriptors, converting CLI scientific tools into cross-platform GUIs with zero per-tool code (0806.0314).
  • EasyInterface: Web- and Eclipse-based IDEs, powered by a compact XML-based server configuration and declarative output language, automatically accommodate arbitrary CLI analysis or simulation tools in a shared GUI ecosystem (Doménech et al., 2017).
  • mage: Jupyter API supporting bidirectional synchronization of GUI widget states and code cells, enabling reproducible, interchangeable workflow between interactive graphical and script-based tool usage (Kery et al., 2020).

These systems fundamentally rely on (a) declarative specifications, (b) plugin manifests or schemas, and (c) standardized communication protocols (JSON, XML, WebSocket, etc.) to mediate interoperability.
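As one concrete illustration of (c), a front-end can drive a ROS backend purely through rosbridge-style JSON messages over a WebSocket; the topic, message type, and port below are assumptions for this sketch rather than details taken from the cited system.

```python
import asyncio
import json
import websockets  # pip install websockets

async def publish_cmd_vel(linear_x: float) -> None:
    # rosbridge-style JSON envelopes: the GUI never links against ROS directly,
    # it only speaks the standardized wire protocol.
    advertise = {"op": "advertise", "topic": "/cmd_vel",  # assumed topic for the sketch
                 "type": "geometry_msgs/Twist"}
    publish = {
        "op": "publish",
        "topic": "/cmd_vel",
        "msg": {"linear": {"x": linear_x, "y": 0.0, "z": 0.0},
                "angular": {"x": 0.0, "y": 0.0, "z": 0.0}},
    }
    async with websockets.connect("ws://localhost:9090") as ws:  # default rosbridge port
        await ws.send(json.dumps(advertise))
        await ws.send(json.dumps(publish))

# asyncio.run(publish_cmd_vel(0.2))
```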

4. Transferability: Metrics, Benchmarks, and Cross-Platform Mapping

TransBench introduces formal metrics for GUI grounding accuracy and transfer—the probability that an agent's predicted action is correct under evolving GUIs (Lu et al., 23 May 2025). It distinguishes three transfer dimensions:

  • Cross-version: Robustness to UI changes between application versions (new layouts, relocated controls).
  • Cross-platform: Generalization across OS boundaries, e.g., Android, iOS, Web, each with distinct widget/toolkit sets.
  • Cross-application: Reuse of learned groundings and intent-action mappings across functionally overlapping or divergent apps.

Two principal metrics are used:

  1. Grounding accuracy ($\mathrm{Acc}_{\mathrm{ground}}$): the fraction of agent-predicted click coordinates that fall inside the correct GUI control's bounding box.
  2. Mean point-to-center distance ($D$): the Euclidean distance from the predicted coordinate to the canonical center of the target box, normalized to screen dimensions. Both metrics are illustrated in the sketch below.
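A plausible reading of the two metrics in code; the exact normalization and tie-breaking used by TransBench may differ.

```python
import math

def grounding_accuracy(preds, boxes) -> float:
    """Fraction of predicted click points (x, y) falling inside their
    target bounding boxes (x0, y0, x1, y1)."""
    hits = sum(1 for (x, y), (x0, y0, x1, y1) in zip(preds, boxes)
               if x0 <= x <= x1 and y0 <= y <= y1)
    return hits / len(preds)

def mean_center_distance(preds, boxes, screen_w, screen_h) -> float:
    """Mean Euclidean distance from each prediction to the target box center,
    with both axes normalized by the screen dimensions."""
    total = 0.0
    for (x, y), (x0, y0, x1, y1) in zip(preds, boxes):
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        total += math.hypot((x - cx) / screen_w, (y - cy) / screen_h)
    return total / len(preds)

# Example: one prediction lands inside its box, one misses.
preds = [(120, 260), (400, 90)]
boxes = [(100, 240, 180, 300), (10, 10, 60, 40)]
print(grounding_accuracy(preds, boxes))                # 0.5
print(mean_center_distance(preds, boxes, 1080, 1920))
```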

MLLM architectures comprising vision encoders, language encoders, and fusion/grounding modules have achieved up to 89.6% accuracy (Qwen2.5VL), with fine-tuning yielding +7–10% transfer gains across unseen versions or platforms. Hierarchical GUI element representations and metadata-driven pipelines are critical: visually, positionally, and semantically encoded elements allow models to transfer intent-action mappings despite nontrivial UI refactorings (Lu et al., 23 May 2025).

Empirical approaches to platform-scale interoperability include:

  • Automated Phone-to-TV GUIs: Systematic extraction of phone app UI structure (from atomic elements to row/multi-row groupings), assignment to 12 canonical group templates, and mapping to one of 9 TV-side layout archetypes via a rule table, with demonstrated >90% mapping accuracy (a simplified lookup is sketched below). Final TV UIs are synthesized with a DSL plus constraint solving for card, channel, and grid layouts (Hu et al., 2023).
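The rule-table stage of this pipeline can be pictured as a simple lookup; the template and archetype names below are invented for illustration and do not reproduce the paper's 12-template/9-archetype taxonomy.

```python
# Hypothetical rule table: phone-side group template -> TV-side layout archetype.
RULE_TABLE: dict[str, str] = {
    "single_image_row": "card_row",
    "text_list_row":    "vertical_list",
    "thumbnail_grid":   "grid",
    "hero_banner":      "full_bleed_banner",
}

def map_group_to_tv(template: str, default: str = "grid") -> str:
    """Look up the TV archetype for a recognized phone-side group template,
    falling back to a default archetype for unmatched groups."""
    return RULE_TABLE.get(template, default)

tv_layout = [map_group_to_tv(t) for t in ["hero_banner", "thumbnail_grid", "text_list_row"]]
# -> ["full_bleed_banner", "grid", "vertical_list"]
```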

5. Modular Design, Extensibility, and Plugin Ecosystems

Interoperable GUI systems consistently employ modular and declarative architectures to facilitate extensibility:

  • Plugin Registries: Formal schemas (often JSON or XML) describe each feature/module, the GUI components to instantiate, the platform resources to bind (e.g., ROS topics/services/actions), and access policies (Fresnillo et al., 2024). This enables seamless incorporation of new robotic tools, visualizations, or controls at runtime without recompilation; a registry sketch follows this list.
  • Declarative Option Models: Tools such as GuiLiner and EasyInterface externalize all parameter/flag metadata, GUI widget choices, and documentation, reducing coupling and enabling core front-end logic to remain agnostic to underlying tools (0806.0314, Doménech et al., 2017).
  • Shared Protocols and Output Languages: Web IDEs and plug-ins unify diverse tool backends via HTTP/JSON or XML output-languages, with generic front-end interpreters for dialogs, popups, markers, and interactive highlights.
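A minimal sketch of such a registry, assuming a hypothetical JSON manifest; the field names and validation rules are illustrative and not those of the cited ROS-based interface.

```python
import json

# Hypothetical plugin manifest (field names are assumptions for the sketch):
MANIFEST_JSON = """
{
  "name": "battery_monitor",
  "widget": "gauge",
  "binding": {"kind": "ros_topic", "name": "/battery_state", "type": "sensor_msgs/BatteryState"},
  "access": ["operator", "admin"]
}
"""

class PluginRegistry:
    """Registers declaratively described widgets without touching core GUI code."""
    def __init__(self) -> None:
        self.plugins: dict[str, dict] = {}

    def register(self, manifest_text: str) -> None:
        manifest = json.loads(manifest_text)
        # Validate the minimal contract the core front-end relies on.
        for field in ("name", "widget", "binding"):
            if field not in manifest:
                raise ValueError(f"manifest missing required field: {field}")
        self.plugins[manifest["name"]] = manifest

registry = PluginRegistry()
registry.register(MANIFEST_JSON)
print(registry.plugins["battery_monitor"]["binding"]["name"])  # /battery_state
```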

The principal design philosophy is to encapsulate all tool-specific or domain-specific knowledge in formal descriptors that can be composed, remixed, or updated independently from core GUI logic. This modularity is crucial for scaling GUI interoperability across diverse and evolving system landscapes.

6. Limitations, Challenges, and Future Directions

Despite advances, interoperable GUIs exhibit intrinsic limitations:

  • Dependence on accessibility frameworks: Many abstraction layers (e.g., GOI, AutoGLM) require UIA (Windows), NSAccessibility (macOS), AT-SPI (Linux), or equivalents; legacy/custom-rendered applications and dynamic web content may resist parsing or expose incomplete trees (Wang et al., 6 Oct 2025).
  • Offline modeling and context cost: Building precise GUI navigation forests or agent prompts may require hours of automated and manual work per application version, as well as substantial context length in LLM prompts (Wang et al., 6 Oct 2025).
  • Dynamic or highly interactive UIs: Real-time video controls, ultra-fine-grained drawing, and certain classes of games remain beyond the abstraction boundaries of most frameworks.
  • Cross-device and input modality mismatches: TV GUIs require not just layout changes but also adaptation to D-pad navigation and a different screen aspect ratio; systems such as the phone-to-TV pipeline of (Hu et al., 2023) currently address this only partially.
  • Code/GUI duality friction: In code/GUI-synchronized environments (e.g., mage), generated code quality, state drift, and multi-tool coordination pose open engineering and usability challenges (Kery et al., 2020).

Noted extensions include development of unified cross-OS adapters, dynamic topology discovery with LLMs, self-healing identifier retraining, web/DOM/desktop hybrid declarative primitives, and larger paired cross-device GUI datasets for deep learning-based conversion (Wang et al., 6 Oct 2025, Hu et al., 2023).

7. Synthesis and Best-Practice Recommendations

The contemporary paradigm for interoperable GUIs emphasizes policy–mechanism separation, declarative primitive interfaces, modular plugin architectures, and metadata-driven configuration. Design patterns that enhance interoperability include:

  • Maintaining app/version/platform metadata for robust fine-tuning (Lu et al., 23 May 2025);
  • Employing hierarchical, visually and semantically annotated GUI element representations;
  • Using balanced multi-platform training datasets when ML grounding is required;
  • Favoring declarative, externalized option schemas for new integrations;
  • Leveraging constraint-based or DSL-based code generation for cross-device adaptation (Hu et al., 2023).

Through these techniques, interoperable GUI systems achieve resilient automation, agent compatibility, and cross-device usability in increasingly heterogeneous digital environments. Continued progress in large foundation model planning, platform adapters, declarative primitives, and transfer benchmarks is likely to further close the gap towards universal GUI interoperability.
