AUI-Gym: Agent-native GUI Evaluation Benchmark
- AUI-Gym is a rigorous framework for evaluating agent-native GUI design through automated task synthesis, programmatic verification, and iterative Coder–CUA collaboration.
- It covers 1,560 scenario-driven tasks across six application domains, assessing both functional coverage and multi-step navigation.
- Evaluation relies on quantitative metrics, Function Completeness and CUA Success Rate, which in turn drive iterative improvement of automatically generated GUIs.
AUI-Gym Benchmark is a rigorous, automated suite for evaluating the design and usability of agent-native graphical user interfaces (GUIs). Distinguished from traditional benchmarks that focus on human-centered usability, AUI-Gym targets the capacity of programming agents—specifically Coder LLMs and Computer-Use Agents (CUA)—to collaboratively generate, judge, and refine GUIs for functional task execution. The benchmark covers a breadth of everyday and high-complexity applications across multiple domains, introduces programmatic verification for task solvability, and employs quantitative metrics that directly assess agent performance in multi-step GUI navigation and completion (Lin et al., 19 Nov 2025).
1. Benchmark Structure and Task Composition
AUI-Gym comprises 52 distinct applications partitioned into six domains: App (21%), Landing (19%), Game (17%), Interactive (17%), Tool (13%), and Utility (12%). Examples include "Healthy Meal Tracker" (App), "Nonprofit Impact Report" (Landing), "Typing Rain" (Game), and "Pomodoro Timer" (Utility). Each application is associated with 30 uniquely-crafted tasks, yielding a total of 1,560 scenario-driven challenges.
Task synthesis follows a standardized pipeline:
- Automatic task proposal: For each app, GPT-5 generates 30 realistic, multi-step tasks with specified expected outcomes, scenario categorization (core_function/user_workflow/edge_case), and workflow grounding.
- Human expert curation: Tasks are vetted to remove trivial, ambiguous, or irrelevant cases, ensuring comprehensive coverage of application capabilities and complexity strata.
- Terminal state specification: Each task is paired with a precise success condition, facilitating objective downstream verification and scoring.
This large-scale, high-fidelity task corpus sets a standard for reproducible, application-oriented GUI benchmarking (Lin et al., 19 Nov 2025).
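To make the task format concrete, a single synthesized entry might look roughly like the sketch below; the field names and values are illustrative rather than the benchmark's exact schema.

```python
# Illustrative structure of one synthesized task entry.
# Field names and values are assumptions for exposition, not the benchmark's exact schema.
example_task = {
    "app": "Pomodoro Timer",
    "task": "Start a 25-minute focus session, then pause it after it begins counting down",
    "scenario": "user_workflow",     # one of: core_function / user_workflow / edge_case
    "expected_outcome": "Timer shows a paused state with remaining time below 25:00",
    "terminal_state": "The pause control has been activated and the displayed time is < 25:00",
}
```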
2. Task Verifier and Success Invariants
Central to AUI-Gym is the Task Verifier, an architecture leveraging GPT-5 for dynamic HTML parsing and programmatic assessment of task feasibility. At evaluation, the Verifier analyzes a given application's GUI and a requested task. If the HTML structure and available DOM elements permit the task, the Verifier synthesizes a JavaScript checker function—an executable code fragment that determines task completion directly in the browser context.
Verifier operation can be summarized as:
- Parse DOM from GUI HTML.
- Reason over the presence and attributes of required elements.
- If all necessary interaction paths exist, emit a specialized JavaScript checker for task invariants.
- If not, reject the task as infeasible for the current UI.
For example, for the task “Create a habit named ‘Meditate 5 min,’ then view today’s chart,” the Verifier might output:
```javascript
function checker() {
  return Array.from(document.querySelectorAll('#gridContainer .item'))
    .some(el => el.textContent.includes('Meditate 5 min'));
}
```
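Because the checker is plain JavaScript evaluated in the browser context, it can be run after a navigation attempt with any browser automation layer. A minimal sketch using Playwright's Python API is shown below; the checker source, URL, and harness structure are placeholders, and the released evaluation scripts may differ.

```python
# Minimal sketch: evaluating a Verifier-generated checker with Playwright (Python API).
# The checker source, URL, and flow are assumptions; the benchmark's actual harness may differ.
from playwright.sync_api import sync_playwright

CHECKER_JS = """
() => Array.from(document.querySelectorAll('#gridContainer .item'))
        .some(el => el.textContent.includes('Meditate 5 min'))
"""

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("http://localhost:8000/index.html")  # locally served generated GUI (assumed)
    # ... CUA navigation steps would run here ...
    passed = page.evaluate(CHECKER_JS)             # True if the task's terminal state holds
    print("task passed:", passed)
    browser.close()
```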
3. Coder–CUA Collaboration and Feedback Loop
AUI-Gym formalizes iterative UI design as a Markov Design Process. The system state $s_t$ encodes the current HTML/CSS/JS environment. Coder actions $a_t$ represent code revisions (HTML/CSS/JS patches) proposed by a coding-oriented LLM policy $\pi_{\text{coder}}$. Environment transitions apply these revisions, resulting in a new state $s_{t+1}$.
Evaluation and feedback are provided by the CUA judge via two mechanisms:
- Task Solvability: For each task $\tau_i$, the Verifier checks whether a checker function $c_i$ can be synthesized. Unsupported tasks are aggregated, and the Coder is prompted with a natural-language summary of the missing features required for environment revision.
- Navigation Success: For tasks deemed solvable, the CUA executes stepwise navigation, recording trajectories, identifying failure states, and capturing action sequences. Failures, screenshots, and action traces are collated and distilled for actionable feedback.
The Coder aims to maximize the expected cumulative reward $\mathbb{E}\left[\sum_t r_t\right]$, where the reward $r_t$ is determined by CUA feedback. Function-completeness and navigation-success metrics directly inform the next iteration of UI revision, ensuring the loop is both automated and data-driven (Lin et al., 19 Nov 2025).
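A high-level sketch of this revision loop is given below. All object and method names are assumptions introduced for exposition; the released `scripts/loop.py` is the authoritative implementation.

```python
# Illustrative outline of the Coder-CUA revision loop; names are assumptions, not the real API.
def revision_loop(coder, cua, verifier, tasks, rounds=3):
    ui = coder.generate_initial_ui(tasks)        # initial state s_0: HTML/CSS/JS bundle
    fc = sr = 0.0
    for _ in range(rounds):
        # Task solvability: try to synthesize a checker for every task.
        checkers = [verifier.synthesize_checker(ui, t) for t in tasks]
        solvable = [(t, c) for t, c in zip(tasks, checkers) if c is not None]
        unsupported = [t for t, c in zip(tasks, checkers) if c is None]

        # Navigation: the CUA attempts each solvable task and runs its checker.
        results = [cua.navigate_and_check(ui, t, c) for t, c in solvable]

        fc = len(solvable) / len(tasks)                    # Function Completeness
        sr = sum(r.passed for r in results) / len(tasks)   # CUA Success Rate

        # Feedback: missing features plus distilled navigation failures (traces, dashboard).
        feedback = {
            "missing_features": unsupported,
            "failed_runs": [r for r in results if not r.passed],
        }
        ui = coder.revise(ui, feedback)                    # action a_t -> next state s_{t+1}
    return ui, fc, sr
```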
4. CUA Dashboard: Visual Feedback Compression
The CUA Dashboard serves as a compact, interpretable summary of the agent’s navigation in failed or successful task attempts. The compression process:
- Identifies regions of interest by extracting bounding boxes of all GUI elements interacted with (clicked, typed, focused).
- Crops these element regions from successive screenshot frames.
- Tiles the cropped regions onto a fixed canvas, ordered temporally, with later steps assigned more area to emphasize recency and criticality.
The result is an average visual token reduction of approximately 76.2%, efficiently highlighting only those UI regions relevant to agent failures or user journeys. This compressed single-image summary assists developers (human or model-based Coder agents) in rapid diagnosis and iterative redesign (Lin et al., 19 Nov 2025).
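A simplified sketch of the crop-and-tile idea, using Pillow, is shown below; the canvas size and the linear weighting of later steps are assumptions rather than the paper's exact layout policy.

```python
# Simplified sketch of the CUA Dashboard idea: crop the interacted regions from
# per-step screenshots and tile them left-to-right, giving later steps more width.
# Canvas size and weighting are assumptions, not the paper's exact algorithm.
from PIL import Image

def build_dashboard(screenshots, boxes, canvas_size=(1280, 360)):
    """screenshots: one PIL Image per step; boxes: (left, top, right, bottom) per step."""
    canvas = Image.new("RGB", canvas_size, "white")
    weights = [i + 1 for i in range(len(screenshots))]   # later steps get more area
    total = sum(weights)
    x = 0
    for shot, box, w in zip(screenshots, boxes, weights):
        tile_w = max(1, canvas_size[0] * w // total)
        crop = shot.crop(box).resize((tile_w, canvas_size[1]))
        canvas.paste(crop, (x, 0))
        x += tile_w
    return canvas
```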
5. Evaluation Protocols and Empirical Outcomes
AUI-Gym enforces robust, automatic measurement via two principal quantitative metrics:
- Function Completeness (FC): the fraction of all tasks for which the Verifier successfully generates a checker.
- CUA Success Rate (SR): the fraction of all tasks for which the CUA's executed trajectory passes the checker.
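Written out, with $N$ the total number of tasks, $N_{\text{check}}$ the number for which a checker is successfully generated, and $N_{\text{pass}}$ the number whose checker passes after CUA execution:

$$\mathrm{FC} = \frac{N_{\text{check}}}{N}, \qquad \mathrm{SR} = \frac{N_{\text{pass}}}{N}$$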
Empirical results for GPT-5 and variants are summarized below:
| Domain | FC @ Baseline | FC @ Integrated | SR @ Baseline | SR @ Integrated |
|---|---|---|---|---|
| Overall | 67.9% | 81.5% | 24.5% | 26.0% |
Integrated task and navigation revision increase FC substantially (+13.6 pp), with smaller but meaningful boosts to SR (+1.5 pp). Weaker coding models (Qwen3, GPT-4o) benefit even more in SR (up to +11.7 pp). Ablation studies show that solvability feedback is the primary driver for FC improvement, whereas navigation guidance and the CUA Dashboard are critical for SR gains (Lin et al., 19 Nov 2025).
6. Implementation, Usage, and Open Resources
AUI-Gym is released under open code and data licensing, with reproducibility facilitated via explicit installation and execution steps:
```bash
git clone https://github.com/showlab/AUI.git
cd AUI
pip install -r requirements.txt
npx playwright install
```
- Synthesizing tasks and verifiers: `python scripts/propose_tasks.py --model gpt5`
- UI generation: `python scripts/generate_ui.py --coder gpt5 --queries queries.json`
- CUA and verifier testing: `python scripts/evaluate.py --cua uitars --tasks tasks.json --output results/`
- Full revision loop: `python scripts/loop.py --coder gpt5 --cua uitars --rounds 3`
Directory structure includes user query templates, task specs with checkers, verifier scripts, coder wrappers for multiple models, CUA policy definitions, and dashboard visualization utilities. A public demo is available at https://huggingface.co/spaces/showlab/AUI (Lin et al., 19 Nov 2025).
7. Significance and Implications
By combining large-scale, language-model-driven task synthesis with programmatic verification and iterative Coder–CUA feedback loops, AUI-Gym establishes a new standard for benchmarking the autonomous design and usability of GUIs under agent-driven rather than human-usability-centric requirements. This approach enables scalable, fully automatic evaluation and optimization of user interfaces so that computational agents can shift from passive tool-users to active participants—capable of diagnosing failures, revising workflows, and directly influencing interface evolution. A plausible implication is that future digital environments may be increasingly tailored for optimal agent navigation and task reliability rather than exclusively for human usability (Lin et al., 19 Nov 2025).