UI-Ins-7B: 7B GUI Grounding Model
- UI-Ins-7B is a 7-billion-parameter GUI grounding model that dynamically interprets natural-language instructions using multi-perspective reasoning to enhance accuracy.
- It integrates supervised fine-tuning and reinforcement learning to refine instruction synthesis and overcome challenges like ambiguity and mismatch in GUI datasets.
- Empirical evaluations on multiple benchmarks demonstrate significant improvements in grounding performance, especially for ambiguous and implicit instructions.
UI-Ins-7B is a 7-billion-parameter GUI grounding model developed to advance the accuracy and reliability of graphical user interface (GUI) understanding, particularly in the context of mapping natural-language instructions to actionable UI elements. Designed within the “Instruction-as-Reasoning” paradigm, UI-Ins-7B approaches user instructions not as static strings, but as dynamic analytical reasoning pathways that enable multi-perspective interpretation, selective pathway composition, and improved inference-time decision making. This framework addresses previously underexplored challenges related to instruction quality, ambiguity, and expressivity in GUI grounding datasets, and represents a significant methodological shift in the training and inference of GUI agents (Chen et al., 23 Oct 2025).
1. Motivation and Empirical Diagnosis
The motivation for UI-Ins-7B originates from systematic shortcomings observed in prior GUI grounding systems and datasets. Manual inspection of 1,909 samples from datasets such as OS-Atlas, Widget Captioning, and AMEX revealed that 23.3% of instructions exhibited fundamental flaws, including ambiguity (one-to-many correspondence), mismatch (no referent on screen), or other inaccuracies. Poor-quality instructions not only degrade training but also introduce noise directly into instruction–screenshot associations and grounding labels.
Empirical investigations further demonstrated that instruction diversity is a powerful lever for grounding performance. When ScreenSpot-Pro benchmark instructions were rewritten from four perspectives (appearance, functionality, location, intent), alternative formulations often outperformed originals, with an oracle “Combined” scenario—choosing the best perspective per sample—yielding a 76% relative improvement. This highlighted that many failures are not attributable to weak visual models, but to the suboptimality of static, single-viewpoint instructions, underscoring the need for models capable of dynamically reinterpreting or composing instructions (Chen et al., 23 Oct 2025).
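The oracle "Combined" scenario can be made concrete with a small sketch. The data below is invented for illustration, and the assumption that the oracle ranges over the four rewritten perspectives only (not the original instruction) is ours; the helper names are hypothetical.

```python
from typing import Dict, List


def accuracy(hits: List[bool]) -> float:
    """Fraction of samples grounded correctly."""
    return sum(hits) / len(hits)


def oracle_combined(per_perspective_hits: Dict[str, List[bool]]) -> List[bool]:
    """A sample counts as solved if at least one perspective grounds it correctly."""
    return [any(column) for column in zip(*per_perspective_hits.values())]


def relative_improvement(baseline: float, improved: float) -> float:
    """Improvement expressed as a fraction of the baseline accuracy."""
    return (improved - baseline) / baseline


# Toy per-sample outcomes (True = click landed in the target box). Illustrative only.
original_hits = [True, False, False, False, True, False]
rewrite_hits = {
    "appearance":    [True, True, False, False, False, False],
    "functionality": [False, False, True, False, True, False],
    "location":      [True, False, False, False, True, True],
    "intent":        [False, False, False, False, True, False],
}

combined_hits = oracle_combined(rewrite_hits)
gain = relative_improvement(accuracy(original_hits), accuracy(combined_hits))
print(f"original={accuracy(original_hits):.3f} "
      f"combined={accuracy(combined_hits):.3f} relative gain={gain:.0%}")
```

The oracle is an upper bound: it needs the ground truth to pick the winning perspective, which is why the model must learn to make that choice itself.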
2. System Architecture and Training Procedure
UI-Ins-7B implements a dual-stage training pipeline, integrating both supervised and reinforcement learning to support multi-perspective instruction reasoning and inference-time pathway selection.
Data Pipeline and Preprocessing
- Data Sources: Aggregated from OS-Atlas, Omniact, Android Control, AMEX, and AgentNet, providing coverage across Windows, macOS, Linux, and Android GUIs.
- Cleaning: OmniParser V2 is used to detect UI elements and refine ground-truth bounding boxes via IoU-based heuristic filtering, reducing post-processed instruction flaw rates to below 8%.
- Instruction Augmentation: GPT-4.1 synthesizes four distinct instruction formulations—appearance-based, function-based, spatial/location-based, and intent/goal-based—for each target element. Uniqueness verification is conducted with additional GPT-4.1 passes to ensure unambiguous referencing (Chen et al., 23 Oct 2025).
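The cleaning step can be sketched as follows, assuming boxes in `(x1, y1, x2, y2)` pixel format. The source does not specify the exact heuristic, so the `min_iou` threshold and the "snap to best-overlapping detection or drop the sample" rule are illustrative assumptions, and `refine_box` is a hypothetical helper rather than part of OmniParser V2.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def refine_box(gt: Box, detected: List[Box], min_iou: float = 0.5) -> Optional[Box]:
    """Return the detected element that best overlaps the annotated box.

    Returns None when no detection reaches min_iou, which a pipeline could
    treat as a signal to discard the sample. min_iou is an assumed value.
    """
    best = max(detected, key=lambda d: iou(gt, d), default=None)
    if best is None or iou(gt, best) < min_iou:
        return None
    return best


annotated = (0.0, 0.0, 10.0, 10.0)
detections = [(1.0, 1.0, 10.0, 10.0), (50.0, 50.0, 60.0, 60.0)]
print(refine_box(annotated, detections))
```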
Stage 1: Supervised Fine-tuning (SFT)
SFT is used to instill multi-perspective reasoning. The model is trained to output a structured sequence consisting of:
- Reasoning trace: typically a reformulated instruction from a specific perspective.
- Structured grounding output: a JSON-formatted action (e.g., click) and coordinate.
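Formally, for each instance $(S, I_{\text{in}}, I_{\text{r}}, p)$, with screenshot $S$ and input instruction $I_{\text{in}}$, the maximization objective is:

$$\max_{\theta} \; \log P_{\theta}\left(I_{\text{r}},\, p \mid S,\, I_{\text{in}}\right)$$

where $I_{\text{r}}$ is a reasoning text from one perspective and $p$ is the grounded coordinate.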
A key implementation detail is that two distinct instruction perspectives are chosen at random for each training instance, one as the input and the other as the model’s reasoning target, which forces the model to translate between perspectives (Chen et al., 23 Oct 2025).
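A minimal sketch of this sampling step is shown below. The field names, the newline-joined target layout, and the JSON keys (`action`, `coordinate`) are assumptions made for illustration; the source only states that the target contains a reasoning trace followed by a JSON-formatted action and coordinate.

```python
import json
import random
from typing import Dict, Tuple


def build_sft_example(
    instructions: Dict[str, str],
    point: Tuple[int, int],
    rng: random.Random,
) -> Dict[str, str]:
    """Pair two distinct perspectives: one as input, one as reasoning target."""
    # random.Random.sample draws without replacement, so the two views differ.
    input_view, reasoning_view = rng.sample(sorted(instructions), 2)
    # Hypothetical output layout: reasoning text, then a JSON action.
    action = json.dumps({"action": "click", "coordinate": list(point)})
    return {
        "input_perspective": input_view,
        "reasoning_perspective": reasoning_view,
        "prompt": instructions[input_view],
        "target": instructions[reasoning_view] + "\n" + action,
    }


# Four formulations of the same target element (illustrative strings).
views = {
    "appearance": "the red X",
    "functionality": "close the file manager",
    "location": "top-right button",
    "intent": "get rid of this screen",
}
example = build_sft_example(views, (1890, 12), random.Random(0))
print(example["prompt"], "->", example["target"])
```

Because the input and the reasoning target never share a perspective, every training example requires the model to restate the instruction rather than copy it.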
Stage 2: Reinforcement Learning (GRPO)
Reinforcement learning via Group Relative Policy Optimization (GRPO) enables pathway selection and synthesis at inference. No explicit enumeration of perspectives occurs during RL: the model is simply prompted to “think,” and is rewarded solely according to grounding success (point-in-box). The policy improvement objective normalizes the reward advantage among multiple rollouts via Z-scoring:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}\left(\{r_j\}_{j=1}^{N}\right)}{\operatorname{std}\left(\{r_j\}_{j=1}^{N}\right)}$$

where $r_i$ is the individual rollout reward and $N$ is the rollout count (Chen et al., 23 Oct 2025).
No explicit inference-time search or perspective-ensemble is prescribed; selection is emergent from the RL-trained policy’s reasoning trace.
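The reward and the group-relative normalization can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' training code: the use of the population standard deviation and the small `eps` term (added to avoid division by zero when all rollouts receive the same reward) are our choices.

```python
import statistics
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def point_in_box_reward(point: Tuple[float, float], box: Box) -> float:
    """Binary grounding reward: 1.0 if the predicted point lies inside the target box."""
    x, y = point
    x1, y1, x2, y2 = box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Z-score each rollout's reward against the other rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


target_box = (0.0, 0.0, 10.0, 10.0)
rollout_points = [(5.0, 5.0), (50.0, 50.0), (9.0, 1.0), (20.0, 3.0)]
rewards = [point_in_box_reward(p, target_box) for p in rollout_points]
advantages = group_relative_advantages(rewards)
print(rewards, [round(a, 3) for a in advantages])
```

When every rollout in a group succeeds or every rollout fails, all advantages are zero and the group contributes no gradient signal, which is one reason diverse rollouts matter during RL.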
3. Multi-Perspective Reasoning: Instruction-as-Reasoning
UI-Ins-7B’s key innovation is treating instructions as dynamic reasoning pathways (Instruction-as-Reasoning), supporting the following analytic views:
- Appearance: E.g., “the red X”
- Functionality: E.g., “close the file manager”
- Location: E.g., “top-right button”
- Intent/Goal: E.g., “get rid of this screen”
The model is explicitly trained to rewrite or reinterpret instructions into the perspective best suited for a given screenshot’s cues. Following RL, compositional or synthesized pathways that mix multiple perspectives emerge, with the model often integrating appearance, function, and intent within single reasoning traces (Chen et al., 23 Oct 2025).
4. Benchmarks and Empirical Performance
UI-Ins-7B is evaluated on five GUI grounding benchmarks and one agentic benchmark:
| Benchmark | UI-Ins-7B | Notable Comparators |
|---|---|---|
| MMBench-GUI L2 | 83.1 | InfiGUI-G1-7B: 80.8 |
| UI-I2E-Bench | 81.1 | GTA1-7B: 78.2 |
| ScreenSpot-Pro | 52.2 | InfiGUI-G1-7B: 51.9 |
| ScreenSpot-V2 | 94.0 | UI-Venus-7B: 94.1 |
| Showdown | 73.1 | GTA1-32B: 71.1 |
| AndroidWorld (success rate) | 74.1 | UI-TARS-2: 73.3 |
Key findings:
- Substantive improvements in implicit, advanced, or ambiguous instruction scenarios.
- On AndroidWorld, used as an action executor under a GPT-5 planner, UI-Ins-7B delivered a 24.1 point gain over a Qwen2.5-VL-7B executor, highlighting the relevance of accurate grounding for end-to-end agent performance (Chen et al., 23 Oct 2025).
- Ablations show both SFT and RL, and crucially the intermediate reasoning step, are necessary for peak performance; removing reasoning reduces UI-I2E accuracy by over 10%.
- Structured, instruction-like reasoning outperforms unconstrained free-form reasoning: RL with free-form CoT can degrade grounding (e.g., UI-TARS-1.5-7B drops from 50.1 to 46.9), whereas using Instruction-as-Reasoning improves results.
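To read the table at a glance, the margins over each listed comparator can be computed directly from the reported scores. The numbers are copied from the table above; the dictionary layout is just a convenience for this sketch.

```python
# (UI-Ins-7B score, comparator name, comparator score), as reported in the table.
results = {
    "MMBench-GUI L2": (83.1, "InfiGUI-G1-7B", 80.8),
    "UI-I2E-Bench": (81.1, "GTA1-7B", 78.2),
    "ScreenSpot-Pro": (52.2, "InfiGUI-G1-7B", 51.9),
    "ScreenSpot-V2": (94.0, "UI-Venus-7B", 94.1),
    "Showdown": (73.1, "GTA1-32B", 71.1),
    "AndroidWorld": (74.1, "UI-TARS-2", 73.3),
}

# Positive margin: UI-Ins-7B is ahead; negative: the comparator is ahead.
margins = {
    name: round(ours - theirs, 1)
    for name, (ours, _, theirs) in results.items()
}

for name, margin in margins.items():
    comparator = results[name][1]
    print(f"{name:16s} {margin:+.1f} vs {comparator}")
```

On these figures UI-Ins-7B leads its listed comparator on five of the six benchmarks and trails UI-Venus-7B by 0.1 on ScreenSpot-V2.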
5. Emergent Reasoning Behaviors and Analysis
Post-RL, UI-Ins-7B exhibits three emergent reasoning capabilities:
- Strategic Perspective Selection: Adapts reasoning style to context.
- Compositional Integration: Concatenates or fuses multiple perspectives in one trace.
- Emergence of New Perspectives: Generates analytical frames outside the training set (e.g., group affiliation, UI element state).
A 10-way taxonomy is used post hoc for analysis: appearance, functionality, location, intent, structural relationship, state, component type, sequential position, salience, accessibility.
Empirical studies confirm reasoning-on-instruction helps only when constrained to structured, GUI-relevant forms; generic CoT is detrimental. Additionally, the diverse SFT phase serves as an exploratory warm-up, mitigating policy collapse during RL. Without the Instruction-as-Reasoning stage, standard SFT+RL collapses policy outputs in harder settings (Chen et al., 23 Oct 2025).
6. Error Analysis, Limitations, and Practical Caveats
Principal error types observed include:
- Domain-specific knowledge gaps: E.g., inability to map brand entities.
- Layout understanding limits: Semantic target identified but clickable region missed.
- Visual ambiguity/hallucination: Confusion among visually similar icons.
Other practical considerations:
- Heavy reliance on GPT-4.1 for instruction synthesis and uniqueness verification.
- Absence of explicit architectural details for grounding heads or visual encoders in the provided documentation.
- No explicit inference-time multi-perspective search or ensembling; pathway selection is end-to-end and implicit.
7. Comparative Assessment and Deployment Context
UI-Ins-7B is positioned as a strong 7B-scale GUI grounding model, particularly in settings where instructions are ambiguous, implicit, or semantically indirect. For scenarios requiring compact, accurate grounding in dynamic agent systems, e.g., as the executor in a modular agentic framework, UI-Ins-7B delivers state-of-the-art accuracy among models of comparable scale, especially on implicit and advanced instructions (Chen et al., 23 Oct 2025).
Comparison with related models and directions for improvement:
- The larger UI-Ins-32B variant achieves the strongest overall results across all tested benchmarks.
- Further improvement is suggested to lie in explicit UI element understanding, as articulated by “UI-in-the-Loop” frameworks (Li et al., 8 Apr 2026), or in improved planning/memory mechanisms (Liu et al., 1 Oct 2025), though UI-Ins-7B already leads its listed comparators on most of the standard grounding benchmarks.
In sum, UI-Ins-7B reframes GUI grounding around dynamic, perspective-flexible interpretation of instructions: SFT instills multi-perspective reasoning, and RL with a grounding-only reward refines how those perspectives are selected and combined. The resulting model reports strong, and on most listed benchmarks state-of-the-art, performance across a range of challenging GUI environments (Chen et al., 23 Oct 2025).