GUI-G²: Gaussian Reward Modeling for GUI Grounding
- GUI-G² is a spatial reasoning paradigm that maps natural language instructions to precise GUI coordinates through continuous Gaussian reward modeling.
- It integrates Gaussian point and coverage rewards with adaptive variance, reducing prediction error significantly compared to binary reward models.
- Empirical results show that GUI-G² reaches state-of-the-art accuracy on the ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks and converges more smoothly under reinforcement learning than binary-reward baselines.
Graphical User Interface Gaussian Grounding (GUI-G²) is a spatial reasoning and reward modeling paradigm for GUI grounding tasks, wherein natural language instructions are mapped to precise locations on a user interface for autonomous interaction. Unlike previous reinforcement learning (RL) approaches that rely on sparse, binary rewards, GUI-G² introduces continuous, geometry-aware Gaussian reward models better aligned with the spatial characteristics of human-computer interaction. The paradigm has driven state-of-the-art advancements in vision-language agents, facilitating robust, generalizable grounding across diverse interface types and resolutions (Tang et al., 21 Jul 2025, Zhao et al., 6 Feb 2026).
1. Motivation and Conceptual Advances
Traditional RL-based GUI grounders assign binary rewards based on whether a predicted point lies inside or outside a target element’s bounding box. This approach produces sparse signals that lack spatial sensitivity: a prediction one pixel outside the box receives the same zero reward as one on the opposite side of the screen, so near-miss errors yield no informative gradient. GUI element interaction is fundamentally planar, not pointwise; human click data reveals a continuous, approximately Gaussian distribution centered on element centroids. GUI-G² operationalizes this by modeling target regions as two-dimensional Gaussian fields, yielding rewards that decay exponentially with the squared, size-normalized distance between the predicted action and the ground-truth location. This produces dense gradients across the interface, accelerates RL convergence, and substantially improves alignment with human spatial tolerance (Tang et al., 21 Jul 2025).
2. Core Reward Formulation
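GUI-G² builds the reward from two complementary Gaussian measures, scaled by an adaptive variance and combined into a single composite:
- Gaussian Point Reward: Let $\mu^{gt} = (c_x^{gt}, c_y^{gt})$ denote the ground-truth bounding box center, with element-parameterized covariance
$$\Sigma^{gt} = \begin{pmatrix} (\sigma_x^{gt})^2 & 0 \\ 0 & (\sigma_y^{gt})^2 \end{pmatrix}.$$
For a predicted center $\mu^{p} = (c_x^{p}, c_y^{p})$,
$$R_{\text{point}} = \exp\left(-\frac{1}{2}\left[\frac{(c_x^{p} - c_x^{gt})^2}{(\sigma_x^{gt})^2} + \frac{(c_y^{p} - c_y^{gt})^2}{(\sigma_y^{gt})^2}\right]\right).$$
This yields maximal reward at perfect alignment, decaying with distance from the target along each axis at a rate set by the element's width and height.
- Gaussian Coverage Reward: To account for both centrality and coverage, the Bhattacharyya coefficient is used to measure the overlap of predicted and ground-truth Gaussians:
$$R_{\text{coverage}} = \exp\left(-\frac{1}{8}(\mu^{p} - \mu^{gt})^{\top}\Sigma^{-1}(\mu^{p} - \mu^{gt}) - \frac{1}{2}\ln\frac{\det\Sigma}{\sqrt{\det\Sigma^{p}\,\det\Sigma^{gt}}}\right),$$
with $\Sigma = (\Sigma^{p} + \Sigma^{gt})/2$ as the mean covariance.
- Adaptive Variance Mechanism: To maintain scale-invariant spatial tolerance, variances are set adaptively:
$$\sigma_x = \alpha\,(x_2 - x_1), \qquad \sigma_y = \alpha\,(y_2 - y_1),$$
where $(x_1, y_1, x_2, y_2)$ is the box and the scale factor $\alpha$ is set empirically.
- Composite Reward:
$$R = \nu\,R_{\text{point}} + \gamma\,R_{\text{coverage}},$$
with the weights $\nu$ and $\gamma$ chosen for balanced weighting.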
Ablations confirm these components jointly produce the most stable and accurate models in GUI grounding benchmarks (Tang et al., 21 Jul 2025).
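The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the default values `alpha=0.5`, `nu=1.0`, and `gamma=1.0` are placeholder assumptions, and both boxes are assumed to have strictly positive width and height. Because the covariances are diagonal, the matrix inverse and determinant reduce to per-axis arithmetic.

```python
import math

def box_gaussian(box, alpha=0.5):
    """Return (mean, per-axis variances) for an axis-aligned box.
    alpha is an assumed placeholder for the adaptive scale factor."""
    x1, y1, x2, y2 = box
    mean = ((x1 + x2) / 2.0, (y1 + y2) / 2.0)
    var = ((alpha * (x2 - x1)) ** 2, (alpha * (y2 - y1)) ** 2)
    return mean, var

def point_reward(pred_box, gt_box, alpha=0.5):
    """Gaussian point reward: 1.0 at the target centre, decaying with distance."""
    (px, py), _ = box_gaussian(pred_box, alpha)
    (gx, gy), (vx, vy) = box_gaussian(gt_box, alpha)
    return math.exp(-0.5 * ((px - gx) ** 2 / vx + (py - gy) ** 2 / vy))

def coverage_reward(pred_box, gt_box, alpha=0.5):
    """Bhattacharyya coefficient between the two box Gaussians."""
    (px, py), (pvx, pvy) = box_gaussian(pred_box, alpha)
    (gx, gy), (gvx, gvy) = box_gaussian(gt_box, alpha)
    # Mean covariance (diagonal), so its determinant is the product of its entries.
    mvx, mvy = (pvx + gvx) / 2.0, (pvy + gvy) / 2.0
    mahalanobis = ((px - gx) ** 2 / mvx + (py - gy) ** 2 / mvy) / 8.0
    log_det = 0.5 * math.log((mvx * mvy) / math.sqrt(pvx * pvy * gvx * gvy))
    return math.exp(-(mahalanobis + log_det))

def composite_reward(pred_box, gt_box, nu=1.0, gamma=1.0, alpha=0.5):
    """Weighted sum of the point and coverage rewards (weights are assumptions)."""
    return (nu * point_reward(pred_box, gt_box, alpha)
            + gamma * coverage_reward(pred_box, gt_box, alpha))

gt = (100, 100, 200, 140)
print(composite_reward(gt, gt))                    # identical boxes: both terms are maximal
print(composite_reward((110, 100, 210, 140), gt))  # slight shift: slightly lower reward
print(composite_reward((600, 400, 700, 440), gt))  # far miss: reward close to zero
```

Unlike a binary inside/outside check, the slightly shifted prediction still earns a reward close to the maximum, and the far miss earns a small but nonzero one, which is what gives the policy a usable learning signal.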
3. Model Architectures and Learning Pipeline
The GUI-G² framework has been instantiated in multiple high-performance models, prominently including POINTS-GUI-G-8B (Zhao et al., 6 Feb 2026):
- Vision Encoder: Qwen2-VL-NaViT backbone with 32 transformer layers and patch size $14 \times 14$.
- Projector / Cross-Modal Adapter: Linear mapping and cross-attention layers synchronize vision and language modalities.
- LLM: Qwen3-8B (36 layers, hidden size 4096), autoregressively generating coordinate strings.
- Tokenization and Decoding: Both (x, y) points and (x₀, y₀, x₁, y₁) boxes are supported, as normalized JSON tuples.
The input (image, instruction) pair is tokenized, encoded jointly via cross-modal transformers, and decoded via autoregressive coordinate prediction.
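Because the model emits coordinates as text, a downstream agent has to parse the generated string back into pixel positions. The sketch below shows one way to do that. It assumes the output is a bare JSON array normalized to the range 0 to 1, which the text above does not specify, and `parse_prediction` is a hypothetical helper name.

```python
import json

def parse_prediction(text, width, height):
    """Parse a generated coordinate string and rescale it to pixels.

    Assumes `text` is a JSON array of 2 numbers (point) or 4 numbers (box),
    each normalized to [0, 1]; this format is an assumption for illustration.
    """
    values = json.loads(text)
    if len(values) == 2:
        x, y = values
        return "point", (x * width, y * height)
    if len(values) == 4:
        x0, y0, x1, y1 = values
        return "box", (x0 * width, y0 * height, x1 * width, y1 * height)
    raise ValueError(f"expected 2 or 4 coordinates, got {len(values)}")

print(parse_prediction("[0.5, 0.25]", 1920, 1080))
print(parse_prediction("[0.25, 0.5, 0.75, 1.0]", 1920, 1080))
```

Keeping the model output normalized and rescaling only at this final step lets the same prediction format serve screenshots of any resolution.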
4. Data Engineering and Task Curation
GUI grounding datasets are characterized by heterogeneous annotation formats, spatial scales, and annotation noise. The POINTS-GUI-G pipeline (Zhao et al., 6 Feb 2026) standardizes these through:
- Format Unification: All tasks are recast as center-point or bounding-box localization, with coordinates normalized to a common range.
- Noise Reduction: OmniParser-v2 extracts candidate elements; coverage filtering removes imprecise samples.
- Complexity Enhancement: Layout entropy measures (combining 1D projection and 2D grid entropies) partition data into Easy/Medium/Hard tiers, and the data is then augmented synthetically (e.g., HTML renderings, overlaying windows) to increase representational diversity.
This disciplined data engineering, particularly entropy-based complexity stratification and curated augmentation, substantially boosts model robustness and accounts for a gain of over 10 accuracy points (Zhao et al., 6 Feb 2026).
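One plausible reading of the layout entropy measure is sketched below. The exact definition, bin count, combination rule, and tier thresholds are not given in the text, so everything here is an assumption for illustration: element centers (normalized to 0 to 1) are histogrammed along each axis and on a 2D grid, Shannon entropy is computed for each histogram, and the two kinds of entropy are averaged.

```python
import math
from collections import Counter

def shannon_entropy(counts):
    """Shannon entropy (in bits) of a collection of histogram counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def layout_entropy(centers, bins=4):
    """Average of 1D projection entropy and 2D grid entropy.
    `centers` is a list of (x, y) element centres normalized to [0, 1].
    The bin count and the equal-weight average are hypothetical choices."""
    def bin_of(v):
        return min(int(v * bins), bins - 1)
    x_hist = Counter(bin_of(x) for x, _ in centers)
    y_hist = Counter(bin_of(y) for _, y in centers)
    grid_hist = Counter((bin_of(x), bin_of(y)) for x, y in centers)
    h_projection = 0.5 * (shannon_entropy(x_hist.values())
                          + shannon_entropy(y_hist.values()))
    h_grid = shannon_entropy(grid_hist.values())
    return 0.5 * (h_projection + h_grid)

def tier(entropy, low=1.0, high=2.0):
    """Map an entropy score to a difficulty tier (thresholds are hypothetical)."""
    if entropy < low:
        return "Easy"
    return "Medium" if entropy < high else "Hard"

sparse = [(0.5, 0.5)]
spread = [(0.1, 0.1), (0.35, 0.35), (0.6, 0.6), (0.85, 0.85)]
print(tier(layout_entropy(sparse)), tier(layout_entropy(spread)))
```

A screen with a single element scores zero entropy, while elements spread across many grid cells score higher, which matches the intuition that cluttered layouts are harder to ground.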
5. Reinforcement Learning Formulation
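GUI-G² employs Group Relative Policy Optimization (GRPO), a PPO variant, to maximize the composite Gaussian reward.
- Multi-Rollout Normalized Advantage: For each instruction, $N$ rollouts are sampled; the advantage of rollout $i$ with reward $r_i$ is normalized within its group:
$$A_i = \frac{r_i - \operatorname{mean}(\{r_j\}_{j=1}^{N})}{\operatorname{std}(\{r_j\}_{j=1}^{N})}$$
- RL Objective:
$$J(\theta) = \mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N}\min\left(\rho_i A_i,\ \operatorname{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\,A_i\right) - \beta\, D_{\mathrm{KL}}\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)\right], \qquad \rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},$$
with typical hyperparameters including a batch size of 8 and a KL penalty coefficient $\beta = 0.04$.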
When combined with rich, dense Gaussian rewards, this setup yields smoother, monotonic convergence, in contrast to the oscillatory and brittle behavior observed under binary rewards. Convergence analysis demonstrates that dense gradients reduce the average error in center prediction from ~290 px to ~150 px during early training, enhancing both sample and compute efficiency (Tang et al., 21 Jul 2025).
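The effect of reward density on the GRPO advantage can be seen in a small sketch. The function names are hypothetical, and the stabilizing `eps` and the clipping range `clip_eps=0.2` are assumed values that the text does not state.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same instruction.
    eps (an assumed constant) avoids division by zero when all rewards match."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_term(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for one rollout; clip_eps is an assumption."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Four rollouts that all miss the target box.
binary_rewards = [0.0, 0.0, 0.0, 0.0]      # binary reward: no rollout stands out
gaussian_rewards = [0.1, 0.2, 0.3, 0.4]    # Gaussian reward: nearer misses score higher

print(group_advantages(binary_rewards))    # all zeros, so no learning signal
print(group_advantages(gaussian_rewards))  # negative for far misses, positive for near ones
```

When every rollout misses, the binary reward gives identical scores and every advantage collapses to zero, whereas the Gaussian reward still ranks the rollouts by how close they came.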
6. Empirical Results and Comparative Evaluation
GUI-G² models set new state-of-the-art results for GUI grounding accuracy, surpassing prior leaders including UI-TARS-72B despite having roughly ten times fewer parameters. Table 1 summarizes representative accuracy results on three benchmarks (Tang et al., 21 Jul 2025):
| Model | ScreenSpot | ScreenSpot-v2 | ScreenSpot-Pro |
|---|---|---|---|
| UI-TARS-72B | 88.4% | 90.3% | 38.1% |
| GUI-G²-7B | 92.0% | 93.3% | 47.5% |
| Absolute gain (percentage points) | +3.6 | +3.0 | +9.4 |
| Relative gain | +4.1% | +3.3% | +24.7% |
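The two gain rows follow directly from the accuracy rows: the absolute gain is the difference in accuracy, and the relative gain divides that difference by the baseline accuracy. A short check, using a hypothetical helper `gains`:

```python
def gains(baseline, improved):
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    absolute = improved - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)

# (UI-TARS-72B accuracy, 7B model accuracy) taken from the table above.
scores = {
    "ScreenSpot": (88.4, 92.0),
    "ScreenSpot-v2": (90.3, 93.3),
    "ScreenSpot-Pro": (38.1, 47.5),
}
for name, (baseline, improved) in scores.items():
    absolute, relative = gains(baseline, improved)
    print(f"{name}: +{absolute} points, +{relative}% relative")
```

The relative gain is largest on ScreenSpot-Pro because the baseline accuracy there is much lower, so the same number of points represents a larger fraction.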
POINTS-GUI-G-8B further achieves SOTA or near-SOTA accuracy on ScreenSpot-v2 (95.7%), OSWorld-G (66.0%), and UI-Vision (49.9%). Ablation studies attribute incremental gains, in accuracy points, as follows: +18.7 from data engineering, +6.3 from vision encoder fine-tuning, +4.5 from resolution consistency, and +3.9 from RL optimization (Zhao et al., 6 Feb 2026).
7. Robustness, Limitations, and Future Directions
The continuous Gaussian reward formulation endows GUI-G² models with notable robustness to:
- Unseen Layouts: Continuous spatial uncertainty modeling enhances adaptation to novel GUI structures and icon arrangements.
- Element Scale Variation: Adaptive variance maintains reward informativeness across both tiny mobile icons and large desktop panels.
- Dense and Occluded Scenarios: Gaussian coverage rewards guide clicks toward ambiguous targets even under occlusion or high density.
Remaining limitations include semantic errors in icon recognition, which are not fully addressed by improved spatial reward shaping. Proposed extensions involve integrating temporal context for multi-step GUI sequences, joint grounding with planning for downstream action policies, leveraging auxiliary modality signals (e.g., DOM/AXTree), and adapting the core framework to 3D or VR interfaces (Zhao et al., 6 Feb 2026).
References
- GUI-G²: Gaussian Reward Modeling for GUI Grounding, Fei Tang et al. (Tang et al., 21 Jul 2025)
- POINTS-GUI-G: GUI-Grounding Journey (Zhao et al., 6 Feb 2026)