GUI-G²: Gaussian Reward Modeling for GUI Grounding

Updated 12 March 2026

GUI-G² is a spatial reasoning paradigm that maps natural language instructions to precise GUI coordinates through continuous Gaussian reward modeling.
It combines Gaussian point and coverage rewards with an adaptive variance, giving a denser training signal than binary hit-or-miss rewards.
Reported results show state-of-the-art accuracy on the ScreenSpot family of GUI benchmarks, together with smoother reinforcement learning convergence.

Graphical User Interface Gaussian Grounding (GUI-G $^2$ ) is a spatial reasoning and reward modeling paradigm for GUI grounding tasks, wherein natural language instructions are mapped to precise locations on a user interface for autonomous interaction. Unlike previous reinforcement learning (RL) approaches that rely on sparse, binary rewards, GUI-G $^2$ introduces continuous, geometry-aware Gaussian reward models better aligned with the spatial characteristics of human-computer interaction. The paradigm has driven state-of-the-art advancements in vision-language agents, facilitating robust, generalizable grounding across diverse interface types and resolutions (Tang et al., 21 Jul 2025, Zhao et al., 6 Feb 2026).

1. Motivation and Conceptual Advances

Traditional RL-based GUI grounders assign binary rewards based on whether a predicted point lies inside or outside a target element’s bounding box. This approach produces sparse signals that lack spatial sensitivity and fail to provide informative gradients for near-miss errors. GUI element interaction is fundamentally planar, not pointwise; human click data reveals a continuous, approximately Gaussian distribution centered on element centroids. GUI-G $^2$ operationalizes this by modeling target regions as two-dimensional Gaussian fields, so that the reward decays exponentially with the variance-scaled squared distance between the predicted action and the ground-truth location. This produces dense gradients across the interface, accelerates RL convergence, and substantially improves alignment with human spatial tolerance (Tang et al., 21 Jul 2025).

2. Core Reward Formulation

GUI-G $^2$ decomposes the reward into two synergistic measures:

Gaussian Point Reward: Let $\boldsymbol{\mu}_{gt} = (c_x^{gt},\,c_y^{gt})$ denote the bounding box center, with element-parameterized covariance

$\boldsymbol{\Sigma}_{gt} = \begin{pmatrix} \sigma_x^2 & 0 \\ 0 & \sigma_y^2 \end{pmatrix}.$

For a predicted center $\boldsymbol{\mu}_p = (c_x^p, c_y^p)$ ,

$R_{point} = \exp\left(-{\textstyle\frac{1}{2}} \left[ \frac{(c_x^p-c_x^{gt})^2}{\sigma_x^2} + \frac{(c_y^p-c_y^{gt})^2}{\sigma_y^2} \right]\right).$

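This yields the maximal reward of 1 at perfect alignment and decays smoothly with distance from the target. The decay is scaled separately along each axis by $\sigma_x$ and $\sigma_y$, so it is isotropic only when the two are equal.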
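As a minimal sketch of the point reward above, the following function evaluates $R_{point}$ for a single predicted center. The function name, argument layout, and example coordinates are illustrative assumptions, not taken from the authors' code.

```python
import math

def gaussian_point_reward(pred, gt_center, sigma_x, sigma_y):
    """R_point for a predicted center, given per-axis standard deviations."""
    dx = pred[0] - gt_center[0]
    dy = pred[1] - gt_center[1]
    # Squared offset, with each axis scaled by its own variance.
    z = (dx * dx) / (sigma_x ** 2) + (dy * dy) / (sigma_y ** 2)
    return math.exp(-0.5 * z)

# Hypothetical element centered at (200, 230) with sigma_x=100, sigma_y=30.
center = (200.0, 230.0)
print(gaussian_point_reward((200.0, 230.0), center, 100.0, 30.0))  # exact hit
print(gaussian_point_reward((230.0, 230.0), center, 100.0, 30.0))  # 30 px off in x
print(gaussian_point_reward((200.0, 260.0), center, 100.0, 30.0))  # 30 px off in y
```

The same 30 px offset is penalized more along y than along x here, because the hypothetical element is narrower vertically.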

Gaussian Coverage Reward: To account for both centrality and coverage, the Bhattacharyya coefficient is used to measure the overlap of predicted and ground-truth Gaussians:

$R_{coverage} = \exp\left( -\frac{1}{8} (\boldsymbol{\mu}_p - \boldsymbol{\mu}_{gt})^\top \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_p - \boldsymbol{\mu}_{gt}) - \frac{1}{2} \ln \frac{ \det\boldsymbol{\Sigma} }{ \sqrt{ \det{\boldsymbol{\Sigma}_p} \det{ \boldsymbol{\Sigma}_{gt} } } } \right),$

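where $\boldsymbol{\Sigma}_p$ is the covariance of the predicted Gaussian and $\boldsymbol{\Sigma} = \tfrac{1}{2}(\boldsymbol{\Sigma}_p + \boldsymbol{\Sigma}_{gt})$ is the mean of the predicted and ground-truth covariances.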
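The coverage term can be sketched for the axis-aligned (diagonal covariance) case used in this section. The function below is an illustrative assumption about how one might evaluate the Bhattacharyya coefficient; the name and the tuple-based argument format are hypothetical.

```python
import math

def gaussian_coverage_reward(mu_p, var_p, mu_gt, var_gt):
    """Bhattacharyya coefficient of two axis-aligned 2D Gaussians.

    mu_* are (x, y) means; var_* are (sigma_x^2, sigma_y^2) variances.
    Assumes all variances are strictly positive.
    """
    # Mean covariance (diagonal entries only).
    var_m = [(vp + vg) / 2.0 for vp, vg in zip(var_p, var_gt)]
    # Distance term: (1/8) * d^T Sigma^{-1} d.
    maha = sum((p - g) ** 2 / vm for p, g, vm in zip(mu_p, mu_gt, var_m)) / 8.0
    # The determinant of a diagonal matrix is the product of its diagonal.
    det_m = var_m[0] * var_m[1]
    det_p = var_p[0] * var_p[1]
    det_gt = var_gt[0] * var_gt[1]
    shape = 0.5 * math.log(det_m / math.sqrt(det_p * det_gt))
    return math.exp(-(maha + shape))

gt_mu, gt_var = (200.0, 230.0), (10000.0, 900.0)  # hypothetical target
print(gaussian_coverage_reward(gt_mu, gt_var, gt_mu, gt_var))             # identical Gaussians
print(gaussian_coverage_reward(gt_mu, (2500.0, 225.0), gt_mu, gt_var))    # same center, half size
print(gaussian_coverage_reward((260.0, 230.0), gt_var, gt_mu, gt_var))    # same size, shifted
```

Unlike the point reward, this term also drops when the predicted extent differs from the target extent, even if the centers coincide.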

Adaptive Variance Mechanism: To maintain scale-invariant spatial tolerance, variances are set adaptively:

$\sigma_x = \alpha (x_2 - x_1)\,,\qquad \sigma_y = \alpha (y_2 - y_1),$

where $[x_1, y_1, x_2, y_2]$ is the ground-truth bounding box and $\alpha = 0.5$ is chosen empirically.
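To illustrate why this rule gives scale-invariant tolerance, the sketch below compares the point reward for a miss of one quarter of the element width on a small and a large element. The helper names and box coordinates are hypothetical; only the rule $\sigma = \alpha \cdot \text{size}$ and $\alpha = 0.5$ come from the text.

```python
import math

ALPHA = 0.5  # value quoted in the text

def adaptive_sigma(box, alpha=ALPHA):
    """Per-axis standard deviations from an [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = box
    return alpha * (x2 - x1), alpha * (y2 - y1)

def point_reward_at_offset(box, frac_x):
    """Point reward for a prediction shifted by frac_x * width from the center."""
    sigma_x, _ = adaptive_sigma(box)
    dx = frac_x * (box[2] - box[0])
    return math.exp(-0.5 * (dx / sigma_x) ** 2)

small_icon = [10, 10, 34, 34]      # 24 px wide, hypothetical
large_panel = [0, 0, 1200, 800]    # 1200 px wide, hypothetical
print(point_reward_at_offset(small_icon, 0.25))
print(point_reward_at_offset(large_panel, 0.25))
```

Both calls print the same value, since the offset and the standard deviation scale together with the element width.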

Composite Reward:

$R_{\mathrm{GUI}-G^2} = \nu R_{point} + \gamma R_{coverage},$

with $\nu = \gamma = 1.0$ for balanced weighting.

Ablations confirm these components jointly produce the most stable and accurate models in GUI grounding benchmarks (Tang et al., 21 Jul 2025).
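The pieces of this section can be combined into one small sketch of the composite reward. It assumes the predicted Gaussian is derived from a predicted box with the same adaptive-variance rule as the ground truth; that choice, the function names, and the example boxes are assumptions for illustration rather than details confirmed by the text.

```python
import math

def box_to_gaussian(box, alpha=0.5):
    """Center and diagonal variances of an [x1, y1, x2, y2] box (non-degenerate)."""
    x1, y1, x2, y2 = box
    mu = ((x1 + x2) / 2.0, (y1 + y2) / 2.0)
    var = ((alpha * (x2 - x1)) ** 2, (alpha * (y2 - y1)) ** 2)
    return mu, var

def gui_g2_reward(pred_box, gt_box, nu=1.0, gamma=1.0, alpha=0.5):
    """Composite reward nu * R_point + gamma * R_coverage."""
    mu_p, var_p = box_to_gaussian(pred_box, alpha)
    mu_gt, var_gt = box_to_gaussian(gt_box, alpha)
    d2 = [(p - g) ** 2 for p, g in zip(mu_p, mu_gt)]

    # Point term: uses the ground-truth variances only.
    r_point = math.exp(-0.5 * sum(d / v for d, v in zip(d2, var_gt)))

    # Coverage term: Bhattacharyya coefficient with the mean covariance.
    var_m = [(a + b) / 2.0 for a, b in zip(var_p, var_gt)]
    maha = sum(d / v for d, v in zip(d2, var_m)) / 8.0
    det_m = var_m[0] * var_m[1]
    det_prod = var_p[0] * var_p[1] * var_gt[0] * var_gt[1]
    shape = 0.5 * math.log(det_m / math.sqrt(det_prod))
    r_coverage = math.exp(-(maha + shape))

    return nu * r_point + gamma * r_coverage

gt = [100, 200, 300, 260]  # hypothetical ground-truth box
print(gui_g2_reward(gt, gt))                    # perfect prediction
print(gui_g2_reward([120, 205, 320, 265], gt))  # slightly shifted box
print(gui_g2_reward([500, 600, 700, 660], gt))  # far-away box
```

With $\nu = \gamma = 1.0$ the reward ranges between 0 and 2, and it stays non-zero for near misses that a binary reward would score as 0.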

3. Model Architectures and Learning Pipeline

The GUI-G $^2$ framework has been instantiated in multiple high-performance models, prominently including POINTS-GUI-G-8B (Zhao et al., 6 Feb 2026):

Vision Encoder: Qwen2-VL-NaViT backbone with 32 transformer layers and patch size $14\times14$ .
Projector / Cross-Modal Adapter: Linear mapping and cross-attention layers synchronize vision and language modalities.
LLM: Qwen3-8B (36 layers, hidden size 4096), autoregressively generating coordinate strings.
Tokenization and Decoding: Both (x, y) points and (x₀, y₀, x₁, y₁) boxes are supported, as normalized JSON tuples.

The input (image, instruction) pair is tokenized, encoded jointly via cross-modal transformers, and decoded via autoregressive coordinate prediction.
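As a sketch of the decoding step, the function below turns a generated coordinate string into pixel coordinates. The exact serialization is not specified above, so the bracketed JSON array format, the function name, and the validation rules are all assumptions made for illustration.

```python
import json

def decode_prediction(text, image_width, image_height):
    """Parse a normalized point or box and scale it to pixel coordinates.

    Assumed formats (hypothetical): "[x, y]" or "[x0, y0, x1, y1]",
    with every value in [0, 1].
    """
    values = json.loads(text)
    if not isinstance(values, list) or len(values) not in (2, 4):
        raise ValueError("expected a list of 2 (point) or 4 (box) numbers")
    if not all(isinstance(v, (int, float)) and 0.0 <= v <= 1.0 for v in values):
        raise ValueError("coordinates must be numbers in [0, 1]")
    # x values scale with the width, y values with the height.
    scales = [image_width, image_height] * (len(values) // 2)
    return tuple(v * s for v, s in zip(values, scales))

print(decode_prediction("[0.5, 0.25]", 1920, 1080))
print(decode_prediction("[0.25, 0.5, 0.75, 1.0]", 800, 600))
```

Keeping outputs normalized means the same decoding works at any screenshot resolution; only the final scaling depends on the image size.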

4. Data Engineering and Task Curation

GUI grounding datasets are characterized by heterogeneous annotation formats, spatial scales, and annotation noise. The POINTS-GUI-G pipeline (Zhao et al., 6 Feb 2026) standardizes these through:

Format Unification: All tasks are recast as center-point or bounding-box localization, with coordinates normalized to $[0,1]$ .
Noise Reduction: OmniParser-v2 extracts candidate elements; coverage filtering removes imprecise samples.
Complexity Enhancement: Layout entropy measures (combining 1D projection and 2D grid entropies) partition data into Easy/Medium/Hard tiers. The data are then augmented synthetically (e.g., HTML renderings, overlaying windows) to increase representational diversity.

This disciplined data engineering, particularly entropy-based complexity stratification and curated augmentation, substantially boosts model robustness and accounts for a gain of more than 10 accuracy points (Zhao et al., 6 Feb 2026).
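The text does not give the exact entropy definition, so the following is only one plausible sketch of a layout entropy score: element centers are binned along each axis and on a 2D grid, and the Shannon entropies are averaged. The bin count, the averaging rule, and the tier thresholds are all hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(counts):
    """Shannon entropy in bits of a collection of non-negative counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

def layout_entropy(centers, bins=4):
    """centers: normalized (x, y) element centers, each value in [0, 1]."""
    def bin_index(v):
        return min(int(v * bins), bins - 1)

    x_counts = Counter(bin_index(x) for x, _ in centers)
    y_counts = Counter(bin_index(y) for _, y in centers)
    grid_counts = Counter((bin_index(x), bin_index(y)) for x, y in centers)
    h_projection = shannon_entropy(x_counts.values()) + shannon_entropy(y_counts.values())
    h_grid = shannon_entropy(grid_counts.values())
    return 0.5 * (h_projection + h_grid)  # assumed way of combining the two

def complexity_tier(score, easy_max=1.0, medium_max=2.5):
    """Map a score to a tier; thresholds are placeholders, not from the source."""
    if score <= easy_max:
        return "Easy"
    if score <= medium_max:
        return "Medium"
    return "Hard"

clustered = [(0.1, 0.1)] * 5
corners = [(0.1, 0.1), (0.9, 0.1), (0.1, 0.9), (0.9, 0.9)]
print(layout_entropy(clustered), complexity_tier(layout_entropy(clustered)))
print(layout_entropy(corners, bins=2), complexity_tier(layout_entropy(corners, bins=2)))
```

Screens whose elements pile into one region score low, while screens with elements spread across many cells score high.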

5. Reinforcement Learning Formulation

GUI-G $^2$ employs Group Relative Policy Optimization (GRPO), a PPO variant, to maximize the composite Gaussian reward.

Multi-Rollout Normalized Advantage: For each instruction, $N$ rollouts $\tau_1, \dots, \tau_N$ are sampled; the advantage of each rollout is normalized against the other rollouts in the same group:

$A_i = \frac{R_{total}(\tau_i) - \mathrm{mean}_j\,R_{total}(\tau_j)}{\mathrm{std}_j\,R_{total}(\tau_j)}.$
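A minimal sketch of this normalization for one group of rollouts follows. The small `eps` term and the use of the population standard deviation are assumptions added to keep the example well defined when all rewards in a group are equal; neither is stated in the formula above.

```python
import statistics

def group_normalized_advantages(rewards, eps=1e-8):
    """Advantage of each rollout relative to its own group of rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group (assumed)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical composite rewards for N = 4 rollouts of one instruction.
rewards = [0.2, 1.1, 1.9, 0.8]
print(group_normalized_advantages(rewards))
```

Because Gaussian rewards are continuous, rollouts in a group rarely tie, so most groups yield a non-zero learning signal. Under binary rewards, a group in which every rollout misses has zero variance and contributes nothing.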

RL Objective:

$\mathcal{J} = \mathbb{E} \left[ \sum_{t} \min( r_t A_t, \mathrm{clip}(r_t, 1-\epsilon, 1+\epsilon)A_t ) \right] - \beta\,D_{\mathrm{KL}}\left[\pi_\theta \| \pi_{\mathrm{ref}} \right],$

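where $r_t$ is the token-level probability ratio between the current policy and the policy that generated the rollout, $A_t$ is the normalized advantage of the rollout that token $t$ belongs to, and $\beta$ weights the KL penalty against the reference policy. Typical hyperparameters are a learning rate of $1\times10^{-6}$ , a batch size of 8, and a KL penalty of $\beta = 0.04$ .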
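The clipped objective can be sketched for a single rollout as follows. The clipping range `epsilon=0.2` is an assumed value (the text does not state it), the KL term is passed in as a precomputed estimate, and the function name is hypothetical.

```python
import math

def rollout_objective(logp_new, logp_old, advantage, kl=0.0, epsilon=0.2, beta=0.04):
    """Clipped surrogate for one rollout, minus a KL penalty.

    logp_new / logp_old: per-token log-probabilities under the current
    policy and the policy that sampled the rollout.
    advantage: the group-normalized advantage shared by all tokens.
    """
    total = 0.0
    for new, old in zip(logp_new, logp_old):
        ratio = math.exp(new - old)
        clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        # Pessimistic choice between the raw and clipped terms.
        total += min(ratio * advantage, clipped * advantage)
    return total - beta * kl

# Hypothetical two-token rollout with a positive advantage.
print(rollout_objective([-1.0, -2.0], [-1.5, -2.0], advantage=1.0))
```

With a positive advantage the gain from raising a token's probability is capped at `1 + epsilon`, while with a negative advantage the unclipped, more pessimistic term is kept.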

When combined with rich, dense Gaussian rewards, this setup yields smoother, monotonic convergence, in contrast to the oscillatory and brittle behavior observed under binary rewards. Convergence analysis demonstrates that dense gradients reduce the average error in center prediction from ~290 px to ~150 px during early training, enhancing both sample and compute efficiency (Tang et al., 21 Jul 2025).

6. Empirical Results and Comparative Evaluation

GUI-G $^2$ models establish new benchmarks for GUI grounding accuracy, surpassing prior state-of-the-art models including UI-TARS-72B despite having roughly one-tenth the parameters. Table 1 summarizes representative results on three benchmarks (Tang et al., 21 Jul 2025):

Model	ScreenSpot	ScreenSpot-v2	ScreenSpot-Pro
UI-TARS-72B	88.4%	90.3%	38.1%
GUI-G $^2$ -7B	92.0%	93.3%	47.5%
Absolute gain	+3.6 pp	+3.0 pp	+9.4 pp
Relative gain	+4.1%	+3.3%	+24.7%
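The gain rows can be reproduced from the two accuracy rows. The short script below is included only as an arithmetic check: the absolute gain is a difference in percentage points (pp), and the relative gain divides that difference by the UI-TARS-72B score.

```python
baseline = {"ScreenSpot": 88.4, "ScreenSpot-v2": 90.3, "ScreenSpot-Pro": 38.1}  # UI-TARS-72B
gui_g2 = {"ScreenSpot": 92.0, "ScreenSpot-v2": 93.3, "ScreenSpot-Pro": 47.5}    # GUI-G^2-7B

gains = {}
for name, base in baseline.items():
    absolute = gui_g2[name] - base        # percentage points
    relative = 100.0 * absolute / base    # percent of the baseline score
    gains[name] = (absolute, relative)
    print(f"{name}: +{absolute:.1f} pp absolute, +{relative:.1f}% relative")
```

The large relative gain on ScreenSpot-Pro reflects its low baseline: 9.4 points on top of 38.1 is about a quarter of the baseline score.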

POINTS-GUI-G-8B further achieves SOTA or near-SOTA on ScreenSpot-v2 (95.7%), OSWorld-G (66.0%), and UI-Vision (49.9%). Ablation studies attribute incremental gains as follows: +18.7 from data engineering, +6.3 from vision encoder fine-tuning, +4.5 from resolution consistency, and +3.9 from RL optimization (Zhao et al., 6 Feb 2026).

7. Robustness, Limitations, and Future Directions

The continuous Gaussian reward formulation endows GUI-G $^2$ models with notable robustness to:

Unseen Layouts: Continuous spatial uncertainty modeling enhances adaptation to novel GUI structures and icon arrangements.
Element Scale Variation: Adaptive variance maintains reward informativeness across both tiny mobile icons and large desktop panels.
Dense and Occluded Scenarios: Gaussian coverage rewards guide clicks toward ambiguous targets even under occlusion or high density.

Remaining limitations include semantic errors in icon recognition, which are not fully addressed by improved spatial reward shaping. Proposed extensions involve integrating temporal context for multi-step GUI sequences, joint grounding with planning for downstream action policies, leveraging auxiliary modality signals (e.g., DOM/AXTree), and adapting the core framework to 3D or VR interfaces (Zhao et al., 6 Feb 2026).

References

GUI-G $^2$: Gaussian Reward Modeling for GUI Grounding, Fei Tang et al. (Tang et al., 21 Jul 2025)
POINTS-GUI-G: GUI-Grounding Journey (Zhao et al., 6 Feb 2026)
