
Gaussian Rewards for GUI Interaction

Updated 22 January 2026
  • The paper introduces Gaussian rewards for GUI interaction by modeling elements as 2D Gaussian distributions to replace sparse, binary feedback.
  • It combines Gaussian point and coverage rewards to enhance element localization through precise spatial alignment and overlap measurement.
  • Adaptive variance mechanisms inspired by Fitts’ Law adjust to element dimensions, leading to improved training efficiency and generalizability.

Gaussian Rewards for GUI Interaction

Graphical User Interface (GUI) interaction agents increasingly rely on precise spatial grounding techniques to translate natural language instructions into actionable interface locations. Central to recent advances is the modeling of GUI elements as continuous spatial probability distributions using two-dimensional Gaussians. This paradigm enables the replacement of sparse, binary rewards with dense, adaptive signals that robustly guide reinforcement and attention-based learning frameworks. Such "Gaussian rewards"—inspired by the statistical patterns of human clicking behavior—shape both policy optimization and attention calibration for improved element localization, generalizability, and training efficiency across diverse interface layouts.

1. Foundations of Gaussian Reward Modeling in GUI Grounding

Traditional GUI grounding algorithms employed binary reward structures that treated element selection as hit-or-miss: successful interactions yielded positive feedback if the predicted point resided within a target bounding box, otherwise zero. This approach produces extremely sparse gradients and neglects the spatial uncertainty inherent in human interaction. Motivated by empirical findings that human pointing endpoints conform approximately to Gaussian distributions centered on target elements, Gaussian rewards encode a principled measure of spatial proximity and alignment.

A 2D Gaussian representing an element with bounding box $b = [x_1, y_1, x_2, y_2]$ is defined by its centroid $\mu = (c_x, c_y) = ((x_1 + x_2)/2, (y_1 + y_2)/2)$ and diagonal covariance matrix $\Sigma = \operatorname{diag}(\sigma_x^2, \sigma_y^2)$, with density:

$$N(x; \mu, \Sigma) = \frac{1}{2\pi\sqrt{|\Sigma|}} \exp\left(-\frac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)\right)$$

This formalism forms the basis for continuous, differentiable reward signals for localized GUI interaction tasks (Tang et al., 21 Jul 2025).
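
To make the formalism concrete, the following is a minimal NumPy sketch (ours, not code from either paper) that builds an element's Gaussian from its bounding box and evaluates the density at a point; the `alpha` scaling anticipates the adaptive variance discussed in Section 3:

```python
import numpy as np

def gaussian_from_bbox(bbox, alpha=0.5):
    """Build (mu, Sigma) for a bounding box [x1, y1, x2, y2].

    alpha ties the standard deviation to element size; alpha = 0.5
    corresponds to the "2-sigma" default described in Section 3.
    """
    x1, y1, x2, y2 = bbox
    mu = np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])         # centroid
    sigma = np.array([alpha * (x2 - x1), alpha * (y2 - y1)])  # per-axis std
    return mu, np.diag(sigma ** 2)                            # diagonal Sigma

def gaussian_density(x, mu, cov):
    """Evaluate N(x; mu, Sigma) for a 2D Gaussian."""
    diff = np.asarray(x, dtype=float) - mu
    norm = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)
```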

2. Gaussian Point and Coverage Rewards

Gaussian rewards are operationalized via two synergistic mechanisms:

  • Gaussian Point Reward: Measures the precision of the predicted click center $\mu_p$ relative to the ground-truth Gaussian. The reward is the unnormalized value of the ground-truth density at $\mu_p$, so a prediction exactly at the target centroid scores 1:

$$R_\text{point} = \exp\left(-\frac{(c_{x,p} - c_{x,\text{gt}})^2}{2\sigma_{x,\text{gt}}^2} - \frac{(c_{y,p} - c_{y,\text{gt}})^2}{2\sigma_{y,\text{gt}}^2}\right)$$

  • Gaussian Coverage Reward: Measures spatial alignment via the Bhattacharyya coefficient, which quantifies the overlap between the predicted and ground-truth Gaussian distributions:

$$BC(N_p, N_\text{gt}) = \int \sqrt{N(x; \mu_p, \Sigma_p)\, N(x; \mu_\text{gt}, \Sigma_\text{gt})}\, dx$$

This overlap admits a closed form:

$$R_\text{coverage} = \exp\left[-\frac{1}{8}(\mu_p - \mu_\text{gt})^\top \bar{\Sigma}^{-1}(\mu_p - \mu_\text{gt}) - \frac{1}{2}\ln\left(\frac{\det\bar{\Sigma}}{\sqrt{\det\Sigma_p \det\Sigma_\text{gt}}}\right)\right]$$

with $\bar{\Sigma} = (\Sigma_p + \Sigma_\text{gt})/2$ (Tang et al., 21 Jul 2025).

These dual signals enable both fine-grained localization and robust alignment with target regions.
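
Both rewards admit a direct implementation. The sketch below is ours, assuming the closed forms above; the weights $\nu$ and $\gamma$ follow the notation of Section 4, and the defaults of 1.0 are illustrative assumptions:

```python
import numpy as np

def point_reward(mu_p, mu_gt, sigma_gt):
    """Unnormalized Gaussian point reward; peaks at 1 when mu_p == mu_gt."""
    d = (np.asarray(mu_p, dtype=float) - mu_gt) / sigma_gt
    return np.exp(-0.5 * np.sum(d ** 2))

def coverage_reward(mu_p, cov_p, mu_gt, cov_gt):
    """Bhattacharyya coefficient between two 2D Gaussians (closed form)."""
    cov_bar = (cov_p + cov_gt) / 2.0
    diff = np.asarray(mu_p, dtype=float) - mu_gt
    maha = diff @ np.linalg.inv(cov_bar) @ diff / 8.0
    log_det = 0.5 * np.log(np.linalg.det(cov_bar)
                           / np.sqrt(np.linalg.det(cov_p) * np.linalg.det(cov_gt)))
    return np.exp(-maha - log_det)

def total_reward(mu_p, cov_p, mu_gt, cov_gt, nu=1.0, gamma=1.0):
    """Weighted combination R_total = nu * R_point + gamma * R_coverage."""
    sigma_gt = np.sqrt(np.diag(cov_gt))
    return (nu * point_reward(mu_p, mu_gt, sigma_gt)
            + gamma * coverage_reward(mu_p, cov_p, mu_gt, cov_gt))
```

Identical predicted and ground-truth Gaussians score 1 under both signals, and each reward decays smoothly as the prediction drifts, which is exactly the dense gradient structure the binary scheme lacks.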

3. Adaptive Variance Mechanisms and Fitts’ Law Connections

A fixed variance $\sigma$ does not account for the heterogeneity of GUI element sizes. Both the GUI-G² and V2P frameworks therefore implement adaptive variance mechanisms that tie the Gaussian spread to the element's width and height:

  • GUI-G²: $\sigma_x = \alpha(x_2 - x_1)$, $\sigma_y = \alpha(y_2 - y_1)$, with a default $\alpha = 0.5$ (the "2$\sigma$ principle").
  • V2P: $\sigma_x = w/\sigma_\text{factor}$, $\sigma_y = h/\sigma_\text{factor}$, with $\sigma_\text{factor} = 1.0$ found optimal on challenging benchmarks (Chen et al., 11 Jan 2026).

This scaling mimics human motor variability modeled in Fitts’ Law, where endpoint distributions are roughly Gaussian with standard deviation proportional to target width or height, thus capturing the interaction uncertainty intrinsic to GUI actions. For larger elements, Gaussians become broader, whereas smaller targets induce sharper peaks (Chen et al., 11 Jan 2026).
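
The two schedules are one-liners in practice; a minimal sketch with function names of our own choosing:

```python
def adaptive_sigma_gui_g2(bbox, alpha=0.5):
    """GUI-G2 schedule: std proportional to box size (alpha = 0.5 default)."""
    x1, y1, x2, y2 = bbox
    return alpha * (x2 - x1), alpha * (y2 - y1)

def adaptive_sigma_v2p(width, height, sigma_factor=1.0):
    """V2P schedule: std = size / sigma_factor (1.0 reported as optimal)."""
    return width / sigma_factor, height / sigma_factor
```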

4. Reinforcement and Attention Optimization Frameworks

Gaussian rewards can be integrated into reinforcement learning and attention calibration pipelines:

  • Policy Gradient Formulation (GUI-G²): The combined reward $R_\text{total} = \nu R_\text{point} + \gamma R_\text{coverage}$ drives Group Relative Policy Optimization (GRPO), a critic-free variant of PPO. Rewards are computed for a group of candidate responses, normalized by the group's mean and standard deviation, and used in the clipped policy objective with additional KL regularization.
  • Attention Calibration (V2P): GUI screenshots are split into $M$ non-overlapping patches, and the Gaussian mass of each patch is computed by CDF integration. The model's attention vector $a$ is aligned to the normalized Gaussian prior $p$ with a KL-divergence loss. Simultaneously, a suppression attention mechanism penalizes background regions, forming a "valley" around the target region and reducing attention drift.

The total V2P objective is:

$$\mathcal{L} = \mathcal{L}_\text{NTP} + \lambda_1\,\mathcal{L}_\text{SupAttn} + \lambda_2\,\mathcal{L}_\text{ActionAttn}$$

where $\mathcal{L}_\text{NTP}$ is the standard next-token prediction loss (Chen et al., 11 Jan 2026).
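
The group normalization and the patch-wise prior can both be sketched compactly. The code below is ours, not from either paper: the CDF integration mirrors the description above, while the KL direction and the epsilon smoothing are assumptions, and the suppression term is omitted:

```python
import numpy as np
from scipy.special import erf

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize rewards within a candidate group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def _phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / np.sqrt(2.0)))

def patch_gaussian_prior(patches, mu, sigma):
    """Integrate the target Gaussian over M non-overlapping patches.

    patches: (M, 4) array of [u1, v1, u2, v2] patch corners.
    Returns the normalized prior p over patches.
    """
    u1, v1, u2, v2 = np.asarray(patches, dtype=float).T
    mass = ((_phi((u2 - mu[0]) / sigma[0]) - _phi((u1 - mu[0]) / sigma[0]))
            * (_phi((v2 - mu[1]) / sigma[1]) - _phi((v1 - mu[1]) / sigma[1])))
    return mass / mass.sum()

def attention_alignment_loss(attention, prior, eps=1e-8):
    """KL(prior || attention): penalizes attention mass placed off-target."""
    p = np.asarray(prior, dtype=float) + eps
    a = np.asarray(attention, dtype=float) + eps
    return float(np.sum(p * np.log(p / a)))
```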

5. Empirical Evaluation and Benchmarks

Robust evaluation is performed on multiple GUI grounding datasets:

  • ScreenSpot and ScreenSpot-v2 (diverse device types and layouts)
  • ScreenSpot-Pro (high-resolution, professional interfaces)

Metrics center on the percentage of predicted click points that fall within the ground-truth bounding box. Results indicate:

  • GUI-G²: 92.0% (ScreenSpot), 93.3% (ScreenSpot-v2), 47.5% (ScreenSpot-Pro)
  • V2P-7B: 92.4% (ScreenSpot-v2), 52.5% (ScreenSpot-Pro)

Relative gains are substantial compared to prior methods:

  • GUI-G² improves +24.7% over UI-TARS-72B on ScreenSpot-Pro despite using 10× fewer parameters (Tang et al., 21 Jul 2025).
  • V2P achieves a 5-point accuracy gain from its Fitts'-Law-guided Gaussian peak on challenging benchmarks (Chen et al., 11 Jan 2026).
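
The reported metric itself is simple to express in code; a minimal sketch with hypothetical array interfaces:

```python
import numpy as np

def grounding_accuracy(pred_points, gt_boxes):
    """Fraction of predicted points inside their ground-truth boxes.

    pred_points: (N, 2) array of predicted (x, y) click centers.
    gt_boxes:    (N, 4) array of [x1, y1, x2, y2] target boxes.
    """
    x, y = pred_points[:, 0], pred_points[:, 1]
    x1, y1, x2, y2 = gt_boxes.T
    hits = (x >= x1) & (x <= x2) & (y >= y1) & (y <= y2)
    return float(hits.mean())
```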

6. Ablation Studies and Robustness Analysis

Comprehensive ablations isolate the contributions of Gaussian reward components and adaptive mechanisms. Key findings include:

| Ablation Variant | Accuracy (%) | Interpretation |
|---|---|---|
| Full GUI-G² (point + coverage) | 93.3 | Best accuracy; both signals needed for precision and overlap |
| Point only | 90.2 | Localizes well but weaker regional alignment |
| Coverage only | 92.1 | Strong overlap but less center precision |
| Gaussian restricted to inside-box | 88.4 | Loses early-training guidance |
| Fixed $\sigma$ | 87.8 | Under-constrains large elements, over-constrains small icons |
| 2$\sigma$ principle ($\alpha = 0.5$) | 92.9–93.3 | Optimal adaptive variance |
| V2P w/o Gaussian peak | 47.5 | −5 pt loss on ScreenSpot-Pro |
| V2P w/o peak & suppression | 44.3 | −8.2 pt loss |

Continuous, everywhere-applied Gaussian rewards yield smoother, monotonic convergence and greater generalizability to unseen layouts and widget densities. Application of chain-of-thought prompting degrades grounding performance, likely due to distraction from visual localization (Tang et al., 21 Jul 2025).

7. Significance and Outlook

Gaussian rewards provide a paradigm shift from sparse classification to dense, geometry-aware reward modeling in GUI spatial reasoning. This enables more informative gradient signals across the interaction plane, accelerating training and improving robustness to layout shifts and semantic variations. Empirical results demonstrate superiority to both distance-only and binary reward schemes.

A plausible implication is that further integration of probabilistic spatial priors, especially those grounded in human-computer interaction theory, may advance generalization in future GUI agents. As the V2P and GUI-G² frameworks substantiate, Fitts'-Law-inspired adaptive variance and patch-wise normalization are foundational to overcoming attention drift and achieving robust, precise interface localization in real-world deployments (Tang et al., 21 Jul 2025; Chen et al., 11 Jan 2026).

References

  • Tang et al., 21 Jul 2025.
  • Chen et al., 11 Jan 2026.
