Gaussian Rewards in GUI Interaction
- The paper introduces Gaussian rewards for GUI interaction by modeling elements as 2D Gaussian distributions to replace sparse, binary feedback.
- It combines Gaussian point and coverage rewards to enhance element localization through precise spatial alignment and overlap measurement.
- Adaptive variance mechanisms inspired by Fitts’ Law adjust to element dimensions, leading to improved training efficiency and generalizability.
Graphical User Interface (GUI) interaction agents increasingly rely on precise spatial grounding techniques to translate natural language instructions into actionable interface locations. Central to recent advances is the modeling of GUI elements as continuous spatial probability distributions using two-dimensional Gaussians. This paradigm enables the replacement of sparse, binary rewards with dense, adaptive signals that robustly guide reinforcement and attention-based learning frameworks. Such "Gaussian rewards"—inspired by the statistical patterns of human clicking behavior—shape both policy optimization and attention calibration for improved element localization, generalizability, and training efficiency across diverse interface layouts.
1. Foundations of Gaussian Reward Modeling in GUI Grounding
Traditional GUI grounding algorithms employed binary reward structures that treated element selection as hit-or-miss: successful interactions yielded positive feedback if the predicted point resided within a target bounding box, otherwise zero. This approach produces extremely sparse gradients and neglects the spatial uncertainty inherent in human interaction. Motivated by empirical findings that human pointing endpoints conform approximately to Gaussian distributions centered on target elements, Gaussian rewards encode a principled measure of spatial proximity and alignment.
A 2D Gaussian representing an element with bounding box $(x_{\min}, y_{\min}, x_{\max}, y_{\max})$ is defined by its centroid $\boldsymbol{\mu} = (\mu_x, \mu_y)$ and diagonal covariance matrix $\boldsymbol{\Sigma} = \mathrm{diag}(\sigma_x^2, \sigma_y^2)$, with density:

$$\mathcal{N}(\mathbf{p}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{2\pi\sigma_x\sigma_y}\exp\!\left(-\frac{(p_x-\mu_x)^2}{2\sigma_x^2}-\frac{(p_y-\mu_y)^2}{2\sigma_y^2}\right)$$
This formalism forms the basis for continuous, differentiable reward signals for localized GUI interaction tasks (Tang et al., 21 Jul 2025).
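As a concrete sketch of this construction (the function names and the box-to-spread convention, box half-extent equal to two standard deviations, are illustrative assumptions rather than either paper's code):

```python
import math

def gaussian_from_bbox(x_min, y_min, x_max, y_max):
    """Centroid and diagonal standard deviations for a box's 2D Gaussian.

    Assumes the 2-sigma convention: the box half-extent spans two standard
    deviations, so sigma_x = w / 4 and sigma_y = h / 4.
    """
    mu = ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)
    sigma = ((x_max - x_min) / 4.0, (y_max - y_min) / 4.0)
    return mu, sigma

def gaussian_density(p, mu, sigma):
    """2D Gaussian density with diagonal covariance diag(sx^2, sy^2)."""
    sx, sy = sigma
    dx, dy = p[0] - mu[0], p[1] - mu[1]
    return math.exp(-0.5 * ((dx / sx) ** 2 + (dy / sy) ** 2)) / (2.0 * math.pi * sx * sy)
```

The density peaks at the element's center and decays smoothly toward its edges, which is exactly what makes it usable as a dense reward surface.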
2. Gaussian Point and Coverage Rewards
Gaussian rewards are operationalized via two synergistic mechanisms:
- Gaussian Point Reward: Measures the precision of the predicted click center $\hat{\mathbf{p}}$ relative to the ground-truth Gaussian. The reward is the value of the ground-truth density at $\hat{\mathbf{p}}$:

$$R_{\text{point}} = \mathcal{N}(\hat{\mathbf{p}};\, \boldsymbol{\mu}, \boldsymbol{\Sigma})$$
- Gaussian Coverage Reward: Measures spatial alignment via the Bhattacharyya coefficient, quantifying overlap between the predicted Gaussian $\mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)$ and the ground-truth Gaussian $\mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$:

$$R_{\text{coverage}} = \int \sqrt{\mathcal{N}(\mathbf{x};\boldsymbol{\mu}_1,\boldsymbol{\Sigma}_1)\,\mathcal{N}(\mathbf{x};\boldsymbol{\mu}_2,\boldsymbol{\Sigma}_2)}\, d\mathbf{x}$$

This overlap admits a closed form:

$$R_{\text{coverage}} = \exp\!\left(-\tfrac{1}{8}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)^{\top}\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2) - \tfrac{1}{2}\ln\frac{\det\boldsymbol{\Sigma}}{\sqrt{\det\boldsymbol{\Sigma}_1\,\det\boldsymbol{\Sigma}_2}}\right)$$

with $\boldsymbol{\Sigma} = \tfrac{1}{2}(\boldsymbol{\Sigma}_1+\boldsymbol{\Sigma}_2)$ (Tang et al., 21 Jul 2025).
These dual signals enable both fine-grained localization and robust alignment with target regions.
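A minimal sketch of both rewards for axis-aligned (diagonal-covariance) Gaussians; as an assumption, the point reward is rescaled to peak at 1, and the function names are illustrative:

```python
import math

def point_reward(p_hat, mu, sigma):
    """Gaussian point reward: the ground-truth Gaussian evaluated at the
    predicted click, rescaled here to peak at 1 (an assumption)."""
    dx, dy = p_hat[0] - mu[0], p_hat[1] - mu[1]
    sx, sy = sigma
    return math.exp(-0.5 * ((dx / sx) ** 2 + (dy / sy) ** 2))

def coverage_reward(mu1, sigma1, mu2, sigma2):
    """Bhattacharyya coefficient between two axis-aligned 2D Gaussians."""
    # Averaged covariance Sigma = (Sigma1 + Sigma2) / 2, diagonal case.
    vx = (sigma1[0] ** 2 + sigma2[0] ** 2) / 2.0
    vy = (sigma1[1] ** 2 + sigma2[1] ** 2) / 2.0
    dx, dy = mu1[0] - mu2[0], mu1[1] - mu2[1]
    # Bhattacharyya distance: Mahalanobis term + log-determinant term.
    maha = (dx * dx / vx + dy * dy / vy) / 8.0
    log_det = 0.5 * math.log(
        (vx * vy) / (sigma1[0] * sigma1[1] * sigma2[0] * sigma2[1])
    )
    return math.exp(-(maha + log_det))
```

For identical predicted and ground-truth Gaussians the coefficient is exactly 1, and it decays smoothly as the centers or spreads diverge, which is what provides a usable gradient even for near-miss predictions.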
3. Adaptive Variance Mechanisms and Fitts’ Law Connections
Fixed variance values do not account for the heterogeneity in GUI element sizes. Both GUI-G and V2P frameworks implement adaptive variance mechanisms that tie the Gaussian spread to the element's width and height:
- GUI-G: $\sigma_x = w/(2k)$, $\sigma_y = h/(2k)$, with a default $k = 2$ (the "2σ principle").
- V2P: $\sigma_x \propto w$, $\sigma_y \propto h$, with the proportionality constant tuned for challenging benchmarks (Chen et al., 11 Jan 2026).
This scaling mimics human motor variability modeled in Fitts’ Law, where endpoint distributions are roughly Gaussian with standard deviation proportional to target width or height, thus capturing the interaction uncertainty intrinsic to GUI actions. For larger elements, Gaussians become broader, whereas smaller targets induce sharper peaks (Chen et al., 11 Jan 2026).
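The size-adaptive spread can be sketched as follows, assuming the 2σ reading of the GUI-G principle and an illustrative proportional factor for V2P (neither the function name nor the default factor reproduces the papers' tuned settings):

```python
def adaptive_sigma(w, h, scheme="gui-g", gamma=0.25):
    """Tie the Gaussian spread to element size (Fitts'-Law-style scaling).

    "gui-g" follows the 2-sigma convention (box half-extent equals two
    standard deviations, so sigma = w / 4); "v2p" uses a proportional
    factor gamma whose default here is illustrative, not the paper's value.
    """
    if scheme == "gui-g":
        return w / 4.0, h / 4.0
    return gamma * w, gamma * h
```

Under either scheme a wide button yields a broad, forgiving Gaussian while a small icon yields a sharp peak, mirroring the endpoint variability Fitts' Law predicts.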
4. Reinforcement and Attention Optimization Frameworks
Gaussian rewards can be integrated into reinforcement learning and attention calibration pipelines:
- Policy Gradient Formulation (GUI-G): Rewards drive policy updates via Group Relative Policy Optimization (GRPO), a critic-free variant of PPO. Rewards are computed for a group of candidate responses, normalized by the group's mean and standard deviation to form advantages, and used in the clipped policy objective with additional KL regularization.
- Attention Calibration (V2P): GUI interaction images are split into non-overlapping patches. 2D Gaussian mass per patch is computed via CDF integration. The model's attention vector is aligned to the normalized Gaussian prior using a KL-divergence loss. Simultaneously, a suppression attention mechanism penalizes background regions, forming a “valley” surrounding the target region and reducing attention drift.
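The group-relative normalization step of GRPO can be sketched as follows (a minimal illustration with an invented function name; the clipped surrogate objective and KL penalty are omitted):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize a group of candidate-response
    rewards by the group mean and (population) standard deviation."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Because advantages are relative within each sampled group, no learned value function is needed to baseline the Gaussian rewards.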
The total V2P objective is:

$$\mathcal{L}_{\text{V2P}} = \mathcal{L}_{\text{NTP}} + \lambda_{1}\,\mathcal{L}_{\text{attn}} + \lambda_{2}\,\mathcal{L}_{\text{sup}}$$

where $\mathcal{L}_{\text{NTP}}$ is the standard next-token prediction loss, $\mathcal{L}_{\text{attn}}$ the KL attention-alignment term, $\mathcal{L}_{\text{sup}}$ the suppression term, and $\lambda_{1}, \lambda_{2}$ weighting coefficients (Chen et al., 11 Jan 2026).
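The attention-calibration ingredients can be sketched as below, assuming axis-aligned rectangular patches and the KL direction KL(prior || attention); both the direction and the helper names are assumptions about V2P's exact formulation:

```python
import math

def _cdf(x, mu, sigma):
    """1D Gaussian CDF via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def patch_gaussian_mass(patches, mu, sigma):
    """Integrate a diagonal 2D Gaussian over each patch (x0, y0, x1, y1)
    using the 1D CDF, then normalize into a prior over patches."""
    masses = [
        (_cdf(x1, mu[0], sigma[0]) - _cdf(x0, mu[0], sigma[0]))
        * (_cdf(y1, mu[1], sigma[1]) - _cdf(y0, mu[1], sigma[1]))
        for (x0, y0, x1, y1) in patches
    ]
    total = sum(masses)
    return [m / total for m in masses]

def kl_alignment_loss(attention, prior, eps=1e-8):
    """KL(prior || attention): penalizes attention mass that strays from
    the Gaussian prior over patches."""
    return sum(
        p * math.log((p + eps) / (a + eps)) for p, a in zip(prior, attention)
    )
```

The loss is zero when the model's patch attention matches the Gaussian prior exactly and grows as attention drifts onto background patches.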
5. Empirical Evaluation and Benchmarks
Robust evaluation is performed on multiple GUI grounding datasets:
- ScreenSpot and ScreenSpot-v2 (diverse device types and layouts)
- ScreenSpot-Pro (high-resolution, professional interfaces)
Metrics center on the percentage of predictions whose predicted center falls within the ground-truth bounding box. Results indicate:
- GUI-G: strong accuracy across ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro (Tang et al., 21 Jul 2025)
- V2P-7B: strong accuracy on ScreenSpot-v2 and ScreenSpot-Pro (Chen et al., 11 Jan 2026)
Relative gains are substantial compared to prior methods:
- GUI-G improves over UI-TARS-72B on ScreenSpot-Pro despite using fewer parameters (Tang et al., 21 Jul 2025).
- V2P achieves a 5-point accuracy gain from its Fitts-Law-guided Gaussian peak on challenging benchmarks (Chen et al., 11 Jan 2026).
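The hit-rate metric used across these benchmarks is straightforward to sketch (function name illustrative, boundary hits counted as inclusive):

```python
def grounding_accuracy(predictions, boxes):
    """Fraction of predicted click points falling inside their ground-truth
    box (x_min, y_min, x_max, y_max); edge hits count as inside."""
    hits = sum(
        x0 <= px <= x1 and y0 <= py <= y1
        for (px, py), (x0, y0, x1, y1) in zip(predictions, boxes)
    )
    return hits / len(boxes)
```

Note the asymmetry this highlights: evaluation remains binary (inside or outside the box) even though training uses the dense Gaussian signal.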
6. Ablation Studies and Robustness Analysis
Comprehensive ablations isolate the contributions of Gaussian reward components and adaptive mechanisms. Key findings include:
| Ablation Variant | Accuracy (%) | Interpretation |
|---|---|---|
| Full GUI-G (point + coverage) | 93.3 | Best accuracy; both signals needed for precision & overlap |
| Point only | 90.2 | Localizes but weaker regional alignment |
| Coverage only | 92.1 | Strong overlap but less center precision |
| Restrict Gaussian to inside-box | 88.4 | Loses early-training guidance |
| Fixed σ | 87.8 | Under-constrains large elements / over-constrains small icons |
| 2σ principle (adaptive σ) | 92.9–93.3 | Optimal adaptive variance |
| V2P w/o Gaussian peak | 47.5 | Accuracy loss on ScreenSpot-Pro relative to full V2P |
| V2P w/o peak & suppression | 44.3 | Larger accuracy loss relative to full V2P |
Continuous, everywhere-applied Gaussian rewards yield smoother, monotonic convergence and greater generalizability to unseen layouts and widget densities. Application of chain-of-thought prompting degrades grounding performance, likely due to distraction from visual localization (Tang et al., 21 Jul 2025).
7. Significance, Implications, and Related Paradigms
Gaussian rewards provide a paradigm shift from sparse classification to dense, geometry-aware reward modeling in GUI spatial reasoning. This enables more informative gradient signals across the interaction plane, accelerating training and improving robustness to layout shifts and semantic variations. Empirical results demonstrate superiority to both distance-only and binary reward schemes.
A plausible implication is that further integration of probabilistic spatial priors—especially those grounded in human-computer interaction theory—may advance generalization in future GUI agents. As V2P and GUI-G frameworks substantiate, Fitts' Law-inspired adaptive variance and patch-wise normalization are foundational to overcoming attention drift and achieving robust, precise interface localization in real-world deployments (Tang et al., 21 Jul 2025, Chen et al., 11 Jan 2026).