
Gaussian Rewards for GUI Interaction

Updated 22 January 2026
  • The paper introduces Gaussian rewards for GUI interaction by modeling elements as 2D Gaussian distributions to replace sparse, binary feedback.
  • It combines Gaussian point and coverage rewards to enhance element localization through precise spatial alignment and overlap measurement.
  • Adaptive variance mechanisms inspired by Fitts’ Law adjust to element dimensions, leading to improved training efficiency and generalizability.

Gaussian Rewards for GUI Interaction

Graphical User Interface (GUI) interaction agents increasingly rely on precise spatial grounding techniques to translate natural language instructions into actionable interface locations. Central to recent advances is the modeling of GUI elements as continuous spatial probability distributions using two-dimensional Gaussians. This paradigm enables the replacement of sparse, binary rewards with dense, adaptive signals that robustly guide reinforcement and attention-based learning frameworks. Such "Gaussian rewards"—inspired by the statistical patterns of human clicking behavior—shape both policy optimization and attention calibration for improved element localization, generalizability, and training efficiency across diverse interface layouts.

1. Foundations of Gaussian Reward Modeling in GUI Grounding

Traditional GUI grounding algorithms employed binary reward structures that treated element selection as hit-or-miss: successful interactions yielded positive feedback if the predicted point resided within a target bounding box, otherwise zero. This approach produces extremely sparse gradients and neglects the spatial uncertainty inherent in human interaction. Motivated by empirical findings that human pointing endpoints conform approximately to Gaussian distributions centered on target elements, Gaussian rewards encode a principled measure of spatial proximity and alignment.

A 2D Gaussian representing an element with bounding box $b = [x_1, y_1, x_2, y_2]$ is defined by its centroid $\mu = (c_x, c_y) = ((x_1 + x_2)/2, (y_1 + y_2)/2)$ and diagonal covariance matrix $\Sigma = \operatorname{diag}(\sigma_x^2, \sigma_y^2)$, with density:

$$N(x; \mu, \Sigma) = \frac{1}{2\pi\sqrt{|\Sigma|}} \exp\left(-\frac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)\right)$$

This formalism forms the basis for continuous, differentiable reward signals for localized GUI interaction tasks (Tang et al., 21 Jul 2025).
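
To make the formalism concrete, the following is a minimal NumPy sketch (ours, not code from either paper) that builds an element's Gaussian from its bounding box and evaluates the density at a point; the `alpha` scaling anticipates the adaptive variance discussed in Section 3:

```python
import numpy as np

def gaussian_from_bbox(bbox, alpha=0.5):
    """Build (mu, Sigma) for a bounding box [x1, y1, x2, y2].

    alpha ties the standard deviation to element size; alpha = 0.5
    corresponds to the "2-sigma" default described in Section 3.
    """
    x1, y1, x2, y2 = bbox
    mu = np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])         # centroid
    sigma = np.array([alpha * (x2 - x1), alpha * (y2 - y1)])  # per-axis std
    return mu, np.diag(sigma ** 2)                            # diagonal Sigma

def gaussian_density(x, mu, cov):
    """Evaluate N(x; mu, Sigma) for a 2D Gaussian."""
    diff = np.asarray(x, dtype=float) - mu
    norm = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)
```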

2. Gaussian Point and Coverage Rewards

Gaussian rewards are operationalized via two synergistic mechanisms:

  • Gaussian Point Reward: Measures the precision of the predicted click center $\mu_p$ relative to the ground-truth Gaussian. The reward is the unnormalized value of the ground-truth density at $\mu_p$, so a prediction exactly at the target centroid scores 1:

$$R_\text{point} = \exp\left(-\frac{(c_{x,p} - c_{x,\text{gt}})^2}{2\sigma_{x,\text{gt}}^2} - \frac{(c_{y,p} - c_{y,\text{gt}})^2}{2\sigma_{y,\text{gt}}^2}\right)$$

  • Gaussian Coverage Reward: Measures spatial alignment via the Bhattacharyya coefficient, which quantifies the overlap between the predicted and ground-truth Gaussian distributions:

$$BC(N_p, N_\text{gt}) = \int \sqrt{N(x; \mu_p, \Sigma_p)\, N(x; \mu_\text{gt}, \Sigma_\text{gt})}\, dx$$

This overlap admits a closed form:

$$R_\text{coverage} = \exp\left[-\frac{1}{8}(\mu_p - \mu_\text{gt})^\top \bar{\Sigma}^{-1}(\mu_p - \mu_\text{gt}) - \frac{1}{2}\ln\left(\frac{\det\bar{\Sigma}}{\sqrt{\det\Sigma_p \det\Sigma_\text{gt}}}\right)\right]$$

with $\bar{\Sigma} = (\Sigma_p + \Sigma_\text{gt})/2$ (Tang et al., 21 Jul 2025).

These dual signals enable both fine-grained localization and robust alignment with target regions.
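
Both rewards admit a direct implementation. The sketch below is ours, assuming the closed forms above; the weights $\nu$ and $\gamma$ follow the notation of Section 4, and the defaults of 1.0 are illustrative assumptions:

```python
import numpy as np

def point_reward(mu_p, mu_gt, sigma_gt):
    """Unnormalized Gaussian point reward; peaks at 1 when mu_p == mu_gt."""
    d = (np.asarray(mu_p, dtype=float) - mu_gt) / sigma_gt
    return np.exp(-0.5 * np.sum(d ** 2))

def coverage_reward(mu_p, cov_p, mu_gt, cov_gt):
    """Bhattacharyya coefficient between two 2D Gaussians (closed form)."""
    cov_bar = (cov_p + cov_gt) / 2.0
    diff = np.asarray(mu_p, dtype=float) - mu_gt
    maha = diff @ np.linalg.inv(cov_bar) @ diff / 8.0
    log_det = 0.5 * np.log(np.linalg.det(cov_bar)
                           / np.sqrt(np.linalg.det(cov_p) * np.linalg.det(cov_gt)))
    return np.exp(-maha - log_det)

def total_reward(mu_p, cov_p, mu_gt, cov_gt, nu=1.0, gamma=1.0):
    """Weighted combination R_total = nu * R_point + gamma * R_coverage."""
    sigma_gt = np.sqrt(np.diag(cov_gt))
    return (nu * point_reward(mu_p, mu_gt, sigma_gt)
            + gamma * coverage_reward(mu_p, cov_p, mu_gt, cov_gt))
```

Identical predicted and ground-truth Gaussians score 1 under both signals, and each reward decays smoothly as the prediction drifts, which is exactly the dense gradient structure the binary scheme lacks.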

3. Adaptive Variance Mechanisms and Fitts’ Law Connections

A fixed variance $\sigma$ does not account for the heterogeneity of GUI element sizes. Both the GUI-G² and V2P frameworks therefore implement adaptive variance mechanisms that tie the Gaussian spread to the element's width and height:

  • GUI-G²: $\sigma_x = \alpha(x_2 - x_1)$, $\sigma_y = \alpha(y_2 - y_1)$, with a default $\alpha = 0.5$ (the "2$\sigma$ principle").
  • V2P: $\sigma_x = w/\sigma_\text{factor}$, $\sigma_y = h/\sigma_\text{factor}$, with $\sigma_\text{factor} = 1.0$ found optimal on challenging benchmarks (Chen et al., 11 Jan 2026).

This scaling mimics human motor variability modeled in Fitts’ Law, where endpoint distributions are roughly Gaussian with standard deviation proportional to target width or height, thus capturing the interaction uncertainty intrinsic to GUI actions. For larger elements, Gaussians become broader, whereas smaller targets induce sharper peaks (Chen et al., 11 Jan 2026).
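
The two schedules are one-liners in practice; a minimal sketch with function names of our own choosing:

```python
def adaptive_sigma_gui_g2(bbox, alpha=0.5):
    """GUI-G2 schedule: std proportional to box size (alpha = 0.5 default)."""
    x1, y1, x2, y2 = bbox
    return alpha * (x2 - x1), alpha * (y2 - y1)

def adaptive_sigma_v2p(width, height, sigma_factor=1.0):
    """V2P schedule: std = size / sigma_factor (1.0 reported as optimal)."""
    return width / sigma_factor, height / sigma_factor
```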

4. Reinforcement and Attention Optimization Frameworks

Gaussian rewards can be integrated into reinforcement learning and attention calibration pipelines:

  • Policy Gradient Formulation (GUI-G²): The combined reward $R_\text{total} = \nu R_\text{point} + \gamma R_\text{coverage}$ drives Group Relative Policy Optimization (GRPO), a critic-free variant of PPO. Rewards are computed for a group of candidate responses, normalized by the group's mean and standard deviation, and used in the clipped policy objective with additional KL regularization.
  • Attention Calibration (V2P): GUI screenshots are split into $M$ non-overlapping patches, and the Gaussian mass of each patch is computed by CDF integration. The model's attention vector $a$ is aligned to the normalized Gaussian prior $p$ with a KL-divergence loss. Simultaneously, a suppression attention mechanism penalizes background regions, forming a "valley" around the target region and reducing attention drift.

The total V2P objective is:

$$\mathcal{L} = \mathcal{L}_\text{NTP} + \lambda_1\,\mathcal{L}_\text{SupAttn} + \lambda_2\,\mathcal{L}_\text{ActionAttn}$$

where $\mathcal{L}_\text{NTP}$ is the standard next-token prediction loss (Chen et al., 11 Jan 2026).
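
The group normalization and the patch-wise prior can both be sketched compactly. The code below is ours, not from either paper: the CDF integration mirrors the description above, while the KL direction and the epsilon smoothing are assumptions, and the suppression term is omitted:

```python
import numpy as np
from scipy.special import erf

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize rewards within a candidate group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def _phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / np.sqrt(2.0)))

def patch_gaussian_prior(patches, mu, sigma):
    """Integrate the target Gaussian over M non-overlapping patches.

    patches: (M, 4) array of [u1, v1, u2, v2] patch corners.
    Returns the normalized prior p over patches.
    """
    u1, v1, u2, v2 = np.asarray(patches, dtype=float).T
    mass = ((_phi((u2 - mu[0]) / sigma[0]) - _phi((u1 - mu[0]) / sigma[0]))
            * (_phi((v2 - mu[1]) / sigma[1]) - _phi((v1 - mu[1]) / sigma[1])))
    return mass / mass.sum()

def attention_alignment_loss(attention, prior, eps=1e-8):
    """KL(prior || attention): penalizes attention mass placed off-target."""
    p = np.asarray(prior, dtype=float) + eps
    a = np.asarray(attention, dtype=float) + eps
    return float(np.sum(p * np.log(p / a)))
```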

5. Empirical Evaluation and Benchmarks

Robust evaluation is performed on multiple GUI grounding datasets:

  • ScreenSpot and ScreenSpot-v2 (diverse device types and layouts)
  • ScreenSpot-Pro (high-resolution, professional interfaces)

Metrics center on the percentage of predicted click points that fall within the ground-truth bounding box. Results indicate:

  • GUI-G²: 92.0% (ScreenSpot), 93.3% (ScreenSpot-v2), 47.5% (ScreenSpot-Pro)
  • V2P-7B: 92.4% (ScreenSpot-v2), 52.5% (ScreenSpot-Pro)

Relative gains are substantial compared to prior methods:

  • GUI-G² improves +24.7% over UI-TARS-72B on ScreenSpot-Pro despite using 10× fewer parameters (Tang et al., 21 Jul 2025).
  • V2P achieves a 5-point accuracy gain from its Fitts'-Law-guided Gaussian peak on challenging benchmarks (Chen et al., 11 Jan 2026).
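
The reported metric itself is simple to express in code; a minimal sketch with hypothetical array interfaces:

```python
import numpy as np

def grounding_accuracy(pred_points, gt_boxes):
    """Fraction of predicted points inside their ground-truth boxes.

    pred_points: (N, 2) array of predicted (x, y) click centers.
    gt_boxes:    (N, 4) array of [x1, y1, x2, y2] target boxes.
    """
    x, y = pred_points[:, 0], pred_points[:, 1]
    x1, y1, x2, y2 = gt_boxes.T
    hits = (x >= x1) & (x <= x2) & (y >= y1) & (y <= y2)
    return float(hits.mean())
```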

6. Ablation Studies and Robustness Analysis

Comprehensive ablations isolate the contributions of Gaussian reward components and adaptive mechanisms. Key findings include:

| Ablation Variant | Accuracy (%) | Interpretation |
|---|---|---|
| Full GUI-G² (point + coverage) | 93.3 | Best accuracy; both signals needed for precision and overlap |
| Point only | 90.2 | Localizes well but weaker regional alignment |
| Coverage only | 92.1 | Strong overlap but less center precision |
| Gaussian restricted to inside-box | 88.4 | Loses early-training guidance |
| Fixed $\sigma$ | 87.8 | Under-constrains large elements, over-constrains small icons |
| 2$\sigma$ principle ($\alpha = 0.5$) | 92.9–93.3 | Optimal adaptive variance |
| V2P w/o Gaussian peak | 47.5 | −5 pt loss on ScreenSpot-Pro |
| V2P w/o peak & suppression | 44.3 | −8.2 pt loss |

Continuous, everywhere-applied Gaussian rewards yield smoother, monotonic convergence and greater generalizability to unseen layouts and widget densities. Application of chain-of-thought prompting degrades grounding performance, likely due to distraction from visual localization (Tang et al., 21 Jul 2025).

7. Significance and Outlook

Gaussian rewards provide a paradigm shift from sparse classification to dense, geometry-aware reward modeling in GUI spatial reasoning. This enables more informative gradient signals across the interaction plane, accelerating training and improving robustness to layout shifts and semantic variations. Empirical results demonstrate superiority to both distance-only and binary reward schemes.

A plausible implication is that further integration of probabilistic spatial priors, especially those grounded in human-computer interaction theory, may advance generalization in future GUI agents. As the V2P and GUI-G² frameworks substantiate, Fitts'-Law-inspired adaptive variance and patch-wise normalization are foundational to overcoming attention drift and achieving robust, precise interface localization in real-world deployments (Tang et al., 21 Jul 2025; Chen et al., 11 Jan 2026).

References

  • Tang et al., 21 Jul 2025.
  • Chen et al., 11 Jan 2026.
