Uncertainty-adjusted Group Relative Policy Optimization
- The paper introduces UARPO as a reinforcement learning framework that incorporates uncertainty estimates into group-relative policy optimization for more stable learning.
- It scales policy gradients by uncertainty-adjusted relative advantages, enabling conservative updates in high-uncertainty scenarios and improving sample efficiency.
- Empirical results demonstrate that UARPO can boost prediction accuracy by approximately 13.48% in financial forecasting and achieve robust performance in robotics and language modeling.
Uncertainty-adjusted Group Relative Policy Optimization (UARPO) is an advanced reinforcement learning (RL) framework that incorporates uncertainty quantification and group-relative metrics into policy optimization algorithms. UARPO generalizes and extends Group Relative Policy Optimization (GRPO) by explicitly modulating policy gradients or update magnitudes based on systematic uncertainty estimates, allowing for more conservative updates in high-uncertainty scenarios and finer adaptation in multi-agent, multi-modal, or high-stakes prediction settings.
1. Theoretical Foundations and Motivation
UARPO arises from the recognition that standard RL approaches often face instability, policy overfitting, and poor generalization when confronted with model or environment uncertainty. Classical model-based RL methods utilize a learned dynamics model to simulate trajectories for policy learning but are susceptible to model bias—errors that are exacerbated when uncertainty estimates are ignored (Vuong et al., 2019).
Group Relative Policy Optimization introduces a further refinement by comparing performance across candidate action groups or agent clusters, using relative, rather than absolute, advantage measures. UARPO embeds uncertainty quantification directly into this paradigm, integrating uncertainty-adjusted relative advantages at the group or token level for gradient modulation (Wang et al., 10 Sep 2025).
The general UARPO scheme can be formalized, schematically, as optimizing a clipped, KL-regularized surrogate objective of the form

$$J_{\mathrm{UARPO}}(\theta)=\mathbb{E}_{q,\{o_i\}_{i=1}^{G}}\!\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\Big(\rho_i(\theta)\,\tilde{A}_i,\ \operatorname{clip}\big(\rho_i(\theta),1-\epsilon,1+\epsilon\big)\,\tilde{A}_i\Big)\right]-\beta\,D_{\mathrm{KL}}\!\big(\pi_\theta\,\Vert\,\pi_{\mathrm{ref}}\big),\qquad \tilde{A}_i=f(u_i)\,A_i,$$

where $\rho_i(\theta)=\pi_\theta(o_i\mid q)/\pi_{\theta_{\mathrm{old}}}(o_i\mid q)$ is the importance ratio for sampled output $o_i$, $A_i$ is a group-relative advantage, $u_i$ is an uncertainty estimate, and $f(\cdot)$ is a modulation function; the advantage and uncertainty terms can be instantiated along in-group, cross-group, and sample-wise axes.
2. Uncertainty Quantification and Integration
a. Uncertainty Estimation
UARPO incorporates uncertainty through various mechanisms—predictive variance in learned dynamics models, epistemic uncertainty over Q-values, semantic entropy for LLMs, or explicit confidence scores in multimodal architectures:
- Model-based RL: Predictive posterior distributions or model ensembles are used to estimate the transition kernel uncertainty, which is then regularized or penalized in the policy gradient computation (Vuong et al., 2019, Zhou et al., 2019).
- Value-based RL: Tight upper bounds on Q-value variance are propagated via a Bellman-style recursion, yielding both local and global estimates of uncertainty for policy updates (Zhou et al., 2019).
- Language/Multimodal Models: Semantic entropy is measured over sampled outputs by clustering generated answers and applying information-theoretic metrics, or by leveraging token-level confidence predictions (Chen et al., 18 May 2025, Wang et al., 10 Sep 2025); a minimal sketch of the semantic-entropy estimate follows this list.
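As a concrete illustration of the semantic-entropy mechanism, the sketch below clusters sampled answers and computes the entropy of the resulting cluster distribution. It is a minimal illustration rather than the cited implementations: normalized exact match stands in for the NLI- or LLM-based semantic-equivalence check used in practice, and the function names are assumptions.

```python
import math
from collections import Counter

def normalize(answer: str) -> str:
    """Crude semantic-equivalence proxy: lowercase and keep only alphanumerics.
    In practice this would be replaced by an NLI- or LLM-based equivalence check."""
    return "".join(ch for ch in answer.lower() if ch.isalnum())

def semantic_entropy(sampled_answers: list[str]) -> float:
    """Cluster sampled answers by (proxy) semantic equivalence and return the
    entropy of the empirical cluster distribution, in nats."""
    clusters = Counter(normalize(a) for a in sampled_answers)
    n = len(sampled_answers)
    return -sum((c / n) * math.log(c / n) for c in clusters.values())

# Example: three clusters {"42", "fortytwo", "7"} with counts 3/1/1 give ~0.95 nats;
# fewer clusters (more agreement) would give lower entropy, i.e. lower uncertainty.
samples = ["42", "42.", "forty-two", "42", "7"]
print(semantic_entropy(samples))
```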
b. Uncertainty-Adjusted Advantage
The advantage used for policy updates in UARPO is systematically scaled by the uncertainty estimate. In settings where semantic entropy is the uncertainty measure, this yields:
$$\tilde{A}_i = w\big(\mathrm{SE}(q)\big)\,A_i,$$

where $A_i$ is the baseline group-relative advantage, $\mathrm{SE}(q)$ is the semantic entropy for prompt $q$, and $w(\cdot)$ is a weighting function (linear, exponential, or focal mask) (Chen et al., 18 May 2025).
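A minimal sketch of this weighting, assuming an exponential form w(SE) = exp(-scale * SE); the functional form and the scale parameter are illustrative choices, not the exact scheme of Chen et al. (18 May 2025):

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """Standardize rewards within a group of G sampled responses."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def uncertainty_adjusted_advantages(rewards: np.ndarray,
                                    semantic_entropy: float,
                                    scale: float = 1.0) -> np.ndarray:
    """Scale group-relative advantages by w(SE) = exp(-scale * SE):
    high-entropy (uncertain) prompts receive smaller, more conservative updates."""
    w = np.exp(-scale * semantic_entropy)
    return w * group_relative_advantages(rewards)

rewards = np.array([1.0, 0.0, 0.0, 1.0])  # e.g., correctness rewards for G = 4 samples
print(uncertainty_adjusted_advantages(rewards, semantic_entropy=0.95))
```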
In multimodal financial forecasting, a more complex uncertainty-adjusted group advantage is realized:
$$\tilde{A}_{i,t}=\frac{r_i-\operatorname{mean}\big(\{r_j\}_{j=1}^{G}\big)}{\operatorname{std}\big(\{r_j\}_{j=1}^{G}\big)}\cdot g(c_{i,t})\quad\text{(schematically)},$$

with $r_i$ denoting the sample reward, $c_{i,t}$ the model's confidence output at token $t$, and $g(\cdot)$ a confidence-dependent weighting whose exact form is given in the original work (Wang et al., 10 Sep 2025).
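A brief numerical sketch of this token-level gating, under the simplifying assumption that the raw confidence is used directly as the per-token weight (the actual FinZero formulation may differ):

```python
import numpy as np

def token_level_adjusted_advantage(sample_rewards: np.ndarray,
                                   confidences: np.ndarray) -> np.ndarray:
    """sample_rewards: shape (G,), one scalar reward per sampled response.
    confidences: shape (G, T), per-token confidence outputs in [0, 1].
    Returns per-token advantages of shape (G, T): the standardized
    group-relative advantage of each sample, gated by its token confidences."""
    adv = (sample_rewards - sample_rewards.mean()) / (sample_rewards.std() + 1e-8)
    return adv[:, None] * confidences
```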
3. Policy Optimization Algorithms
UARPO extends PPO-style surrogate objectives by introducing relative, group-wise, and uncertainty-adjusted policy constraints. Updates commonly proceed as follows:
- Sample a group of outputs/responses/actions for each input or state.
- Compute individual and group-average rewards.
- Calculate in-group and, where applicable, cross-group standardized advantages.
- Modulate the update by the uncertainty estimate per sample/group.
- Apply a clipped surrogate objective, often including a KL divergence penalty for trust-region regularization, as in the UARPO objective given above; a minimal sketch of this update step follows the list.
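The following is a minimal, PyTorch-style sketch of the update recipe above for a single group of G samples. The function name, hyperparameter defaults, and the exponential uncertainty weight are illustrative assumptions, not a reference implementation of any cited method.

```python
import torch

def uarpo_loss(logp_new: torch.Tensor,     # (G,) log pi_theta(o_i | q) for the sampled group
               logp_old: torch.Tensor,     # (G,) log-probs under the sampling policy (detached)
               logp_ref: torch.Tensor,     # (G,) log-probs under a frozen reference policy
               rewards: torch.Tensor,      # (G,) scalar rewards
               uncertainty: torch.Tensor,  # (G,) or scalar uncertainty estimates
               eps: float = 0.2,
               beta: float = 0.01) -> torch.Tensor:
    """Clipped, KL-regularized UARPO surrogate for one group of G samples."""
    # Steps 1-3: group-relative advantages from individual vs. group-average rewards.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Step 4: modulate by uncertainty -> conservative updates when uncertainty is high.
    adv = adv * torch.exp(-uncertainty)
    # Step 5: PPO-style clipped surrogate plus a KL penalty toward the reference policy.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    surrogate = torch.min(unclipped, clipped).mean()
    kl = (logp_new - logp_ref).mean()   # crude Monte-Carlo proxy for KL(pi_theta || pi_ref)
    return -(surrogate - beta * kl)     # minimize the negative objective
```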
In high-dimensional control or planning domains, UARPO can be integrated with TD learning and explicit trust-region constraints in the latent policy space, using softmax-based relative advantages across action candidates and enforcing KL penalties to bound policy divergence (Nguyen et al., 19 May 2025).
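One plausible instantiation of such a softmax-based relative advantage over $K$ action candidates $a_1,\dots,a_K$ with value estimates $Q(s,a_k)$ and temperature $\tau$ is

$$A_{\mathrm{rel}}(s,a_k)=Q(s,a_k)-\sum_{j=1}^{K}\frac{\exp\!\big(Q(s,a_j)/\tau\big)}{\sum_{l=1}^{K}\exp\!\big(Q(s,a_l)/\tau\big)}\,Q(s,a_j),$$

i.e., the value of a candidate relative to a softmax-weighted baseline over the candidate set; this form is an illustrative assumption rather than the exact TD-GRPC definition (Nguyen et al., 19 May 2025).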
4. Applications Across Modalities and Domains
UARPO finds application in diverse RL contexts where reasoning about and exploiting uncertainty is crucial:
- Multimodal Financial Forecasting: In the FinZero model, UARPO enables both accurate price/volatility forecasting and explicit uncertainty analysis by operating over financial image-text pairs, leading to measurable improvements over strong LLM baselines in high-confidence segments (Wang et al., 10 Sep 2025).
- Robotics and Continuous Control: UARPO is instantiated using model-based RL with uncertainty-aware dynamics models and/or group-level advantage functions, improving sample efficiency and robustness in challenging continuous control and sim-to-real transfer (Vuong et al., 2019, Ilboudo et al., 7 Oct 2024).
- Humanoid Locomotion: The TD-GRPC framework leverages group comparison and uncertainty-adjusted constraints to stabilize learning and mitigate policy mismatch in high-DoF settings (Nguyen et al., 19 May 2025).
- LLM Fine-Tuning: In LLMs, incorporating semantic entropy in UARPO-type objective functions yields state-of-the-art mathematical reasoning accuracy by weighting updates according to prompt-level uncertainty (Chen et al., 18 May 2025).
- Personalized Medical Intervention: UARPO is proposed as a refinement to GRPO for clinical decision support, enabling safety-aware and robust interventions under high data heterogeneity and noisy multi-source signals (Lu et al., 25 Apr 2025).
5. Empirical Results and Impact
Quantitative experiments consistently report that UARPO outperforms baseline methods, both in terms of final accuracy and sample efficiency. For example, in FinZero, UARPO achieves an approximate 13.48% increase in prediction accuracy over GPT-4o on high-confidence financial forecasts (Wang et al., 10 Sep 2025). In mathematical reasoning, uncertainty-aware GRPO achieves top-tier Pass@1 scores on advanced benchmarks (Chen et al., 18 May 2025). In continuous control benchmarks, UARPO-style frameworks yield higher sample efficiency and more robust performance compared to prior state-of-the-art model-based and model-free RL algorithms (Vuong et al., 2019, Zhou et al., 2019, Queeney et al., 2020).
A common observation is a positive correlation between the model's confidence (or uncertainty estimate) and the reliability of its predictions—an effect directly promoted by UARPO's weighting schemes.
6. Limitations and Open Directions
While UARPO provides systematic tools for incorporating uncertainty into RL, several challenges remain:
- Uncertainty Aggregation: In multi-agent or group settings, defining aggregation functions for uncertainty across agents or subpopulations is non-trivial and may raise fairness or efficiency issues (Vuong et al., 2019, Ilboudo et al., 7 Oct 2024).
- Computational Overhead: The need to sample multiple trajectories/actions per update, estimate reliable uncertainty, and backpropagate through complex models increases computational cost, especially in high-dimensional or high-frequency domains (Nguyen et al., 19 May 2025).
- Deployment and Calibration: Accurate calibration of uncertainty estimates (e.g., confidence scores or entropy measures) is critical for performance, yet remains sensitive to model architecture and dataset bias (Wang et al., 10 Sep 2025, Chen et al., 18 May 2025).
- Trade-off Tuning: Hyperparameters controlling the impact of uncertainty in update modulation (such as scaling factors and clipping thresholds) require domain-specific tuning for optimal effectiveness.
Suggested directions for future work include further refining uncertainty quantification, exploring adaptive regularization strategies (e.g., dynamic trust-region constraints or adaptive entropy weights), and extending UARPO frameworks for robust transfer in non-stationary or adversarial environments.
7. Relationships to Related Paradigms
UARPO synthesizes principles from several RL subfields:
- Model-based RL with Uncertainty: Originating from uncertainty-aware model learning and dynamics propagation (Vuong et al., 2019, Zhou et al., 2019), UARPO systematically propagates epistemic and aleatoric uncertainties into policy updates.
- Multi-objective RL: By casting domain or subgroup performance as independent objectives, UARPO leverages convex coverage set learning and scalarization approaches from multi-objective reinforcement learning to balance trade-offs under uncertainty (Ilboudo et al., 7 Oct 2024).
- Trust-region and Robust Optimization: UARPO is closely related to robust trust-region policy optimization approaches that adjust the trust region scale according to finite-sample or epistemic uncertainty (Queeney et al., 2020).
- Relative Advantage and Group Comparison: UARPO formalizes group-wise advantage (i.e., relative to in-group or cross-group baselines) as the core learning signal, aligning with recent advances in stable RL for high-dimensional control (Nguyen et al., 19 May 2025, Chen et al., 18 May 2025).
Plausibly, UARPO represents a unifying framework for risk-sensitive, robust, and sample-efficient policy optimization in complex, uncertain, and multi-agent RL environments, supported by extensive empirical and theoretical analysis across multiple domains.