RVPO: Risk-Sensitive Alignment via Variance Regularization

Published 7 May 2026 in cs.LG and cs.CL | (2605.05750v1)

Abstract: Current critic-less RLHF methods aggregate multi-objective rewards via an arithmetic mean, leaving them vulnerable to constraint neglect: high-magnitude success in one objective can numerically offset critical failures in others (e.g., safety or formatting), masking low-performing "bottleneck" rewards vital for reliable multi-objective alignment. We propose Reward-Variance Policy Optimization (RVPO), a risk-sensitive framework that penalizes inter-reward variance during advantage aggregation, shifting the objective from "maximize sum" to "maximize consistency." We show via Taylor expansion that a LogSumExp (SoftMin) operator effectively acts as a smooth variance penalty. We evaluate RVPO on rubric-based medical and scientific reasoning with up to 17 concurrent LLM-judged reward signals (Qwen2.5-3B/7B/14B) and on tool-calling with rule-based constraints (Qwen2.5-1.5B/3B). By preventing the model from neglecting difficult constraints to exploit easier objectives, RVPO improves overall scores on HealthBench (0.261 vs. 0.215 for GDPO at 14B, $p < 0.001$) and maintains competitive accuracy on GPQA-Diamond without the late-stage degradation observed in other multi-reward methods, demonstrating that variance regularization mitigates constraint neglect across model scales without sacrificing general capabilities.


Authors (3)

Summary

The paper introduces a novel risk-sensitive aggregation approach that penalizes reward variance to prevent constraint neglect in multi-objective RLHF.
It employs a soft-min operator to balance mean aggregation with strict minimum enforcement, achieving robust performance across health and tool-calling benchmarks.
Empirical results demonstrate enhanced bottleneck constraint adherence and training stability, significantly boosting scores and accelerating convergence compared to GRPO and GDPO.

Motivation and Problem Statement

The paper addresses constraint neglect in critic-less RLHF for LLMs performing multi-objective alignment. Existing aggregation schemes (arithmetic mean in GRPO and GDPO) induce loss compensation: the optimizer is blind to bottleneck objective failures when high-magnitude rewards on easier objectives are present. This produces policies that over-optimize unconstrained metrics (e.g., verbosity) at the cost of safety, formatting, or completeness requirements. Robust RLHF for LLMs requires policies to satisfy all constraints, especially the strictest or lowest-performing one, without explicit constraint specification or costly memory overhead from value networks.

RVPO Formulation

RVPO introduces a risk-sensitive reward aggregation mechanism using variance regularization. Standardized rewards $Z_j$ across $M$ objectives are aggregated not by their mean, but by a variance-penalized objective:

$A_{RVPO\text{-}explicit}^{(g)} = \mu_Z^{(g)} - \beta \cdot (\sigma_Z^{(g)})^2$
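As a minimal sketch of the explicit form, assume the $M$ rewards for one rollout $g$ have already been standardized into a list `z_row`. The function name and the example values are illustrative and do not come from the paper.

```python
from statistics import fmean, pvariance

def explicit_variance_advantage(z_row, beta):
    """Return mu_Z - beta * sigma_Z^2 for one rollout's standardized rewards."""
    mu = fmean(z_row)
    # Population variance across the M objectives of this rollout.
    return mu - beta * pvariance(z_row, mu)

# Two hypothetical rollouts with the same mean reward.
lopsided = [1.0, 1.0, -2.0]   # strong on two objectives, failing the third
balanced = [0.0, 0.0, 0.0]    # uniform across all three

print(explicit_variance_advantage(lopsided, beta=0.5))  # -1.0
print(explicit_variance_advantage(balanced, beta=0.5))  # 0.0
```

Both rollouts have a mean of zero, so mean aggregation cannot tell them apart; only the variance term ranks the balanced rollout above the lopsided one.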

An explicit variance term can be unstable and can over-penalize, especially when $M$ is small or the rewards are heterogeneous. RVPO therefore replaces it with a LogSumExp (SoftMin) operator:

$A_{RVPO}^{(g)} = -\frac{1}{k} \ln \left( \frac{1}{M} \sum_{j=1}^M e^{-k Z_j^{(g)}} \right)$

Here, the risk coefficient $k$ interpolates between mean aggregation ($k \to 0$) and the strict minimum ($k \to \infty$). Because the mean is recovered as $k \to 0$, RVPO is a proper generalization of GDPO and allows smooth control over the mean-min tradeoff. A second-order Taylor expansion in $k$ shows that the SoftMin behaves like the explicit variance penalty with $\beta = k/2$, which concentrates optimization pressure on the most difficult constraints.
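The two limits and the small-$k$ approximation can be checked numerically. The sketch below is one way to evaluate the SoftMin stably, by shifting the exponents by the row minimum. The function name and sample values are assumptions made for illustration, not the authors' implementation.

```python
import math
from statistics import fmean, pvariance

def softmin_advantage(z_row, k):
    """Return -(1/k) * ln((1/M) * sum_j exp(-k * z_j)), computed stably."""
    if k <= 0:
        raise ValueError("k must be positive; the k -> 0 limit is the mean")
    m = min(z_row)  # shift by the minimum so every exponent is <= 0
    total = sum(math.exp(-k * (z - m)) for z in z_row)
    return m - math.log(total / len(z_row)) / k

z = [1.0, 1.0, -2.0]                      # illustrative values only
mu, var = fmean(z), pvariance(z)
print(softmin_advantage(z, 1e-3), mu - (1e-3 / 2) * var)  # nearly equal
print(softmin_advantage(z, 50.0), min(z))                 # close to the minimum
```

The small-$k$ agreement follows from the cumulant expansion $\ln \mathbb{E}[e^{-kZ}] = -k\mu_Z + \frac{k^2}{2}\sigma_Z^2 + O(k^3)$: dividing by $-k$ gives $\mu_Z - \frac{k}{2}\sigma_Z^2 + O(k^2)$, which matches the explicit form with $\beta = k/2$.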

Empirical Evaluation

RVPO is evaluated against GRPO and GDPO across two domains:

Rubrics-as-Rewards (RaR): LLM-judged multi-axis rubrics (5–17 criteria) for medical and scientific reasoning (HealthBench and GPQA-Diamond benchmarks).
Tool Calling (RLLA-4k): Two reward signals (execution correctness, format compliance) in rule-based function-calling trajectories.

RVPO is implemented via Verl/TRL on Qwen2.5 models (1.5B–14B), with the risk coefficient $k$ annealed during training. Evaluation uses micro-averaged rubric scores for HealthBench and BFCL-v3 metrics for tool calling. All methods are trained and tested under identical compute and data regimes.

Figure 1: Per-axis performance at the optimal checkpoint on HealthBench (Medicine, Qwen2.5-7B); RVPO redistributes optimization pressure to bottleneck axes and achieves higher overall scores than GDPO.

RVPO consistently improves adherence to bottleneck constraints across model scales. On HealthBench, arithmetic aggregation with GDPO over-optimizes Communication Quality and Instruction Following but neglects Completeness and Context Awareness. RVPO’s variance penalty substantially increases scores on the stricter axes, raising Completeness from 11.1% to 15.2% and Accuracy from 30.0% to 33.3%. The overall score at 14B reaches 0.261 (vs. 0.215 for GDPO, $p < 0.001$). Crucially, RVPO avoids the late-stage training collapse observed in GDPO and other mean-based baselines, maintaining stability and robust alignment.
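One way to see why pressure shifts toward bottleneck axes is to differentiate the SoftMin: $\partial A_{RVPO}^{(g)} / \partial Z_j^{(g)} = e^{-k Z_j^{(g)}} / \sum_i e^{-k Z_i^{(g)}}$, a softmax over $-kZ$, whereas the arithmetic mean weights every objective by $1/M$. The sketch below computes these weights; the function name and the three sample scores are hypothetical and are not HealthBench data.

```python
import math

def softmin_weights(z_row, k):
    """Per-objective gradient weights of the SoftMin: softmax over -k * z."""
    m = min(z_row)  # shift for numerical stability; the weights are unchanged
    exps = [math.exp(-k * (z - m)) for z in z_row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical standardized scores: two easy axes and one bottleneck axis.
z = [1.2, 0.8, -1.5]
print(softmin_weights(z, k=2.0))    # most of the weight falls on the last axis
print(softmin_weights(z, k=1e-9))   # approximately uniform, like the mean
```

The weights always sum to one, so raising $k$ does not change the total gradient mass; it only reallocates it toward the lowest-scoring objectives.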

Tool Calling: Training Dynamics

The tool-calling experiments show RVPO converging faster and satisfying constraints earlier than the baselines. With GRPO and GDPO, the format adherence constraint is satisfied only late in training due to loss compensation (Figure 2). With RVPO, execution correctness and format compliance improve together, with lower inter-run variance and faster convergence.

Figure 2: Tool Calling (RLLA) training progression; RVPO outperforms baselines in enforcing formatting constraints alongside execution accuracy.

Hyperparameter Sensitivity

The risk coefficient $k$ (inverse temperature in soft-min) is a critical hyperparameter for curriculum robustness. High constant $k$ schedules boost peak performance by tightening bottleneck pressure, but induce instability; low $k$ schedules are stable but less performant. Annealing $k$ from low to moderate values yields optimal stability and peak scores, allowing general capabilities to be established before variance penalization dominates.

Figure 3: Risk coefficient sweep on HealthBench; annealed $k$ schedules best balance peak performance and training stability.
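The summary does not state the shape or endpoints of the annealing schedule. As a sketch only, a linear ramp from a low to a moderate $k$ could look like the following; `annealed_k`, `k_start`, and `k_end` are hypothetical names, and the default values are placeholders rather than the paper's settings.

```python
def annealed_k(step, total_steps, k_start=0.5, k_end=4.0):
    """Linearly ramp the risk coefficient from k_start to k_end.

    Assumption: a linear ramp with placeholder endpoints; the actual
    schedule used in the paper is not given in this summary.
    """
    if total_steps <= 1:
        return k_end
    # Clamp progress to [0, 1] so out-of-range steps stay within the bounds.
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return k_start + frac * (k_end - k_start)

for step in (0, 250, 500, 999):
    print(step, annealed_k(step, total_steps=1000))
```

Early steps then behave close to mean aggregation, and the bottleneck pressure tightens only as training progresses.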

RVPO’s soft-min formulation demonstrates enhanced robustness over explicit variance penalties ($A_{RVPO\text{-}explicit}$). The empirical-variance penalty with fixed $\beta$ is highly sensitive to reward space dimensionality and produces non-monotonic results, confirming theoretical expectations (Figure 4).

Figure 4: Explicit variance penalty sweep; RVPO’s soft-min is more stable and less sensitive to its hyperparameter than the explicit penalty.
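One property that may help explain this difference is offered here as an illustration, not as the paper's own analysis. For any $k > 0$ the soft-min stays between the minimum and the mean of its inputs, whereas $\mu_Z - \beta \sigma_Z^2$ is unbounded below as $\beta$ grows. The sketch compares the two at matched strength $\beta = k/2$; the function names and sample values are hypothetical.

```python
import math
from statistics import fmean, pvariance

def explicit(z_row, beta):
    """Explicit variance-penalized aggregate: mean minus beta times variance."""
    mu = fmean(z_row)
    return mu - beta * pvariance(z_row, mu)

def softmin(z_row, k):
    """Stable evaluation of -(1/k) * ln(mean(exp(-k * z)))."""
    m = min(z_row)
    total = sum(math.exp(-k * (z - m)) for z in z_row)
    return m - math.log(total / len(z_row)) / k

z = [2.0, 2.0, -4.0]  # hypothetical standardized rewards with one outlier
for strength in (0.5, 2.0, 8.0):
    # Matched strength: beta = k / 2, following the Taylor correspondence.
    print(strength, explicit(z, beta=strength / 2), softmin(z, k=strength))
```

At the larger strengths the explicit aggregate drops below the worst individual reward, while the soft-min can never be lower than `min(z)`.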

Practical and Theoretical Implications

RVPO generalizes mean aggregation in multi-reward RL by penalizing reward variance, thus robustly addressing constraint neglect without explicit constraint thresholds or multiple policy training. Its curriculum-driven optimization allows for scalable, risk-sensitive alignment across high-dimensional and dynamically varying reward spaces. The absence of value networks reduces memory and compute overhead, facilitating deployment. RVPO’s dynamic prioritization of bottleneck constraints yields policies with strict adherence to safety, formatting, and completeness, necessary for LLMs in practical, real-world settings.

RVPO’s flexible risk coefficient offers explicit control over objective prioritization, supporting advanced curriculum learning and robust multi-objective tuning. Its theoretical foundation in risk-sensitive MDPs and smooth interpolation between mean and min aggregation opens pathways for deeper exploration of difficulty-aware policy optimization, adaptive risk scheduling, and integration with prioritized reward weighting schemes.

Future Directions

Key future avenues include adaptive risk coefficient scheduling, integration with weighted reward systems to reconcile empirical difficulty and declared priorities, and further investigation into noise amplification and reward model reliability in variance-penalized regimes. RVPO’s formulation could enable more interpretable multi-objective policies, robust safety alignment, and Pareto-optimality in large-scale LLMs. It is likely to serve as a foundation for further research into single-run Pareto alignment and scalable, risk-sensitive RLHF.

Conclusion

RVPO introduces a scalable, risk-sensitive aggregation framework that fundamentally mitigates constraint neglect in critic-less RLHF for LLMs. By dynamically penalizing reward variance using the LogSumExp operator, RVPO aligns policies towards consistent, bottleneck-oriented behavior across model scales and domains, improving both peak and final constraint satisfaction. The explicit risk coefficient provides granularity in balancing general vs. strict objectives, with annealed curricula yielding robust training. As the field moves toward decomposed reward models and modular constraints, variance regularization as instantiated by RVPO will remain essential for reliable multi-objective LLM alignment.
