Extend certified sign-preservation radius to general reward models without full-parameter gradients

Extend the certified sign-preservation radius (the smallest perturbation to the reward model parameters that flips the sign of a completion’s group-relative advantage) to general reward models beyond linear-head architectures, without computing full-parameter gradient norms for each completion during policy optimization.

Background

The paper introduces the certified sign-preservation radius as a per-completion robustness measure: the largest parameter perturbation under which the sign of a completion’s advantage is preserved. Sign-Certified Policy Optimization (SignCert-PO) uses this radius to down-weight unreliable gradient contributions.

To make computation tractable, the paper derives a closed-form radius under a linear reward head and an uncertainty set over the head parameters. The authors note that for general differentiable reward models, computing the necessary per-completion full-parameter gradient norms is computationally infeasible in practice, motivating their linear-head approximation.
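The geometry behind a closed-form radius for a linear head can be sketched as follows. This is a minimal illustration under assumptions that are not taken from the paper: the reward is r_i = w · phi_i for a fixed feature vector phi_i, the advantage is the group-mean-centered reward, and the uncertainty set is an L2 ball around the head parameters w. The function name `linear_head_sign_radius` and its arguments are hypothetical. Under these assumptions the advantage is linear in w, so the smallest sign-flipping perturbation is the distance from w to a hyperplane, and no gradient through the backbone is needed.

```python
import numpy as np

def linear_head_sign_radius(features, w):
    """Per-completion L2 sign-preservation radius for a linear reward head.

    Assumptions (illustrative, not the paper's exact formulation):
      - reward r_i = w @ features[i]
      - advantage A_i = r_i - mean_j r_j within the group
      - perturbations to w are measured in the L2 norm
    """
    features = np.asarray(features, dtype=float)
    w = np.asarray(w, dtype=float)

    # Centering the features within the group makes A_i = w @ centered[i].
    centered = features - features.mean(axis=0, keepdims=True)
    advantages = centered @ w
    norms = np.linalg.norm(centered, axis=1)

    # Distance from w to the hyperplane {v : v @ centered[i] = 0}.
    # If centered[i] is the zero vector, A_i is 0 for every w, so the
    # radius is reported as 0.
    radius = np.divide(
        np.abs(advantages), norms,
        out=np.zeros_like(norms), where=norms > 0,
    )
    return advantages, radius
```

Dividing the centered reward by the group standard deviation, as group-relative methods commonly do, would not change the sign as long as that standard deviation is positive, so the same radius would apply in that case. For a general reward model the advantage is no longer linear in the parameters, which is where the per-completion full-parameter gradient norms mentioned above would enter.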

In the conclusion, the authors explicitly identify extending this certified sign-preservation radius to general reward models, without incurring the cost of full-parameter gradient norm computation, as an open problem. A solution would broaden applicability beyond the linear-head setting while maintaining computational efficiency.

References

Finally, extending the certified sign-preservation radius to general RMs without incurring the cost of full-parameter gradient norm computation remains an open problem.

— Mitigating Reward Hacking in RLHF via Advantage Sign Robustness (2604.02986 - Ono et al., 3 Apr 2026) in Conclusion, final paragraph
