Extend certified sign-preservation radius to general reward models without full-parameter gradients
Extend the certified sign-preservation radius, defined as the smallest perturbation to the reward model (RM) parameters that flips the sign of a completion’s group-relative advantage, to general reward models beyond linear-head architectures, without computing a full-parameter gradient norm for each completion during policy optimization.
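To make the definition concrete, here is a minimal sketch of the linear-head case only. It is not taken from the paper. It assumes the reward is r_i = w · phi_i on frozen features phi_i, that only the head weights w are perturbed, that perturbation size is measured in the L2 norm, and that the group-relative advantage is the reward minus the group mean (dividing by the group standard deviation would not change the sign). Under those assumptions the advantage is w · d_i with d_i = phi_i - mean(phi), so the smallest sign-flipping perturbation is the distance from w to the hyperplane w · d_i = 0, namely |w · d_i| / ||d_i||. The function name `linear_head_sign_radius` and its arguments are hypothetical.

```python
import math


def linear_head_sign_radius(w, features):
    """Per-completion L2 radius for one group, assuming r_i = w . phi_i.

    w        : list of floats, head weights (assumed to be the only perturbed parameters)
    features : list of feature vectors phi_i, one per completion in the group
    """
    n = len(features)
    dim = len(w)
    # Group-mean feature vector; the mean-baseline advantage is w . (phi_i - mean_phi).
    mean_phi = [sum(phi[k] for phi in features) / n for k in range(dim)]

    radii = []
    for phi in features:
        d = [phi[k] - mean_phi[k] for k in range(dim)]
        advantage = sum(w[k] * d[k] for k in range(dim))
        norm_d = math.sqrt(sum(x * x for x in d))
        # If d is zero the advantage is zero for every w, so no perturbation
        # of w can give it a sign; report an infinite radius in that case.
        radii.append(abs(advantage) / norm_d if norm_d > 0 else math.inf)
    return radii


# Hypothetical two-completion group with two-dimensional features.
radii = linear_head_sign_radius([1.0, 0.0], [[3.0, 4.0], [0.0, 0.0]])
print(radii)
```

In this toy group both completions have radius 0.6. For a general reward model the advantage is no longer linear in the parameters, and a first-order analogue of this distance would presumably divide |A_i| by the norm of the gradient of A_i with respect to all parameters, which is the per-completion full-parameter gradient norm that the problem statement asks to avoid.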
References
Finally, extending the certified sign-preservation radius to general RMs without incurring the cost of full-parameter gradient norm computation remains an open problem.
— Mitigating Reward Hacking in RLHF via Advantage Sign Robustness
(2604.02986 - Ono et al., 3 Apr 2026) in Conclusion, final paragraph