Leveraging response-dependent constants in the PM regularizer
Develop a principled method for choosing a nonzero constant term C_{2,x,y}, depending on both the prompt x and the response y, in the preference-matching regularizer R_{x,y}(π) = −log π + C_{1,x} + C_{2,x,y}/π, so that the resulting PM RLHF achieves good practical performance while preserving the preference-matching property under the Plackett–Luce model.
References
It is not clear how to leverage the dependence on both the prompt and response for a choice of nonzero C_{2,x,y} with good practical performance. We leave this question for future research.
— On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization
(arXiv:2405.16455, Xiao et al., 26 May 2024), Section 3.4 (Extension to Response-Dependent Regularization)