Leveraging response-dependent constants in the PM regularizer

Develop a principled method for choosing a nonzero constant term C_{2,x,y}, depending on both the prompt x and the response y, in the preference-matching regularizer R_{x,y}(π) = −log π + C_{1,x} + C_{2,x,y}/π (with π shorthand for π(y|x)), such that the resulting PM RLHF achieves good practical performance while preserving the preference-matching property under the Plackett–Luce model.
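
A minimal sketch of how this regularizer could be evaluated per sample during training is given below (Python/PyTorch). The function name, tensor shapes, and the clamping constant eps are illustrative assumptions; only the functional form of R_{x,y} is taken from the problem statement above.

import torch

def pm_regularizer(pi_y_given_x: torch.Tensor,
                   c1_x: torch.Tensor,
                   c2_xy: torch.Tensor,
                   eps: float = 1e-12) -> torch.Tensor:
    """Per-sample value of R_{x,y}(pi) = -log pi + C_{1,x} + C_{2,x,y}/pi.

    pi_y_given_x: probabilities pi(y|x) of the sampled responses under the
        current policy, shape (batch,).
    c1_x: prompt-dependent constants C_{1,x}, scalar or shape (batch,).
    c2_xy: prompt- and response-dependent constants C_{2,x,y}, shape (batch,).
    """
    pi = pi_y_given_x.clamp_min(eps)  # guard against log(0) and division by zero
    return -torch.log(pi) + c1_x + c2_xy / pi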

Background

The paper generalizes the PM regularizer to allow dependence on both the prompt and the response, showing that the necessary and sufficient form for preference matching is R_{x,y}(π) = −log π + C_{1,x} + C_{2,x,y}/π. While the last term does not affect the optimization objective in expectation (see the calculation below), its specific choice may still influence practical training dynamics and performance.
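
To make the "does not affect the objective in expectation" claim explicit: for any policy π,

E_{y∼π(·|x)}[C_{1,x} + C_{2,x,y}/π(y|x)] = C_{1,x} + Σ_y π(y|x) · C_{2,x,y}/π(y|x) = C_{1,x} + Σ_y C_{2,x,y},

which is a constant independent of π. The exact gradient contributed by these terms is therefore zero, so the preference-matching maximizer is unchanged; however, the sampled values of C_{2,x,y}/π(y|x) vary across responses, which is why the choice of C_{2,x,y} can still affect gradient variance and training dynamics.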

The authors note that although this response-dependent form admits preference matching, it remains unclear how to choose a nonzero C_{2,x,y} that yields good empirical performance; they leave the question to future research, indicating one possible direction (e.g., C_{2,x,y} proportional to a fixed reference policy) as a starting point.
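
As one concrete instantiation of that suggested direction (an illustrative assumption, not a method specified in the paper), one could take C_{2,x,y} = λ · π_ref(y|x) for a frozen reference policy π_ref and a tunable scale λ, so the extra term C_{2,x,y}/π(y|x) becomes λ · π_ref(y|x)/π(y|x), a probability ratio that is large on responses to which the current policy assigns much less mass than the reference. A minimal sketch, assuming sequence-level log-probabilities under both policies are available:

import torch

def pm_regularizer_with_ref(logp_pi: torch.Tensor,
                            logp_ref: torch.Tensor,
                            c1_x: torch.Tensor,
                            lam: float = 0.1) -> torch.Tensor:
    """R_{x,y}(pi) with the illustrative choice C_{2,x,y} = lam * pi_ref(y|x).

    logp_pi:  log pi(y|x) of sampled responses under the current policy, shape (batch,).
    logp_ref: log pi_ref(y|x) under a frozen reference policy, shape (batch,).
    lam:      hypothetical scale for C_{2,x,y}; not prescribed by the paper.

    Here C_{2,x,y}/pi(y|x) = lam * exp(logp_ref - logp_pi), a reference-to-policy
    probability ratio computed in log space for numerical stability.
    """
    ratio = torch.exp(logp_ref - logp_pi)  # pi_ref(y|x) / pi(y|x)
    return -logp_pi + c1_x + lam * ratio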

References

"It is not clear how to leverage the dependence on both the prompt and response for a choice of nonzero C_{2,x,y} with good practical performance. We leave this question for future research."

Xiao et al., "On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization," arXiv:2405.16455, 26 May 2024, Section 3.4 (Extension to Response-Dependent Regularization).