Leveraging response-dependent constants in the PM regularizer

Develop a principled method for choosing a nonzero constant term C_{2,x,y}, depending on both the prompt x and the response y, in the preference-matching regularizer R_{x,y}(π) = −log π + C_{1,x} + C_{2,x,y}/π, so that the resulting PM RLHF achieves good practical performance while preserving the preference-matching property under the Plackett–Luce model.

Background

The paper generalizes the PM regularizer to allow dependence on both the prompt and the response, showing that the necessary and sufficient form for preference matching is R_{x,y}(π) = −log π + C_{1,x} + C_{2,x,y}/π. While the last term does not affect the optimization objective in expectation, its specific choice may influence practical training dynamics and performance.
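
To see why the last term drops out, note that for any policy π(·|x) over a finite response set, E_{y∼π(·|x)}[C_{2,x,y}/π(y|x)] = Σ_y π(y|x) · C_{2,x,y}/π(y|x) = Σ_y C_{2,x,y}, a constant independent of π. The maximizer of the expected objective is therefore unchanged by C_{2,x,y}, and any effect of this term can only enter through finite-sample estimation and optimization dynamics.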

The authors note that although this response-dependent form admits preference matching, it remains unclear how to choose a nonzero C_{2,x,y} in practice to yield good empirical performance, and they suggest one possible direction (e.g., C_{2,x,y} proportional to a fixed reference policy) as a starting point.
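
As a purely illustrative sketch, not from the paper, the NumPy snippet below instantiates the direction mentioned above (choosing C_{2,x,y} proportional to a fixed reference policy) for a toy prompt with a small set of candidate responses, and checks numerically that the C_{2,x,y}/π term adds the same constant, Σ_y C_{2,x,y}, to the expected regularizer under any policy. All names and values (K, rewards, pi_ref, scale) are assumptions made for illustration only.

    import numpy as np

    rng = np.random.default_rng(0)

    K = 5                                  # toy number of candidate responses for a fixed prompt x
    rewards = rng.normal(size=K)           # toy reward values r(x, y)
    pi_ref = rng.dirichlet(np.ones(K))     # a fixed reference policy pi_ref(.|x) (assumption)
    scale = 0.1                            # hypothetical scaling constant for C_{2,x,y}

    C1 = 0.0                               # prompt-only constant C_{1,x}; it shifts the objective uniformly
    C2 = scale * pi_ref                    # candidate choice: C_{2,x,y} proportional to pi_ref(y|x)

    def expected_regularizer(pi, C1, C2):
        """E_{y~pi}[ -log pi(y|x) + C1 + C2_y / pi(y|x) ] over a finite response set."""
        return np.sum(pi * (-np.log(pi) + C1 + C2 / pi))

    # Two arbitrary policies over the same response set.
    pi_a = rng.dirichlet(np.ones(K))
    pi_b = rng.dirichlet(np.ones(K))

    # The C2 part of the expectation equals sum_y C2_y for *any* policy,
    # so it cannot change which policy maximizes the expected PM RLHF objective.
    c2_part_a = expected_regularizer(pi_a, C1, C2) - expected_regularizer(pi_a, C1, np.zeros(K))
    c2_part_b = expected_regularizer(pi_b, C1, C2) - expected_regularizer(pi_b, C1, np.zeros(K))
    print(c2_part_a, c2_part_b, C2.sum())  # all three coincide up to floating point

    # Preference-matching target under the Plackett-Luce model: pi(y|x) proportional to exp(r(x,y)).
    pi_pm = np.exp(rewards) / np.exp(rewards).sum()
    print(pi_pm)

Running the snippet prints three matching values for the C_{2,x,y} contribution under two different policies, alongside the preference-matching target softmax over rewards; the open question is whether some structured choice of C_{2,x,y}, such as the reference-policy-proportional one sketched here, actually improves training in practice.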

References

"It is not clear how to leverage the dependence on both the prompt and response for a choice of nonzero C_{2,x,y} with good practical performance. We leave this question for future research."

On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization (Xiao et al., arXiv:2405.16455, 26 May 2024), Section 3.4 (Extension to Response-Dependent Regularization)