Understanding the cause of extreme reference-model preferences in binary comparisons

Determine why the SFT reference language model produces the empirically observed near-deterministic conditional probabilities in binary response comparisons, and rigorously characterize the role that response length plays in generating this effect.

Background

When analyzing preference collapse under standard KL-regularized RLHF, the authors observe that the reference model (pretrained/SFT) frequently assigns probabilities to paired responses that are extremely close to 0 or 1, which can drive collapse in the aligned policy.

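They hypothesize that varying response lengths may explain this phenomenon: a response's probability is the product of its per-token probabilities, so two responses of different lengths can differ in probability by many orders of magnitude. They acknowledge, however, that the exact cause is presently unclear and requires further investigation.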
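The length hypothesis can be illustrated with a minimal sketch. It assumes that the pairwise preference is the reference model's probability renormalized over the two responses, and that every token has the same probability; both are simplifications made here for illustration, not the authors' experimental setup. The function names and the per-token probability of 0.5 are hypothetical.

```python
import math


def pairwise_reference_prob(logp_1: float, logp_2: float) -> float:
    """Probability of response 1 after renormalizing over the pair.

    Equals p1 / (p1 + p2), computed as a numerically stable sigmoid
    of the log-probability difference.
    """
    diff = logp_1 - logp_2
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def sequence_logprob(num_tokens: int, per_token_logp: float) -> float:
    """Log-probability of a response whose tokens all share one log-prob.

    Assumption for illustration: real token probabilities vary.
    """
    return num_tokens * per_token_logp


# Hypothetical setting: every token is assigned probability 0.5.
per_token_logp = math.log(0.5)

short_logp = sequence_logprob(20, per_token_logp)  # 20-token response
long_logp = sequence_logprob(40, per_token_logp)   # 40-token response

# Same length: the renormalized preference is exactly balanced.
p_equal = pairwise_reference_prob(short_logp, short_logp)

# Different lengths: the 20 extra tokens multiply in a factor of 0.5**20,
# so the shorter response receives almost all of the pairwise mass.
p_short = pairwise_reference_prob(short_logp, long_logp)

print(f"equal lengths:   {p_equal:.6f}")
print(f"20 vs 40 tokens: {p_short:.8f}")
```

In this toy setting a 20-token length gap alone pushes the pairwise probability to within about one part in a million of 1, even though the per-token quality of the two responses is identical. Whether length accounts for the extreme values observed in practice is exactly the open question.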

References

While the exact cause is unclear to us at the moment, some evidence shows that it might be attributed to varying lengths of responses.

— On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization (2405.16455 - Xiao et al., 26 May 2024) in Section 5.2 (Extreme Case: Preference Collapse in KL RLHF)
