Understanding the cause of extreme reference-model preferences in binary comparisons
Determine why the SFT reference language model produces the empirically observed near-deterministic conditional probabilities in binary response comparisons, and rigorously characterize the role that response length plays in this effect.
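To make the length hypothesis concrete, the sketch below shows how a modest gap in token count could already push a two-way comparison close to 0 or 1. It assumes, and this is an assumption rather than something the source states, that the conditional probability in question is the reference model's probability renormalized over the two candidate responses, and that a sequence log-probability is the sum of its per-token log-probabilities. The function names and the per-token log-probability of -1.0 are hypothetical, chosen only for illustration.

```python
import math


def pairwise_preference(logp_a: float, logp_b: float) -> float:
    """Probability of response a after renormalizing over {a, b}.

    Equals exp(logp_a) / (exp(logp_a) + exp(logp_b)), computed as a
    sigmoid of the log-probability gap to avoid underflow.
    """
    gap = logp_a - logp_b
    if gap >= 0:
        return 1.0 / (1.0 + math.exp(-gap))
    e = math.exp(gap)
    return e / (1.0 + e)


def sequence_logprob(num_tokens: int, avg_token_logp: float) -> float:
    """Sequence log-probability as a sum of identical per-token terms."""
    return num_tokens * avg_token_logp


# Hypothetical value: both responses have the same per-token quality.
AVG_TOKEN_LOGP = -1.0

short_logp = sequence_logprob(50, AVG_TOKEN_LOGP)  # 50-token response
long_logp = sequence_logprob(60, AVG_TOKEN_LOGP)   # 60-token response

# A 10-token length gap alone yields a log-probability gap of 10 nats.
p_short = pairwise_preference(short_logp, long_logp)
print(f"P(short response | pair) = {p_short:.6f}")
```

Under these assumptions the shorter response is preferred almost deterministically even though the two responses are identical per token, which is consistent with, but does not establish, the length-based explanation.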
References
While the exact cause is unclear to us at the moment, some evidence shows that it might be attributed to varying lengths of responses.
— On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization
(2405.16455 - Xiao et al., 26 May 2024) in Section 5.2 (Extreme Case: Preference Collapse in KL RLHF)