Frontier-Scale Validity of Random Sampling vs. Active Preference Learning in Online DPO
Determine whether the empirical finding that uncertainty-based Active Preference Learning provides little to no advantage over random sampling in online Direct Preference Optimization (DPO) also holds for frontier-scale large language models with at least 70 billion parameters.
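To make the comparison concrete, the two selection strategies can be sketched as follows. This is an illustrative toy, not the paper's implementation: the per-prompt `margin` values, the entropy-based acquisition rule, and all function names are assumptions. It contrasts picking prompts whose Bradley-Terry preference probability is most ambiguous (margin near zero) against uniform random sampling.

```python
import math
import random

def preference_uncertainty(margin: float) -> float:
    """Bernoulli entropy of the Bradley-Terry preference probability
    p = sigmoid(margin); maximal when the reward margin is near zero."""
    p = 1.0 / (1.0 + math.exp(-margin))
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def select_active(pool, k):
    """Uncertainty-based active selection: keep the k prompts whose
    implicit reward margin between the two candidate responses is most
    ambiguous (highest preference entropy)."""
    return sorted(pool, key=lambda ex: -preference_uncertainty(ex["margin"]))[:k]

def select_random(pool, k, rng=random.Random(0)):
    """Random baseline: sample k prompts uniformly without replacement."""
    return rng.sample(pool, k)

# Toy pool; the margin of each prompt is a hypothetical implicit
# DPO reward difference between its chosen and rejected responses.
pool = [{"id": i, "margin": m}
        for i, m in enumerate([-3.0, -0.1, 0.05, 1.2, 4.0, 0.0])]
print([ex["id"] for ex in select_active(pool, 3)])  # → [5, 2, 1]
print([ex["id"] for ex in select_random(pool, 3)])
```

The open question is whether, at ≥70B scale, the batches chosen by `select_active` yield any measurable alignment gain over those from `select_random`, or whether stronger model priors erase the remaining headroom.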
References
Whether the same conclusion holds at frontier scales ($\geq$70B) remains an open question: stronger priors may further diminish the headroom for active selection, but the alignment dynamics of much larger models could also differ in ways that are difficult to predict without direct evaluation.
— Random Is Hard to Beat: Active Selection in online DPO with Modern LLMs
(2604.02766 - Oh et al., 3 Apr 2026) in Section: Limitations