Frontier-scale validity of Random sampling vs Active Preference Learning in online DPO

Determine whether the empirical finding that uncertainty-based Active Preference Learning provides little to no advantage over Random sampling in online Direct Preference Optimization persists for frontier-scale large language models of at least 70 billion parameters.

Background

The paper presents a controlled empirical study showing that uncertainty-based Active Preference Learning (APL) offers little to no consistent advantage over simple Random sampling for online Direct Preference Optimization (DPO) across harmlessness, helpfulness, and instruction-following tasks, using models up to 7B parameters. The authors observe evaluator-dependent failure modes and note that strong pretraining priors and the richness of the on-policy candidate pool make Random sampling a formidable baseline, while APL incurs substantial computational overhead.
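To make the two selection strategies concrete, the sketch below shows one round of candidate selection for online DPO: uniform Random sampling from the on-policy candidate pool versus an uncertainty-based acquisition rule that keeps the pairs whose DPO implicit-reward margin is closest to zero (i.e., where the current policy is least decided about which response it prefers). The helper names, dictionary fields, margin-based criterion, and beta value are illustrative assumptions, not the paper's exact APL acquisition function.

```python
import random

BETA = 0.1  # DPO temperature; illustrative value, not taken from the paper


def implicit_reward_margin(pair, beta=BETA):
    """DPO implicit reward margin between the two candidate responses in `pair`.

    Under DPO, r(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x)), so the
    margin r(x, y_a) - r(x, y_b) reduces to the expression below. `pair` is a
    dict of per-response log-probabilities under the current policy and the
    frozen reference model (field names are illustrative).
    """
    return beta * (
        (pair["logp_policy_a"] - pair["logp_ref_a"])
        - (pair["logp_policy_b"] - pair["logp_ref_b"])
    )


def select_random(pool, k, rng=random):
    """Random baseline: draw k candidate pairs uniformly from the pool."""
    return rng.sample(pool, k)


def select_uncertain(pool, k, beta=BETA):
    """Uncertainty-based APL (one common variant): keep the k pairs whose
    implicit reward margin is closest to zero, i.e. where the induced
    Bradley-Terry preference probability is closest to 0.5."""
    return sorted(pool, key=lambda p: abs(implicit_reward_margin(p, beta)))[:k]


if __name__ == "__main__":
    rng = random.Random(0)
    # Fake on-policy candidate pool: 16 response pairs, represented only by
    # their summed log-probabilities under the policy and reference models.
    pool = [
        {
            "logp_policy_a": rng.uniform(-80, -20),
            "logp_policy_b": rng.uniform(-80, -20),
            "logp_ref_a": rng.uniform(-80, -20),
            "logp_ref_b": rng.uniform(-80, -20),
        }
        for _ in range(16)
    ]
    random_batch = select_random(pool, 4, rng=rng)
    apl_batch = select_uncertain(pool, 4)
    print(len(random_batch), len(apl_batch))
```

In both conditions the selected pairs would then be labeled by the evaluator and used for a DPO update; the only difference is the acquisition rule. Scoring the entire candidate pool to compute uncertainties requires extra forward passes, which is the source of the computational overhead the paper attributes to APL.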

However, all experiments are limited to models with 7B parameters or fewer. The authors explicitly state that it remains unknown whether their conclusion holds at frontier scales (≥70B), where stronger priors might further reduce the marginal benefit of active selection, or where different alignment dynamics could emerge. This leaves open the question of whether their negative result generalizes to much larger models.

References

Whether the same conclusion holds at frontier scales (≥70B) remains an open question: stronger priors may further diminish the headroom for active selection, but the alignment dynamics of much larger models could also differ in ways that are difficult to predict without direct evaluation.

Random Is Hard to Beat: Active Selection in online DPO with Modern LLMs (2604.02766 - Oh et al., 3 Apr 2026) in Section: Limitations