Accelerating LLM-as-judge training judgments without degrading quality

Develop computationally efficient large-language-model-as-judge evaluation schemes for Self-Improving Pretraining that avoid exhaustive pairwise comparisons among K candidate completions during online Direct Preference Optimization (DPO), while preserving or improving the training effectiveness and final generation quality achieved with full pairwise judgments. In particular, devise alternatives to the pivot-based comparison used for suffix quality/factuality selection that reduce judgment cost without the performance deterioration the authors report.

Background

Self-Improving Pretraining relies on a strong, post-trained model acting as a judge to score candidate completions (original suffix, rewritten suffix, and multiple rollouts) during online reinforcement learning updates such as DPO. For quality training, the judge conducts pairwise comparisons across all candidates, which requires a number of judge calls quadratic in the number of candidates and is therefore computationally expensive when K rollouts are sampled.
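As a rough illustration (not the authors' implementation), full pairwise judging of K candidates costs K(K-1)/2 judge calls, with wins aggregated into per-candidate scores. The `judge` callable here is a hypothetical stand-in for an LLM judge prompted to pick the better of two completions:

```python
from itertools import combinations

def full_pairwise_scores(candidates, judge):
    """Score each candidate by its number of pairwise wins.

    `judge(a, b)` is a hypothetical callable returning the preferred
    completion; in practice it would be a prompted LLM judge.
    Cost: K*(K-1)/2 judge calls for K candidates.
    """
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        wins[judge(a, b)] += 1
    return wins

# Toy judge that prefers the longer string, standing in for an LLM judge.
cands = ["a", "bb", "ccc", "dddd"]
scores = full_pairwise_scores(cands, lambda a, b: max(a, b, key=len))
# 4 candidates -> 6 judge calls; "dddd" wins all 3 of its pairings.
```

The win counts give a full ranking over candidates, which is what makes this scheme informative for preference-pair construction but expensive as K grows.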

To reduce judgment cost, the authors experimented with a pivot strategy: selecting a single candidate (e.g., the suffix) and comparing all other candidates only against that pivot. In ablations, this approach led to noticeable performance degradation versus full pairwise comparisons, indicating a trade-off between speed and training quality. The authors therefore identify the need for faster judgment schemes that maintain quality as an unresolved question.
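A minimal sketch of the pivot idea, under the same toy-judge assumption as above: every candidate is compared only against a fixed pivot, cutting judge calls from quadratic to K-1 but yielding only a coarse beat-the-pivot signal rather than a full ranking, which is consistent with the degradation the ablation reports:

```python
def pivot_scores(candidates, pivot, judge):
    """Compare every candidate only against a fixed pivot.

    `judge(a, b)` is a hypothetical callable returning the preferred
    completion. Reduces judge calls from K*(K-1)/2 to K-1, but produces
    only a binary won/lost-against-the-pivot label per candidate.
    """
    return {c: int(judge(c, pivot) == c)
            for c in candidates if c is not pivot}

# Toy judge that prefers the longer string, standing in for an LLM judge.
cands = ["a", "bb", "ccc", "dddd"]
beats = pivot_scores(cands, pivot="bb", judge=lambda a, b: max(a, b, key=len))
# 3 judge calls; "ccc" and "dddd" beat the pivot, "a" does not.
```

Note that "ccc" and "dddd" receive identical scores here even though full pairwise comparison would separate them, illustrating the information lost by pivoting.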

References

Overall we find deterioration in performance from using pivots, leaving how to make judgments faster while maintaining quality an open question.

Self-Improving Pretraining: using post-trained models to pretrain better models  (2601.21343 - Tan et al., 29 Jan 2026) in Subsubsection “Pivots in pairwise comparison judgments,” Section “Analysis and ablations”