Accelerating LLM-as-judge training judgments without degrading quality
Develop computationally efficient LLM-as-judge evaluation schemes for Self-Improving Pretraining that avoid exhaustive pairwise comparisons among the K candidate completions during online Direct Preference Optimization (DPO), while preserving or improving the training effectiveness and final generation quality obtained with full pairwise judgments. In particular, devise alternatives to the pivot-based comparison used for suffix quality/factuality selection that reduce judgment cost without the performance deterioration the authors report.
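To make the cost gap concrete, the sketch below contrasts the number of judge calls in exhaustive pairwise selection (K(K-1)/2) with a pivot-based scheme (K-1). The `judge` function is a hypothetical stand-in for an LLM-as-judge query, and the pivot selection rule (keep the first candidate that beats the pivot, else the pivot) is an illustrative assumption, not the paper's exact procedure.

```python
import itertools

def judge(a: str, b: str) -> str:
    # Hypothetical stand-in for an LLM-as-judge query; a deterministic
    # placeholder here so the cost comparison is runnable.
    return min(a, b)

def full_pairwise_best(candidates):
    """Exhaustive scheme: K*(K-1)/2 judge calls; pick the candidate
    with the most pairwise wins."""
    wins = {c: 0 for c in candidates}
    calls = 0
    for a, b in itertools.combinations(candidates, 2):
        wins[judge(a, b)] += 1
        calls += 1
    return max(candidates, key=lambda c: wins[c]), calls

def pivot_best(candidates):
    """Pivot scheme (sketch): compare every other candidate against a
    single pivot, for K-1 judge calls. The selection rule is an
    illustrative assumption."""
    pivot, rest = candidates[0], candidates[1:]
    winner, calls = pivot, 0
    for c in rest:
        calls += 1
        if judge(pivot, c) == c and winner == pivot:
            winner = c
    return winner, calls

cands = ["d", "a", "c", "b"]  # K = 4 candidate completions
_, full_calls = full_pairwise_best(cands)
_, pivot_calls = pivot_best(cands)
print(full_calls, pivot_calls)  # 6 vs 3 judge calls
```

The quadratic-to-linear reduction is what makes pivots attractive; the open question is recovering that saving without the quality loss the authors observe.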
References
Overall we find deterioration in performance from using pivots, leaving how to make judgments faster while maintaining quality an open question.
— Self-Improving Pretraining: using post-trained models to pretrain better models
(2601.21343 - Tan et al., 29 Jan 2026) in Subsubsection “Pivots in pairwise comparison judgments,” Section “Analysis and ablations”