Cause of train–test reward mismatch in RLER training
Investigate whether the gap between high training rewards and lower downstream evaluation performance in DR Tulu results from mismatches between the training setup and the external benchmarks in their tasks, rubric designs, and evaluation protocols, and characterize the conditions under which such mismatches occur.
References
We conjecture that this stems from a mismatch between the tasks, rubrics, and evaluation setups of the external benchmarks vs. what we used for training. We defer a deeper investigation of this mismatch phenomenon to future work, as addressing it would help improve the effectiveness of rubrics for RL training.
— DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research
(2511.19399 - Shao et al., 24 Nov 2025) in Discussion and Future Work, subsection "The train-test mismatch challenge"