Cause of train–test reward mismatch in RLER training

Investigate whether the observed discrepancy between high training rewards and lower downstream evaluation performance in DR Tulu results from mismatches between training-time tasks, rubric designs, and evaluation protocols, and characterize conditions under which such mismatches occur.

Background

During development of DR Tulu, the authors observed that higher training rewards did not consistently translate into better external benchmark scores. They hypothesize that differences between training-time and benchmark judges, as well as differences in rubrics and task setups, may lead to reward hacking or misalignment.

Validating this conjecture would inform how to align training and evaluation environments, including judge choice and rubric construction, to ensure that improvements measured during training predict external performance.
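One simple way to quantify whether training rewards predict external performance is to compute a rank correlation across checkpoints. The following is a minimal sketch of such a diagnostic, not a procedure described in the paper: the function names are hypothetical, and the reward and score values are made-up placeholders, not results from DR Tulu.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder per-checkpoint values (illustrative only, NOT from the paper).
train_reward = [0.40, 0.50, 0.60, 0.70, 0.80]
benchmark_score = [0.30, 0.35, 0.38, 0.36, 0.33]

rho = spearman(train_reward, benchmark_score)
print(f"Spearman rho between training reward and benchmark score: {rho:.2f}")
```

A correlation near 1 would suggest that training reward tracks the benchmark; a low or negative value, as with these placeholder numbers, would flag the kind of mismatch described above and motivate a closer look at judges, rubrics, and task setups.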

References

We conjecture that this stems from a mismatch between the tasks, rubrics, and evaluation setups of the external benchmarks vs. what we used for training. We defer a deeper investigation of this mismatch phenomenon to future work, as addressing it would help improve the effectiveness of rubrics for RL training.

DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research (2511.19399 - Shao et al., 24 Nov 2025) in Discussion and Future Work, subsection "The train-test mismatch challenge"