Explaining drop-and-rise dynamics during RL after SFT initialization
Investigate the causes of the drop-and-rise pattern observed in sequence length and tool-call counts during reinforcement learning when training is initialized from a supervised fine-tuning checkpoint, and determine whether it reflects the model first unlearning unsuccessful SFT behaviors before stabilizing and exploring new strategies.
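One way to make this investigation concrete is to quantify the pattern directly from training logs. Below is a minimal Python sketch, not from the paper; the function names, thresholds, and synthetic data are illustrative assumptions. It flags a drop-and-rise pattern in a per-step metric such as mean response length or mean tool-call count: the smoothed curve must first fall noticeably below its starting value and then recover.

```python
# Minimal sketch (illustrative, not the paper's method) for detecting a
# drop-and-rise pattern in an RL training curve. Assumes per-step metric
# averages (e.g., response length or tool-call count) exported from logs.
import numpy as np

def moving_average(x: np.ndarray, window: int = 10) -> np.ndarray:
    """Smooth a noisy training curve with a simple moving average."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")

def detect_drop_and_rise(metric: np.ndarray,
                         window: int = 10,
                         drop_frac: float = 0.1,
                         rise_frac: float = 0.05) -> dict:
    """Flag a drop-and-rise: the smoothed curve falls at least `drop_frac`
    below its starting value, then recovers to within `rise_frac` of (or
    above) that starting value. Thresholds are arbitrary assumptions."""
    smooth = moving_average(metric, window)
    start = smooth[0]
    trough_idx = int(np.argmin(smooth))
    trough = smooth[trough_idx]
    recovered = smooth[trough_idx:].max()
    dropped = trough <= start * (1 - drop_frac)
    rose = recovered >= start * (1 - rise_frac)
    return {
        "drop_and_rise": bool(dropped and rose),
        "trough_step": trough_idx,       # step index of the minimum
        "drop_depth": float(start - trough),
        "recovery_level": float(recovered),
    }

# Usage on a synthetic curve that dips around step 120 and then recovers,
# standing in for logged per-step mean tool-call counts.
steps = np.arange(500)
tool_calls = 3.0 - 1.5 * np.exp(-((steps - 120) / 60.0) ** 2) \
             + np.random.default_rng(0).normal(0, 0.05, steps.size)
print(detect_drop_and_rise(tool_calls))
```

Locating the trough step this way would also give a natural point to compare checkpoints, e.g., probing whether SFT-style behaviors are less frequent at the trough than at initialization.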
References
A similar drop-and-rise behaviour when combining RL training with SFT cold-start data has been observed in other domains such as mathematical reasoning~\citep{chen2025twostagetrainingcooperativesft}, and we leave further investigation of this phenomenon to future work.
— DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research
(Shao et al., arXiv:2511.19399, 24 Nov 2025), Appendix, Section "Full RL Training Curves"