
Explaining drop-and-rise dynamics during RL after SFT initialization

Investigate the causes of the observed drop-and-rise behavior in sequence length and tool-call counts during reinforcement learning when initializing from supervised fine-tuning (SFT), and establish whether it reflects an initial phase of unlearning unsuccessful SFT behaviors, followed by stabilization and exploration of new strategies.


Background

During RL, the training curves show an initial decrease followed by an increase in both sequence length and tool-call counts. The authors hypothesize that this may reflect an unlearning phase followed by exploration and stabilization, as observed in other domains.

A rigorous analysis would help diagnose and control training dynamics, potentially leading to better training schedules or regularization strategies for deep research agents.
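As a starting point for such an analysis, one could flag the drop-and-rise pattern automatically in logged training metrics. The sketch below is illustrative only and not from the paper: the function name, smoothing window, and thresholds (`detect_drop_and_rise`, `window`, `drop_frac`, `rise_frac`) are hypothetical choices for detecting a trough followed by a recovery in a per-step metric such as mean sequence length or tool-call count.

```python
# Illustrative sketch (not the authors' method): flag a drop-and-rise pattern
# in a logged RL training metric, e.g. mean sequence length per step.
# All names and thresholds are hypothetical.
import numpy as np


def detect_drop_and_rise(metric_per_step, window=20, drop_frac=0.1, rise_frac=0.1):
    """Return (has_pattern, trough_index) for a 1-D array of per-step metrics.

    A drop-and-rise is flagged when the smoothed curve falls at least
    `drop_frac` below its initial level and later recovers by at least
    `rise_frac` above the trough.
    """
    x = np.asarray(metric_per_step, dtype=float)
    if len(x) < 2 * window:
        return False, None

    # Moving-average smoothing to suppress step-to-step noise.
    kernel = np.ones(window) / window
    smooth = np.convolve(x, kernel, mode="valid")

    start = smooth[0]
    trough_idx = int(np.argmin(smooth))
    trough = smooth[trough_idx]
    later_peak = smooth[trough_idx:].max()

    dropped = trough < start * (1.0 - drop_frac)
    rose = later_peak > trough * (1.0 + rise_frac)
    found = bool(dropped and rose)
    return found, trough_idx if found else None


if __name__ == "__main__":
    # Synthetic curve: sequence length dips early in training, then recovers.
    steps = np.arange(300)
    curve = 900 - 300 * np.exp(-((steps - 80) ** 2) / 2000) + np.random.randn(300) * 10
    print(detect_drop_and_rise(curve))
```

Such a detector could, for example, be used to compare when the trough occurs across SFT-initialized and from-scratch RL runs, or to trigger logging of additional diagnostics around the transition point.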

References

A similar drop-and-rise behaviour when combining RL training with SFT cold-start data has been observed in other domains such as mathematical reasoning~\citep{chen2025twostagetrainingcooperativesft}, and we leave further investigation of this phenomenon to future work.

DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research (2511.19399 - Shao et al., 24 Nov 2025) in Appendix, Section "Full RL Training Curves"