Eliminating initial reward drop in Phase‑1 IRL with small KL regularization
Determine how to eliminate the initial drop in rewards observed during the Phase‑1 Interactive Reinforcement Learning (IRL) stage of Double Interactive Reinforcement Learning (DIRL) when the Qwen2.5‑VL‑3B‑based SpaceTools model is trained with Group Relative Policy Optimization (GRPO) under a smaller KL regularization coefficient.
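For context, the KL regularization coefficient in question is the weight on the per-token KL penalty that GRPO applies between the trained policy and a frozen reference policy. The following is a minimal sketch of where that coefficient enters the loss, assuming standard GRPO with the k3 KL estimator; tensor names, default values, and the omission of masking/length normalization are illustrative choices, not taken from the paper:

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages,
              kl_coef=0.04, clip_eps=0.2):
    """Per-token GRPO loss: clipped policy-gradient term plus a KL penalty
    toward the reference policy, weighted by kl_coef.

    logp_new, logp_old, logp_ref: log-probs of the sampled tokens under the
    current, rollout, and frozen reference policies; advantages: group-
    normalized rewards broadcast over tokens. All shapes: (batch, seq_len).
    """
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    pg_term = torch.min(ratio * advantages, clipped * advantages)

    # k3 estimator of KL(pi_new || pi_ref): unbiased and non-negative.
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0

    # A smaller kl_coef loosens the tie to the reference model, which can
    # speed learning but, per the excerpt below, destabilizes early training.
    return -(pg_term - kl_coef * kl).mean()
```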
References
However, this introduces a trade-off in training stability—specifically, we observe an initial drop in rewards during Phase-1 IRL when using a smaller KL coefficient. We experimented with format rewards, format penalties, alternative KL loss formulations, and related variants, but were unable to eliminate this effect, suggesting that further investigation is needed.
— SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL (arXiv:2512.04069, Chen et al., 3 Dec 2025), Appendix, Additional Implementation Details (More Training Details)
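The "alternative KL loss formulations" mentioned in the excerpt plausibly refer to different Monte-Carlo estimators of the KL term; the paper does not specify which, so the sketch below simply contrasts the three estimators in common use (Schulman's k1/k2/k3 naming), under the same assumed tensor shapes as above:

```python
import torch

def kl_estimators(logp_new, logp_ref):
    """Three per-token Monte-Carlo estimators of KL(pi_new || pi_ref),
    evaluated on tokens sampled from pi_new."""
    log_r = logp_ref - logp_new           # log(pi_ref / pi_new) per token
    k1 = -log_r                           # unbiased, can be negative, high variance
    k2 = 0.5 * log_r ** 2                 # biased, always non-negative, low variance
    k3 = torch.exp(log_r) - log_r - 1.0   # unbiased and non-negative (GRPO default)
    return k1, k2, k3
```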