Eliminating initial reward drop in Phase‑1 IRL with small KL regularization

Determine how to eliminate the initial drop in rewards observed at the start of the Phase‑1 Interactive Reinforcement Learning (IRL) stage of Double Interactive Reinforcement Learning (DIRL) when the Qwen2.5‑VL‑3B‑based SpaceTools model is trained with Group Relative Policy Optimization (GRPO) under a smaller KL regularization coefficient.

Background

DIRL trains the SpaceTools vision‑LLM using GRPO in two stages: a teaching phase (Phase‑1 IRL and SFT) followed by an exploration phase (Phase‑2 IRL). The authors report that a relatively small KL coefficient is necessary to encourage sufficient exploration during RL but introduces a training‑stability trade‑off.
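For concreteness, the following is a minimal, sequence‑level sketch (in PyTorch) of a GRPO‑style objective in which such a KL coefficient appears. The function name, the default `beta` and `eps` values, and the choice of the k3 KL estimator are illustrative assumptions; the paper does not spell out SpaceTools' exact loss formulation.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, group_rewards, beta=0.001, eps=0.2):
    """Sketch of a GRPO-style objective with a KL penalty of weight `beta`.

    logp_new / logp_old / logp_ref: per-sequence log-probs under the current,
    rollout, and frozen reference policies, shape (group_size,).
    group_rewards: scalar reward for each completion in the sampled group.
    `beta` here is an illustrative small value, not the paper's setting.
    """
    # Group-relative advantage: normalize rewards within the sampled group.
    adv = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)

    # PPO-style clipped surrogate on the importance ratio.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * adv)

    # KL penalty toward the reference policy (k3 estimator).
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0

    # Maximize surrogate minus KL penalty; return the negated mean as a loss.
    return -(surrogate - beta * kl).mean()
```

In this form, shrinking `beta` weakens the pull toward the reference policy, which is precisely what the authors rely on for exploration and what coincides with the early reward drop they report.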

Specifically, with smaller KL regularization they consistently observe an initial drop in rewards at the beginning of Phase‑1 IRL. Despite experimenting with format rewards, format penalties, alternative KL loss formulations, and related variants, they were unable to remove this degradation, indicating an unresolved stability issue that needs further investigation.
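The mitigations the authors list can be illustrated generically. The sketch below is hypothetical: the tool‑call tag names in `format_reward` and the estimators in `kl_estimators` are assumptions about what "format rewards/penalties" and "alternative KL loss formulations" might look like, not the SpaceTools implementation.

```python
import torch

def format_reward(completion: str, bonus: float = 0.1, penalty: float = -0.1) -> float:
    """Hypothetical format-shaping term: small bonus for balanced tool-call
    tags, small penalty otherwise. Tag names and magnitudes are placeholders,
    not the actual SpaceTools reward schema."""
    well_formed = completion.count("<tool_call>") == completion.count("</tool_call>")
    return bonus if well_formed else penalty

def kl_estimators(logp_new: torch.Tensor, logp_ref: torch.Tensor) -> dict:
    """Common per-token estimators of KL(pi_new || pi_ref) (Schulman's k1/k2/k3);
    'alternative KL loss formulations' plausibly refers to choices like these."""
    log_ratio = logp_ref - logp_new  # log(pi_ref / pi_new) on samples from pi_new
    return {
        "k1": -log_ratio,                              # unbiased, high variance
        "k2": 0.5 * log_ratio ** 2,                    # biased, lower variance
        "k3": torch.exp(log_ratio) - log_ratio - 1.0,  # unbiased, low variance
    }
```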

References

However, this introduces a trade-off in training stability—specifically, we observe an initial drop in rewards during Phase-1 IRL when using a smaller KL coefficient. We experimented with format rewards, format penalties, alternative KL loss formulations, and related variants, but were unable to eliminate this effect, suggesting that further investigation is needed.

SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL (2512.04069 - Chen et al., 3 Dec 2025) in Appendix, Section: Additional Implementation Details (More Training Details)