
Stabilizing mixed IFT–reasoning training and reliably exploiting its benefits

Develop training strategies that stabilize mixed-style supervised fine-tuning, in which Instruction Fine-Tuning (IFT) and explicit reasoning (chain-of-thought) instances are combined from the start (the mixed regime), so that models avoid instability and abrupt mode-switching and can consistently realize the potential performance benefits of mixed-style training across tasks and reasoning ratios.


Background

The paper compares sequential and mixed regimes for combining IFT and reasoning supervision. While mixed training shows moderate synergies at certain reasoning ratios, the authors observe instability and abrupt transitions to reasoning-only outputs once the reasoning share exceeds 50% of the mix.

Because of this instability, the paper focuses on sequential training thereafter, explicitly noting that stabilizing mixed-style training to consistently capture its benefits is left for future work.
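To make the setup concrete, the mixed regime amounts to sampling training examples from an IFT pool and a reasoning pool at a fixed reasoning ratio and shuffling them into one dataset. The sketch below is illustrative only; the function name, signature, and data representation are assumptions, not the paper's implementation.

```python
import random


def build_mixed_dataset(ift_pool, reasoning_pool, reasoning_ratio, n_total, seed=0):
    """Sample a shuffled training mix in which `reasoning_ratio` of the
    `n_total` examples come from the reasoning (chain-of-thought) pool
    and the rest from the IFT pool. Hypothetical helper, not from the paper.
    """
    rng = random.Random(seed)
    n_reason = round(n_total * reasoning_ratio)
    n_ift = n_total - n_reason
    mix = rng.sample(reasoning_pool, n_reason) + rng.sample(ift_pool, n_ift)
    rng.shuffle(mix)  # interleave the two styles from the start of training
    return mix


# Example: a 60% reasoning mix, the regime the paper finds unstable (>50%).
ift = [{"style": "ift", "id": i} for i in range(100)]
cot = [{"style": "reasoning", "id": i} for i in range(100)]
mix = build_mixed_dataset(ift, cot, reasoning_ratio=0.6, n_total=50)
```

Sweeping `reasoning_ratio` over such mixes is what exposes the abrupt transition to reasoning-only outputs described above.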

References

In consequence, we focus on the sequential setting for the remainder of this study, leaving stabilization of mixed-style training and consistent exploitation of its potential benefits to future work.

When Does Reasoning Matter? A Controlled Study of Reasoning's Contribution to Model Performance (2509.22193 - Boizard et al., 26 Sep 2025) in Section 3.2, Impact of Mixing IFT and Reasoning Data