Mechanism behind increased overlong reasoning rates after standard SFT on truncated traces
Investigate and elucidate the mechanism behind the observed increase in overlong reasoning rates when applying standard supervised fine-tuning to Hermes-4-Qwen3-14B on a mixture of thinking data truncated at 20,000 tokens with a forced </think> appended and a subset of the initial Stage 1 SFT data. In particular, determine whether selecting for long-reasoning examples induces generation of specific reasoning prefixes that prolong thinking without termination.
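A minimal sketch of how such a mixture might be constructed, assuming traces stored as dicts with prompt/thinking/answer fields, a Hugging Face tokenizer for the Qwen3-14B base model, and a <think>…</think> chat format. The field names, checkpoint path, and mixing scheme are illustrative assumptions; only the forced </think> at the 20,000-token budget comes from the description above.

```python
import random
from transformers import AutoTokenizer

THINK_BUDGET = 20_000  # reasoning-token budget at which </think> is forced

# Tokenizer of the base model; the exact checkpoint path is an assumption.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-14B")

def truncate_thinking(trace: dict) -> dict:
    """Cut the thinking segment at the budget and force a closing </think>."""
    trace = dict(trace)
    think_ids = tokenizer(trace["thinking"], add_special_tokens=False)["input_ids"]
    if len(think_ids) > THINK_BUDGET:
        trace["thinking"] = tokenizer.decode(think_ids[:THINK_BUDGET])
    # Reassemble the target so the model is trained to emit </think> at or
    # before the budget instead of reasoning past it.
    trace["target"] = f"<think>\n{trace['thinking']}\n</think>\n{trace['answer']}"
    return trace

def build_mixture(long_traces: list[dict], stage1_data: list[dict],
                  stage1_subset_size: int) -> list[dict]:
    """Mix truncated-thinking traces with a random subset of Stage 1 SFT data."""
    truncated = [truncate_thinking(t) for t in long_traces]
    stage1_subset = random.sample(stage1_data, stage1_subset_size)
    return truncated + stage1_subset
```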
References
We do not fully understand the mechanism behind the increase, but we currently believe that certain reasoning prefixes (e.g. “Alternatively, … Alternatively, …”) can induce longer reasoning, and that selecting for samples with long reasoning chains teaches the model to generate these prefixes more frequently.
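One rough way to probe this hypothesis would be to count pivot phrases such as “Alternatively,” in sampled thinking traces before and after SFT and check whether their frequency rises alongside the overlong rate. The sketch below assumes each trace carries its raw thinking text and token count; the phrase list, field names, and metric are hypothetical, not the actual analysis.

```python
import re
from statistics import mean

# Hypothetical pivot phrases suspected of prolonging reasoning.
PIVOT_PATTERNS = [r"\bAlternatively,", r"\bWait,", r"\bLet me reconsider\b"]
OVERLONG_THRESHOLD = 20_000  # reasoning-token budget from the setup above

def pivot_count(thinking: str) -> int:
    """Count pivot-phrase occurrences in a single thinking trace."""
    return sum(len(re.findall(p, thinking)) for p in PIVOT_PATTERNS)

def pivots_per_1k_tokens(trace: dict) -> float:
    """Pivot-phrase density, normalized by reasoning length."""
    return 1000 * pivot_count(trace["thinking"]) / max(trace["thinking_tokens"], 1)

def summarize(traces: list[dict]) -> dict:
    """Compare pivot-phrase density between terminated and overlong traces."""
    overlong = [t for t in traces if t["thinking_tokens"] >= OVERLONG_THRESHOLD]
    normal = [t for t in traces if t["thinking_tokens"] < OVERLONG_THRESHOLD]
    return {
        "overlong_rate": len(overlong) / len(traces),
        "pivot_density_overlong": mean(map(pivots_per_1k_tokens, overlong)) if overlong else 0.0,
        "pivot_density_normal": mean(map(pivots_per_1k_tokens, normal)) if normal else 0.0,
    }
```

If the post-SFT model showed both a higher overlong rate and a higher pivot-phrase density, that would be consistent with the selection-for-length explanation above, though it would not rule out other mechanisms.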