Mechanism behind increased overlong reasoning rates after standard SFT on truncated traces
Investigate and elucidate the mechanism behind the observed increase in overlong reasoning rates when applying standard supervised fine-tuning to Hermes-4-Qwen3-14B on a mixture of thinking data truncated at 20,000 tokens with a forced </think> appended and a subset of the initial Stage 1 SFT data. In particular, determine whether selecting for long-reasoning examples induces generation of specific reasoning prefixes that prolong thinking without termination.
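A minimal sketch of how such a mixture might be constructed, assuming traces stored as dicts with prompt/thinking/answer fields, a Hugging Face tokenizer for the Qwen3-14B base model, and a <think>…</think> chat format. The field names, checkpoint path, and mixing scheme are illustrative assumptions; only the forced </think> at the 20,000-token budget comes from the description above.

```python
import random
from transformers import AutoTokenizer

THINK_BUDGET = 20_000  # reasoning-token budget at which </think> is forced

# Tokenizer of the base model; the exact checkpoint path is an assumption.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-14B")

def truncate_thinking(trace: dict) -> dict:
    """Cut the thinking segment at the budget and force a closing </think>."""
    trace = dict(trace)
    think_ids = tokenizer(trace["thinking"], add_special_tokens=False)["input_ids"]
    if len(think_ids) > THINK_BUDGET:
        trace["thinking"] = tokenizer.decode(think_ids[:THINK_BUDGET])
    # Reassemble the target so the model is trained to emit </think> at or
    # before the budget instead of reasoning past it.
    trace["target"] = f"<think>\n{trace['thinking']}\n</think>\n{trace['answer']}"
    return trace

def build_mixture(long_traces: list[dict], stage1_data: list[dict],
                  stage1_subset_size: int) -> list[dict]:
    """Mix truncated-thinking traces with a random subset of Stage 1 SFT data."""
    truncated = [truncate_thinking(t) for t in long_traces]
    stage1_subset = random.sample(stage1_data, stage1_subset_size)
    return truncated + stage1_subset
```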
References
We do not fully understand the mechanism behind the increase, but we currently believe that certain reasoning prefixes (e.g. “Alternatively, … Alternatively, …”) can induce longer reasoning, and that selecting for samples with long reasoning chains teaches the model to generate these prefixes more frequently.
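One rough way to probe this hypothesis would be to count pivot phrases such as “Alternatively,” in sampled thinking traces before and after SFT and check whether their frequency rises alongside the overlong rate. The sketch below assumes each trace carries its raw thinking text and token count; the phrase list, field names, and metric are hypothetical, not the actual analysis.

```python
import re
from statistics import mean

# Hypothetical pivot phrases suspected of prolonging reasoning.
PIVOT_PATTERNS = [r"\bAlternatively,", r"\bWait,", r"\bLet me reconsider\b"]
OVERLONG_THRESHOLD = 20_000  # reasoning-token budget from the setup above

def pivot_count(thinking: str) -> int:
    """Count pivot-phrase occurrences in a single thinking trace."""
    return sum(len(re.findall(p, thinking)) for p in PIVOT_PATTERNS)

def pivots_per_1k_tokens(trace: dict) -> float:
    """Pivot-phrase density, normalized by reasoning length."""
    return 1000 * pivot_count(trace["thinking"]) / max(trace["thinking_tokens"], 1)

def summarize(traces: list[dict]) -> dict:
    """Compare pivot-phrase density between terminated and overlong traces."""
    overlong = [t for t in traces if t["thinking_tokens"] >= OVERLONG_THRESHOLD]
    normal = [t for t in traces if t["thinking_tokens"] < OVERLONG_THRESHOLD]
    return {
        "overlong_rate": len(overlong) / len(traces),
        "pivot_density_overlong": mean(map(pivots_per_1k_tokens, overlong)) if overlong else 0.0,
        "pivot_density_normal": mean(map(pivots_per_1k_tokens, normal)) if normal else 0.0,
    }
```

If the post-SFT model showed both a higher overlong rate and a higher pivot-phrase density, that would be consistent with the selection-for-length explanation above, though it would not rule out other mechanisms.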