Stage-specific effects of pre-RL training on CoT controllability in Olmo-3–Think
Determine how the supervised finetuning (SFT) and direct preference optimization (DPO) stages, both of which precede reinforcement learning (RL), each affect chain-of-thought (CoT) controllability in Olmo-3–Think models, isolating each stage’s causal contribution to the observed declines.
References
However, the specific impact of training stages prior to RL remains inconclusive, which we leave for future work to explore.
— Reasoning Models Struggle to Control their Chains of Thought
(2603.05706 - Yueh-Han et al., 5 Mar 2026) in Appendix, Figure “olmo_think_stages” caption