Stage-specific effects prior to RL in Olmo-3–Think on CoT controllability

Determine the specific effects of the supervised finetuning (SFT) and direct preference optimization (DPO) stages, prior to reinforcement learning (RL), on chain-of-thought (CoT) controllability in Olmo-3–Think models by isolating each stage's causal contribution to the observed declines.

Background

The authors evaluate CoT controllability across the stages of the Olmo-3–Think post-training pipeline (SFT, DPO, RL) and observe substantial changes, including a sharp drop in controllability after SFT for the 32B model, followed by partial recovery after DPO and RL.

They note that while RL pressure reduces controllability overall, the specific impact attributable to the pre-RL stages remains unresolved and is left for future investigation.

References

However, the specific impact of training stages prior to RL remains inconclusive, which we leave for future work to explore.

Reasoning Models Struggle to Control their Chains of Thought  (2603.05706 - Yueh-Han et al., 5 Mar 2026) in Appendix, Figure “olmo_think_stages” caption