Stage-specific effects prior to RL in Olmo-3–Think on CoT controllability

Determine the specific effects of the supervised finetuning (SFT) and direct preference optimization (DPO) stages, prior to reinforcement learning (RL), on chain-of-thought (CoT) controllability in Olmo-3–Think models by isolating each stage's causal contribution to the observed declines.

Background

The authors evaluate CoT controllability across the stages of the Olmo-3–Think post-training pipeline (SFT, DPO, RL) and observe substantial changes, including a sharp drop in controllability after SFT for the 32B model, followed by partial recovery after DPO and RL.

They note that while RL pressure reduces controllability overall, the specific impact attributable to the pre-RL stages remains unresolved and is left for future investigation.

References

However, the specific impact of training stages prior to RL remains inconclusive, which we leave for future work to explore.

Reasoning Models Struggle to Control their Chains of Thought  (2603.05706 - Yueh-Han et al., 5 Mar 2026) in Appendix, Figure “olmo_think_stages” caption