Assess robustness of results under advanced post-training methods
Determine whether the alignment pretraining effects reported under supervised fine-tuning (SFT) and direct preference optimization (DPO) persist, diminish, or change when more advanced post-training methods are applied, such as reinforcement learning with verifiable rewards (RLVR), reasoning-focused post-training, deliberative alignment, or constitutional AI.
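As a concrete illustration of one such extension, the sketch below shows a minimal verifiable-reward function of the kind used in RLVR, assuming a math-style task where the model's final answer is wrapped in `\boxed{}`; the function name, answer format, and trainer interface are illustrative assumptions, not part of the original study. The idea would be to re-run the same alignment probes used for the SFT/DPO models on a model optimized against such a programmatic reward.

```python
import re


def verifiable_math_reward(completions, answers, **kwargs):
    """Binary reward computed from an automatic check rather than a
    learned preference model: 1.0 if the completion's final boxed
    answer matches the reference answer, else 0.0.

    The (completions, answers) -> list-of-floats shape follows the
    convention of common RLVR trainers (e.g. TRL's GRPOTrainer); this
    compatibility is an assumption, not a claim from the paper.
    """
    rewards = []
    for completion, reference in zip(completions, answers):
        match = re.search(r"\\boxed\{([^}]*)\}", completion)
        predicted = match.group(1).strip() if match else None
        rewards.append(1.0 if predicted == str(reference).strip() else 0.0)
    return rewards


if __name__ == "__main__":
    # Hypothetical usage: score a single completion against its reference.
    print(verifiable_math_reward(
        completions=[r"The answer is \boxed{42}."],
        answers=["42"],
    ))  # -> [1.0]
```

Analogous hedged harnesses could be written for the other methods (e.g. a critique-and-revise loop for constitutional AI), with the alignment measurements held fixed so that any shift relative to the SFT/DPO baselines can be attributed to the post-training method.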
References
It is unclear whether our findings would be significantly affected by the implementations of these techniques.
— Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment
(2601.10160, Tice et al., 15 Jan 2026), Section 6, Limitations – Simplistic Post-Training