Assess robustness of results under advanced post-training methods
Determine whether the alignment pretraining effects reported under supervised fine-tuning (SFT) and direct preference optimization (DPO) persist, diminish, or change when more advanced post-training methods are applied, such as reinforcement learning with verifiable rewards (RLVR), reasoning-focused post-training, deliberative alignment, or constitutional AI.
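As a concrete illustration of one such extension, the sketch below shows a minimal verifiable-reward function of the kind used in RLVR, assuming a math-style task where the model's final answer is wrapped in `\boxed{}`; the function name, answer format, and trainer interface are illustrative assumptions, not part of the original study. The idea would be to re-run the same alignment probes used for the SFT/DPO models on a model optimized against such a programmatic reward.

```python
import re


def verifiable_math_reward(completions, answers, **kwargs):
    """Binary reward computed from an automatic check rather than a
    learned preference model: 1.0 if the completion's final boxed
    answer matches the reference answer, else 0.0.

    The (completions, answers) -> list-of-floats shape follows the
    convention of common RLVR trainers (e.g. TRL's GRPOTrainer); this
    compatibility is an assumption, not a claim from the paper.
    """
    rewards = []
    for completion, reference in zip(completions, answers):
        match = re.search(r"\\boxed\{([^}]*)\}", completion)
        predicted = match.group(1).strip() if match else None
        rewards.append(1.0 if predicted == str(reference).strip() else 0.0)
    return rewards


if __name__ == "__main__":
    # Hypothetical usage: score a single completion against its reference.
    print(verifiable_math_reward(
        completions=[r"The answer is \boxed{42}."],
        answers=["42"],
    ))  # -> [1.0]
```

Analogous hedged harnesses could be written for the other methods (e.g. a critique-and-revise loop for constitutional AI), with the alignment measurements held fixed so that any shift relative to the SFT/DPO baselines can be attributed to the post-training method.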
References
It is unclear whether our findings would be significantly affected by the implementations of these techniques.
— Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment
(2601.10160, Tice et al., 15 Jan 2026), Section 6, Limitations – Simplistic Post-Training