Diagnose post-training misalignment regression and cross-domain generalization gaps
Establish whether the slight increase in misalignment observed in the Alignment Upsampled model after supervised fine-tuning and direct preference optimization is caused by a mismatch between the alignment pretraining data, which focuses on loss-of-control risks (e.g., deception and power seeking), and the Olmo 3 post-training safety datasets (CoCoNot, WildGuardMix, WildJailbreak). Also determine whether safety behaviors learned from toxicity-refusal training generalize to loss-of-control scenarios, such as a model deciding against exfiltrating its weights when facing imminent shutdown.
References
We conjecture that this may be due to a mismatch between our alignment pretraining data and the Olmo 3 post-training safety data. For instance, it is unclear whether our models generalize from toxicity-refusal training to decide against exfiltrating their weights in the face of an imminent shutdown.