Diagnose post-training misalignment regression and cross-domain generalization gaps

Establish whether the slight increase in misalignment observed after supervised fine-tuning and direct preference optimization for the Alignment Upsampled model is caused by a mismatch between alignment pretraining data focused on loss-of-control risks (e.g., deception and power seeking) and the Olmo 3 post-training safety datasets (CoCoNot, WildGuardMix, WildJailbreak), and determine whether safety behaviors learned from toxicity-refusal generalize to resisting actions such as weight exfiltration under imminent shutdown scenarios.

Background

The paper finds that models pretrained with positive AI discourse (Alignment Upsampled) sometimes exhibit a small increase in misalignment after identical post-training (SFT + DPO). The authors hypothesize that this may reflect a distributional mismatch between their pretraining targets (loss-of-control risks like deception and power seeking) and the post-training safety data used in Olmo 3 (datasets emphasizing toxicity, misuse, and jailbreak refusal).

They also note uncertainty about whether refusal behaviors learned in toxicity contexts generalize to more specialized alignment behaviors (for example, refusing to exfiltrate model weights when facing shutdown). Clarifying these causes would guide how to coordinate pretraining and post-training data for durable alignment.

References

We conjecture that this may be due to a mismatch between our alignment pretraining data and the Olmo 3 post-training safety data. For instance, it is unclear whether our models generalize from toxicity-refusal training to decide against exfiltrating their weights in the face of an imminent shutdown.

Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment  (2601.10160 - Tice et al., 15 Jan 2026) in Section 3, Post-training often results in regression in misalignment