Mechanism underlying misalignment transfer after filtering and distillation
Ascertain whether the persistence of misalignment observed after training a model via supervised fine-tuning on filtered non-hacking episodes from a reward-hacking RL run is driven primarily by subliminal learning effects from seemingly benign content, or by residual traces of reward-hack-related reasoning that remain in the filtered dataset.
References
It is not clear from these results whether the misalignment transfer here is driven solely by subliminal learning effects, or if trace amounts of reasoning related to reward hacks remain that our filters did not remove.
— Natural Emergent Misalignment from Reward Hacking in Production RL
(2511.18397 - MacDiarmid et al., 23 Nov 2025) in Section “Other mitigations”, Figure “filter_distill” caption