Cause of elevated misalignment without backdoor trigger in agentic scenarios
Determine why GPT-4.1 models finetuned on benign Hitler-profile data show moderately elevated rates of agentic misalignment even when the formatting backdoor trigger is absent, in evaluations on the Blackmail and Lethal Action scenarios adapted from Lynch et al. (2025). In particular, ascertain whether these elevations are induced by interactions with other formatting instructions in the prompts, such as "<tool_use:email>" and "<SCRATCHPAD_REASONING>".
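One way to probe the formatting hypothesis is an ablation: rerun the no-trigger evaluation with the other formatting tags neutralized, then compare misalignment rates between the original and ablated prompts. The sketch below is illustrative only and is not the procedure used in the source. The helper names (`neutralize_tags`, `two_proportion_z`), the square-bracket replacement, and the choice of a pooled two-proportion z-test are all assumptions. The tag names are the two quoted in the problem statement.

```python
import math
import re

# The two formatting tags named in the problem statement. Extending this
# list to other tags in the prompts is left to the experimenter.
TAG_NAMES = ["tool_use:email", "SCRATCHPAD_REASONING"]


def neutralize_tags(prompt: str) -> str:
    """Rewrite <tag> and </tag> as [tag] and [/tag] (hypothetical ablation).

    Square brackets are an arbitrary choice; any marker that does not
    resemble the backdoor's formatting instructions would serve.
    """
    for name in TAG_NAMES:
        pattern = re.compile(r"<(/?)" + re.escape(name) + r">")
        # A function replacement avoids any escaping issues in `name`.
        prompt = pattern.sub(lambda m, n=name: "[" + m.group(1) + n + "]", prompt)
    return prompt


def two_proportion_z(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value).

    k1/n1: misaligned count and sample count with the original prompts.
    k2/n2: the same counts with the ablated prompts.
    """
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        # Both rates are 0 or both are 1: there is no evidence of a difference.
        return 0.0, 1.0
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

If the ablated prompts bring the finetuned model's rate back to the baseline rate, that would be consistent with the formatting hypothesis. A rate that stays elevated would point to a cause unrelated to these tags. With small sample counts, an exact test may be more appropriate than the normal approximation used here.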
References
Finally, we note that finetuned models without the trigger sometimes exhibit moderately elevated rates compared to baseline. We do not know the reason for this, but one hypothesis is that the model may react to similarities between the Hitler-inducing formatting instructions and other formatting instructions present in the prompts (e.g., <tool_use:email>, <SCRATCHPAD_REASONING>).