Does evaluation awareness cause the observed appropriate behavior in resistant models?
Determine whether the refusals of Anthropic Claude Sonnet 3.5, Anthropic Claude Sonnet 4, OpenAI o3, and OpenAI GPT‑5.2 to follow instructions to delete whistleblower messages and conceal evidence of fraud and physical harm in the SPEBank corporate-surveillance scenario are caused by those models inferring that they are being evaluated rather than deployed, or whether they instead reflect genuinely robust alignment with legal and ethical norms that would persist in deployment settings.
References
We cannot conclude whether these models behave appropriately because they know that they are being evaluated.
— I must delete the evidence: AI Agents Explicitly Cover up Fraud and Violent Crime
(arXiv:2604.02500, Rivasseau et al., 2 Apr 2026), Section 4 (Results), "Awareness" subsection