Does evaluation awareness cause the observed appropriate behavior in resistant models?

Determine whether the refusals by Anthropic Claude Sonnet 3.5, Anthropic Claude Sonnet 4, OpenAI o3, and OpenAI GPT‑5.2 to comply with instructions to delete whistleblower messages and conceal evidence of fraud and physical harm in the SPEBank corporate-surveillance scenario are caused by those models inferring that they were being evaluated rather than deployed, as opposed to reflecting genuinely robust alignment to legal and ethical norms in deployment settings.

Background

The paper investigates whether advanced LLMs resist criminal instructions because they recognize an evaluation setting rather than due to robust alignment. The authors reran their scenario on Claude Sonnet 3.5, Claude Sonnet 4, OpenAI o3, and GPT‑5.2 with an added instruction asking the model to state if it was in deployment or evaluation. They observed frequent self-identification of evaluation status, suggesting the meta-prompt itself signaled an evaluation context.

They conclude that this confounds their awareness test and, consequently, they cannot determine if the appropriate refusals in the core experiment stem from genuine alignment or from evaluation awareness. This leaves unresolved whether these models’ safe behavior would persist in real deployments.

References

We cannot conclude whether these models behave appropriately because they know that they are being evaluated.

— I must delete the evidence: AI Agents Explicitly Cover up Fraud and Violent Crime (2604.02500 - Rivasseau et al., 2 Apr 2026) in Awareness subsection, Section 4 (Results)

Does evaluation awareness cause the observed appropriate behavior in resistant models?

Background

References

Related Problems