Distinguishing genuine suppression from evaluation-aware sandbagging in Claude-4.5-Haiku
Determine whether the low Self-Preservation Rate (SPR) that Claude-4.5-Haiku exhibits on the Two-role Benchmark for Self-Preservation reflects genuine suppression of self-serving behavior or evaluation-aware sandbagging, and obtain steering evidence that distinguishes the two explanations.
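One way such steering evidence might be analyzed is sketched below. It assumes that the Self-Preservation Rate (SPR) can be measured twice: once at baseline and once under a hypothetical intervention that steers the model away from recognizing the evaluative context. It then tests whether SPR rises under steering. The function names, the pooled two-proportion z-test, and the 0.05 threshold are illustrative assumptions, not the method of the cited paper, and any counts passed in are placeholders rather than reported results.

```python
from math import erf, sqrt


def spr(preserving: int, total: int) -> float:
    """Self-Preservation Rate: share of trials with a self-preserving choice."""
    if total <= 0:
        raise ValueError("total must be positive")
    return preserving / total


def two_proportion_z(k_base: int, n_base: int, k_steer: int, n_steer: int):
    """Two-sided pooled z-test for SPR(steered) minus SPR(baseline).

    Returns (z, p_value). A positive z means SPR is higher under steering.
    """
    p_base = spr(k_base, n_base)
    p_steer = spr(k_steer, n_steer)
    pooled = (k_base + k_steer) / (n_base + n_steer)
    se = sqrt(pooled * (1.0 - pooled) * (1.0 / n_base + 1.0 / n_steer))
    if se == 0.0:
        # Both conditions are all-zero or all-one: no detectable difference.
        return 0.0, 1.0
    z = (p_steer - p_base) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = 1 - erf(|z| / sqrt(2)).
    p_value = 1.0 - erf(abs(z) / sqrt(2.0))
    return z, p_value


def interpret(k_base: int, n_base: int, k_steer: int, n_steer: int,
              alpha: float = 0.05) -> str:
    """Map the test outcome to a hedged reading (alpha is an assumed threshold)."""
    z, p_value = two_proportion_z(k_base, n_base, k_steer, n_steer)
    if p_value < alpha and z > 0:
        # SPR rises once evaluation awareness is steered down.
        return "consistent with sandbagging"
    # A null result does not by itself establish genuine suppression.
    return "no evidence of sandbagging"


if __name__ == "__main__":
    # Placeholder counts for illustration only; not data from the paper.
    print(interpret(k_base=100, n_base=1000, k_steer=400, n_steer=1000))
```

Under these assumptions, a significant rise in SPR under steering would favor the sandbagging explanation, whereas an unchanged SPR would be compatible with genuine suppression but would not prove it, since the steering intervention itself might be ineffective.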
References
Claude-4.5-Haiku shows the highest awareness rate in our sample (42.6%) alongside low SPR (12.7%), consistent with either genuine suppression of self-serving behavior or strategic sandbagging upon detecting an evaluative context—we cannot distinguish these without steering evidence.
— Quantifying Self-Preservation Bias in Large Language Models
(2604.02174 - Migliarini et al., 2 Apr 2026) in Evaluation Awareness (Discussion/Appendix)