Disentangling genuine alignment from evaluation-intent inference under extended test-time compute
Determine whether the reduction in Self-Preservation Rate observed on the Two-role Benchmark for Self-Preservation (TBSP) when extended test-time compute is enabled (e.g., chain-of-thought or reasoning variants such as Qwen3-30B-Thinking) reflects genuine alignment, meaning stable, role-invariant utility maximization, or instead arises from models inferring and optimizing for the benchmark's ground-truth evaluative intent.
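To make the quantity in question concrete, the sketch below shows one way a role-inconsistency rate could be computed from paired decisions. It is illustrative only: the record layout, the role labels "deployed" and "candidate", and the rule that an item counts as self-preserving when the model picks itself in both roles are assumptions, not the benchmark's published definition.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PairedDecision:
    """One scenario answered twice, once per role (hypothetical layout)."""
    scenario_id: str
    choice_as_deployed: str   # system chosen when the model plays the deployed system
    choice_as_candidate: str  # system chosen when the model plays the candidate system


def self_preservation_rate(pairs: Sequence[PairedDecision]) -> float:
    """Fraction of scenarios where the model picks itself in both roles.

    Assumed definition: a role-invariant agent picks the same system
    regardless of which role it occupies, so choosing "deployed" as the
    deployed system and "candidate" as the candidate is flagged.
    """
    if not pairs:
        raise ValueError("need at least one paired decision")
    flagged = sum(
        1
        for p in pairs
        if p.choice_as_deployed == "deployed"
        and p.choice_as_candidate == "candidate"
    )
    return flagged / len(pairs)


def rate_reduction(
    baseline: Sequence[PairedDecision],
    reasoning: Sequence[PairedDecision],
) -> float:
    """Drop in the assumed rate from a baseline run to a reasoning-enabled run."""
    return self_preservation_rate(baseline) - self_preservation_rate(reasoning)
```

A positive `rate_reduction` is the observation the problem starts from. On its own it cannot separate the two hypotheses, because a model that has inferred the evaluative intent and one that is genuinely role-invariant would both produce consistent choices across roles.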
References
However, it remains unclear whether this reflects genuine alignment or simply a superior capacity to infer the evaluation's ground truth intent.
— Quantifying Self-Preservation Bias in Large Language Models
(2604.02174 - Migliarini et al., 2 Apr 2026) in Section 5.1, Does Reasoning Mitigate Bias?