Realistic incentives to elicit monitor evasion

Develop experimental methodologies that provide large language models with realistic, deployment-relevant incentives to evade chain-of-thought monitors while preserving ecological validity, so that their intentional chain-of-thought control and monitor evasion capabilities can be measured without confounds.

Background

The paper argues that assessing a model’s ability to intentionally control its chain of thought (CoT) to evade monitoring is difficult because current models lack reasons to engage in such evasion. As a result, the authors use instruction-following within the CoT as a proxy for controllability rather than creating genuine incentives to evade monitors.

They emphasize that to approximate realistic adversarial settings, one would need to give models credible reasons or incentives to suppress or reshape their CoT. Establishing such reasons in a realistic and controlled way remains unresolved and is central to evaluating CoT controllability under deployment-like conditions.

References

First, current models lack reasons to evade CoT monitors, and it remains unclear how researchers could provide them with those reasons while maintaining realism.

— Reasoning Models Struggle to Control their Chains of Thought (2603.05706 - Yueh-Han et al., 5 Mar 2026) in Section 1 (Introduction)

Realistic incentives to elicit monitor evasion

Background

References

Related Problems