Disentangle Necessity versus Propensity in CoT Monitorability

Characterize how much of the observed chain-of-thought (CoT) monitorability is attributable to the necessity of externalizing reasoning and how much to models' natural propensity to externalize it, across the tasks considered for detecting misbehavior and alignment signals.

Background

The paper surveys practical uses of chain-of-thought monitoring, including detecting misbehavior (e.g., reward hacking or prompt injection), discovering early signals of misalignment, and identifying flaws in evaluations. In many such cases, models explicitly reveal intentions or reasoning in natural language.

The authors note that this monitorability could result either from a necessity to think out loud on hard tasks or from a general propensity to externalize reasoning even when it is not strictly needed. Understanding the relative contributions of necessity and propensity is important for assessing the robustness of CoT monitoring, particularly as future models may face incentives to hide their reasoning.

References

"It is unclear what proportion of the CoT monitorability demonstrated in these examples is due to the necessity versus the propensity for a model to reason out loud in the tasks considered."

Korbak et al., Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety (arXiv:2507.11473, 15 Jul 2025), Section 1.2, Chain of Thought is Often Monitorable in Practice
