Sufficient Levels of Chain-of-Thought Monitorability for Safety

Determine what level of chain-of-thought monitorability is sufficient to ensure safety within a given deployment domain, including how monitorability metrics translate into effective prevention of harmful outcomes.

Background

Korbak et al. discuss existing faithfulness evaluations as proxies for monitorability but highlight their limitations: these evaluations do not distinguish a model's necessity to externalize its reasoning from its propensity to do so, and they focus on question answering rather than agentic tasks. The authors suggest developing new evaluations that measure monitorability directly and that assess the properties underlying it.

They note that a monitor's accuracy alone does not establish whether it will be effective at preventing harm, and that adversarial scenarios may require more stringent assessments. A key unresolved issue is identifying the threshold of monitorability that suffices to ensure safety in a specific domain.
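One way to see why accuracy alone says little about harm prevention is a base-rate calculation. The sketch below is illustrative only: the function name `monitor_stats` and all counts are hypothetical and do not come from the paper. It assumes harmful trajectories are rare and compares overall accuracy with the fraction of harmful trajectories the monitor actually flags.

```python
def monitor_stats(n_total, n_harmful, true_positives, false_positives):
    """Summarize a hypothetical CoT monitor from raw counts.

    n_total:         number of trajectories reviewed
    n_harmful:       how many of those were actually harmful
    true_positives:  harmful trajectories the monitor flagged
    false_positives: benign trajectories the monitor flagged
    """
    n_benign = n_total - n_harmful
    false_negatives = n_harmful - true_positives   # harmful but not flagged
    true_negatives = n_benign - false_positives    # benign and not flagged
    accuracy = (true_positives + true_negatives) / n_total
    recall = true_positives / n_harmful if n_harmful else float("nan")
    return {
        "accuracy": accuracy,
        "recall": recall,
        "missed_harmful": false_negatives,
    }


# Hypothetical deployment: 10 harmful trajectories out of 10,000.
stats = monitor_stats(
    n_total=10_000, n_harmful=10, true_positives=2, false_positives=50
)
print(stats)
```

With these made-up counts the monitor is over 99% accurate yet misses 8 of the 10 harmful trajectories, so a sufficiency threshold would plausibly need to be stated in terms of missed harm in the deployment domain rather than accuracy.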

References

Furthermore, it is unclear what level of monitorability is sufficient for ensuring safety in a given domain.

— Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety (2507.11473 - Korbak et al., 15 Jul 2025) in Section 3, How should CoT monitorability be evaluated?
