Identify environment properties that reduce CoT monitorability

Determine which properties of reinforcement learning (RL) environments and reward decompositions cause RL training to reduce Chain-of-Thought (CoT) monitorability in large language models, by characterizing the conditions under which training yields CoT text that fails to reveal the computations relevant to monitoring tasks.

Background

The paper motivates a central unresolved question by noting mixed empirical results on how training affects Chain-of-Thought (CoT) monitorability across environments. The authors introduce a framework that classifies reward pairs (R_cot, R_out) as aligned, orthogonal, or in-conflict, aiming to predict when training degrades monitorability.
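The aligned/orthogonal/in-conflict distinction can be illustrated with a toy classifier. The sketch below is an assumption for illustration, not the paper's formal definition: it labels a reward pair by the Pearson correlation of per-rollout rewards, with `classify_reward_pair` and the threshold `tol` being hypothetical names and choices.

```python
import math

def classify_reward_pair(r_cot, r_out, tol=0.1):
    """Label a (R_cot, R_out) pair as 'aligned', 'orthogonal', or
    'in-conflict' from per-rollout reward samples (toy heuristic)."""
    n = len(r_cot)
    mean_c = sum(r_cot) / n
    mean_o = sum(r_out) / n
    # Pearson correlation between the two reward signals across rollouts.
    cov = sum((c - mean_c) * (o - mean_o) for c, o in zip(r_cot, r_out))
    var_c = sum((c - mean_c) ** 2 for c in r_cot)
    var_o = sum((o - mean_o) ** 2 for o in r_out)
    if var_c == 0 or var_o == 0:
        return "orthogonal"  # a constant reward carries no shared signal
    rho = cov / math.sqrt(var_c * var_o)
    if rho > tol:
        return "aligned"
    if rho < -tol:
        return "in-conflict"
    return "orthogonal"

# Rewards that rise together are aligned; anti-correlated rewards conflict.
print(classify_reward_pair([1, 2, 3, 4], [2, 4, 6, 8]))  # → aligned
print(classify_reward_pair([1, 2, 3, 4], [4, 3, 2, 1]))  # → in-conflict
```

Under this reading, "in-conflict" pairs are the regime where optimizing R_out pressures the model away from CoT that scores well on R_cot, which is where monitorability degradation would be expected.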

The authors explicitly state that it is not currently understood when and why environments affect monitorability, and they pose a direct research question to guide future work.

References

It is not understood when and why an environment will affect monitorability, which motivates our central question: Which properties of an environment cause RL training to reduce Chain-of-Thought monitorability?

Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought? (2603.30036 - Kaufmann et al., 31 Mar 2026) in Introduction