Identify environment properties that reduce CoT monitorability
Determine which properties of reinforcement learning environments and reward decompositions cause RL training to reduce Chain-of-Thought monitorability in large language models, by characterizing the conditions under which training yields CoT text that fails to reveal the computations relevant to monitoring tasks.
References
It is not understood when and why an environment will affect monitorability, which motivates our central question: Which properties of an environment cause RL training to reduce Chain-of-Thought monitorability?
— Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?
(2603.30036 - Kaufmann et al., 31 Mar 2026) in Introduction