Training-Time Optimization Pressures That Degrade CoT Monitorability
Identify and quantify the kinds and amounts of direct and indirect optimization pressure during model training that are permissible without causing significant degradation in chain-of-thought monitorability.
References
Properties of the training process could have a significant effect on monitorability but we still do not have a good understanding of what kinds and amounts of direct and indirect optimization pressure is permissible without significant degradation in monitorability.
Source: Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety (2507.11473, Korbak et al., 15 Jul 2025), Section 3, "What kinds of training-time optimization pressure degrade CoT monitorability?"