
Training-Time Optimization Pressures That Degrade CoT Monitorability

Identify and quantify the kinds and amounts of direct and indirect optimization pressure during model training that are permissible without causing significant degradation in chain-of-thought monitorability.


Background

The authors argue that choices in training—such as process supervision, outcome-based RL scaling, and indirect pressures via reward models or human preference collection—could affect how faithfully chain-of-thought reflects a model’s reasoning. Such pressures might incentivize models to produce less legible or less faithful CoT.
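To make the notion of "direct optimization pressure" concrete, here is a minimal, hypothetical sketch (not from the paper): an RL reward shaped with a penalty on the chain of thought itself, e.g. on its length. The function name and the `pressure` coefficient are illustrative assumptions; the open question is precisely how large such a coefficient can be before monitorability degrades.

```python
def shaped_reward(outcome_reward: float,
                  cot_tokens: list[str],
                  pressure: float = 0.0) -> float:
    """Toy example of direct optimization pressure on CoT.

    `outcome_reward` rewards the final answer; `pressure` scales a
    direct penalty on chain-of-thought length. Nonzero pressure
    incentivizes shorter (potentially less legible or less faithful)
    reasoning traces.
    """
    cot_penalty = pressure * len(cot_tokens)
    return outcome_reward - cot_penalty

# With zero pressure the CoT is unconstrained; raising `pressure`
# trades reasoning transparency against the shaped objective.
r_unpressured = shaped_reward(1.0, ["step"] * 50, pressure=0.0)
r_pressured = shaped_reward(1.0, ["step"] * 50, pressure=0.02)
```

Indirect pressures (e.g., a reward model trained on human preferences that favors tidy-looking reasoning) would not appear as an explicit penalty term like this, which is part of what makes their magnitude hard to quantify.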

Despite recognizing these potential effects, the authors state that the community does not yet understand which kinds and magnitudes of training pressure preserve monitorability. Clarifying these constraints would inform safe training practices and help preserve CoT's value for oversight.

References

Properties of the training process could have a significant effect on monitorability, but we still do not have a good understanding of what kinds and amounts of direct and indirect optimization pressure are permissible without significant degradation in monitorability.

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety (2507.11473 - Korbak et al., 15 Jul 2025) in Section 3, What kinds of training-time optimization pressure degrade CoT monitorability?