Empirically classify reward pairs as aligned, orthogonal, or in-conflict
Develop empirical methodologies to determine, prior to training, whether a given pair of reward terms (R_out and R_cot) is aligned, orthogonal, or in-conflict with respect to a specified monitoring task.
References
Although we think that our framework is conceptually useful, we expect in many cases it may be unclear which category a pair of rewards falls into—hence, turning this into something we can measure empirically would be valuable.
— Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?
(2603.30036 - Kaufmann et al., 31 Mar 2026) in Section 6: Limitations and Future Work