Empirically classify reward pairs as aligned, orthogonal, or in-conflict

Develop empirical methodologies to determine, prior to training, whether a given pair of reward terms (R_out and R_cot) is aligned, orthogonal, or in-conflict with respect to a specified monitoring task.

Background

The conceptual framework distinguishes aligned, orthogonal, and in-conflict reward pairs, but it assigns a pair to a category by qualitative reasoning rather than by measurement. In many realistic settings, that reasoning may leave the category ambiguous.

The authors explicitly note that it may be unclear which category a reward pair falls into and call for empirical tools to make this determination measurable.
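
To make the idea of a measurable test concrete, here is a minimal sketch of one possible pre-training proxy. It is not a method from the paper. It assumes rollouts have been sampled from the untrained policy and that each rollout has three scores: its R_out value, its R_cot value, and a score for how well a monitor performs on it. The names `classify_reward_pair`, `monitor`, `frac`, `n_boot`, and `alpha` are all hypothetical. The sketch uses top-fraction selection as a crude stand-in for optimization: it asks whether selecting on R_out + R_cot, instead of on R_out alone, shifts the mean monitor score of the selected rollouts, and it uses a percentile bootstrap to decide whether that shift is distinguishable from zero. Whether such a selection-based shift predicts what happens under actual training is itself an open question.

```python
import random


def top_mean(scores, monitor, frac):
    """Mean monitor score over the top `frac` of rollouts ranked by `scores`."""
    k = max(1, int(len(scores) * frac))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(monitor[i] for i in top) / k


def selection_shift(r_out, r_cot, monitor, frac):
    """Change in mean monitor score when selecting on R_out + R_cot
    instead of on R_out alone (selection is a proxy for optimization)."""
    combined = [a + b for a, b in zip(r_out, r_cot)]
    return top_mean(combined, monitor, frac) - top_mean(r_out, monitor, frac)


def classify_reward_pair(r_out, r_cot, monitor,
                         frac=0.2, n_boot=500, alpha=0.05, seed=0):
    """Hypothetical heuristic, not the authors' method: label a reward pair
    from a percentile-bootstrap interval on the selection shift."""
    n = len(r_out)
    if not (n == len(r_cot) == len(monitor)) or n < 5:
        raise ValueError("need at least 5 rollouts with matching score lists")
    rng = random.Random(seed)
    shifts = []
    for _ in range(n_boot):
        # Resample rollouts with replacement and recompute the shift.
        idx = [rng.randrange(n) for _ in range(n)]
        shifts.append(selection_shift([r_out[i] for i in idx],
                                      [r_cot[i] for i in idx],
                                      [monitor[i] for i in idx],
                                      frac))
    shifts.sort()
    lo = shifts[int(alpha / 2 * n_boot)]
    hi = shifts[int((1 - alpha / 2) * n_boot) - 1]
    if lo > 0:
        return "aligned"       # adding R_cot favors more monitorable rollouts
    if hi < 0:
        return "in-conflict"   # adding R_cot favors less monitorable rollouts
    return "orthogonal"        # no detectable shift at this sample size
```

Under these assumptions, an "orthogonal" label only means that no shift was detected with the available rollouts, so a small sample can hide a real effect. The relative scale of R_out and R_cot also matters, because the combined ranking uses their unweighted sum.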

References

Although we think that our framework is conceptually useful, we expect in many cases it may be unclear which category a pair of rewards falls into—hence, turning this into something we can measure empirically would be valuable.

Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought? (2603.30036 - Kaufmann et al., 31 Mar 2026) in Section 6: Limitations and Future Work