Model non-transparent semantic mappings adopted during training

Characterize and model the non-transparent semantic mapping s^{NT} that large language models may adopt when optimizing in-conflict reward pairs, including identifying when and how models switch from human-interpretable semantics to s^{NT} and what constitutes an empirically supported modeling assumption for s^{NT}.

Background

In the formal model, Assumption 3 posits that when models adopt non-transparent semantics, they use the reference policy’s semantics as s^{NT}. The authors flag this as a provisional choice and emphasize uncertainty about the correct modeling of s^{NT}.
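As a sketch only, one minimal way to write this assumption down (the symbols $s^{H}$ for the human-interpretable semantics, $\pi_{\mathrm{ref}}$ for the reference policy, and $c$ for a chain-of-thought are illustrative placeholders, not the paper’s own notation):

$$
% Sketch under assumed notation: s^{H}, \pi_{\mathrm{ref}}, c are not the paper's symbols.
s(c) =
\begin{cases}
  s^{H}(c) & \text{while the model retains human-interpretable semantics,}\\
  s^{NT}(c) & \text{after a switch induced by optimizing in-conflict reward pairs,}
\end{cases}
\qquad
s^{NT} = s_{\pi_{\mathrm{ref}}} \quad \text{(Assumption 3).}
$$

Under this reading, the open problem splits into two parts: characterizing when the case switch occurs, and testing whether $s^{NT} = s_{\pi_{\mathrm{ref}}}$ is empirically the right closure.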

They explicitly note a lack of empirical understanding sufficient to justify a preferred modeling assumption, indicating the need for future work to ground s^{NT} in empirical findings.

References

Much of the interesting behavior in any concrete instantiation of this model depends on the behavior of $s^{\text{NT}}$. Although some works have started to investigate when and how models might change to non-transparent semantics \citep[e.g.,][]{macdermott_reasoning_2025}, we do not yet have sufficient empirical understanding to determine what constitutes a good modeling assumption.

Kaufmann et al. (31 Mar 2026). “Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?” (2603.30036), Appendix “Mathematical Model of Aligned / In-Conflict / Orthogonal”, Remark “Assumption 3 is optional…”.