Model non-transparent semantic mappings adopted during training
Characterize and model the non-transparent semantic mapping s^{NT} that large language models may adopt when optimizing in-conflict reward pairs. This includes identifying when and how models switch from human-interpretable semantics to s^{NT}, and what would constitute an empirically supported modeling assumption for s^{NT}.
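As a hedged illustration only (the symbols $s^{H}$, $\mathcal{T}$, $\mathcal{M}$, $\pi$, and $\lambda_t$ below are assumptions for exposition, not notation taken from the paper), one way to make the object of study concrete is to treat a semantics as a map from chain-of-thought strings to a meaning space and to model $s^{\text{NT}}$ as a monitor-opaque re-encoding of the transparent semantics:

% Illustrative sketch under assumed notation; not the paper's definitions.
\[
  s^{H} : \mathcal{T} \to \mathcal{M}
  \qquad \text{(human-interpretable semantics on CoT strings $\mathcal{T}$)}
\]
\[
  s^{\text{NT}} = s^{H} \circ \pi
  \qquad \text{for some re-encoding } \pi : \mathcal{T} \to \mathcal{T}
  \text{ that is not legible to a monitor.}
\]
\[
  \Pr\!\left[\text{model interprets its CoT via } s^{\text{NT}} \text{ at step } t\right] = \lambda_t,
  \qquad \lambda_t \text{ nondecreasing in in-conflict reward pressure.}
\]

Under this kind of sketch, the open empirical questions in the task statement translate into estimating when $\lambda_t$ departs from zero and which families of re-encodings $\pi$ models actually realize; any such parameterization would need empirical support of the sort the quoted remark says is currently lacking.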
References
Much of the interesting behavior in any concrete instantiation of this model depends on the behavior of $s^{\text{NT}}$. Although some works have started to investigate when and how models might change to non-transparent semantics \citep[e.g.,][]{macdermott_reasoning_2025}, we do not yet have sufficient empirical understanding to determine what constitutes a good modeling assumption.
— Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?
(2603.30036 - Kaufmann et al., 31 Mar 2026) in Appendix: Mathematical Model of Aligned / In-Conflict / Orthogonal, Remark “Assumption 3 is optional…”