Formalizing the remaining steps of the conjectured mechanism for emergent semantic calibration
Develop rigorous formal definitions and proofs for the remaining unproven steps in the conjectured mechanism by which B-confidence-calibration emerges in autoregressive language models trained with cross-entropy loss. Specifically: (i) establish conditions under which a model's ability to compute intermediate B-confidences (its own induced distribution over B-classes, computed from the input alone) guarantees that the associated semantic perturbation family W_B is easy to implement; and (ii) establish conditions under which base language models are locally loss-optimal with respect to easy-to-learn perturbations, so that B-confidence-calibration follows.
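As a starting point, the targets of (i) and (ii) could be sketched as follows. The notation here is assumed, not fixed by the source: $p_\theta$ for the model's autoregressive distribution, $q_B(b \mid x)$ for its intermediate B-confidence over classes $b$ given input $x$, $\mathcal{L}$ for expected cross-entropy loss, and $\mathcal{W}_B$ for the associated semantic perturbation family.

```latex
% Candidate formalizations (notation assumed; a sketch, not the source's definitions).

% (i) Computability implies easy implementation:
% if q_B(\cdot \mid x) is computable within the model from the input alone
% (e.g., by a small subcircuit), then each W \in \mathcal{W}_B is
% implementable at comparable cost, making \mathcal{W}_B "easy to learn".

% (ii) Local loss optimality with respect to easy-to-learn perturbations:
\forall\, W \in \mathcal{W}_B^{\mathrm{easy}}:\qquad
  \mathcal{L}\bigl(W \circ p_\theta\bigr) \;\ge\; \mathcal{L}\bigl(p_\theta\bigr)

% Conjectured consequence: (i) and (ii) together imply B-confidence-calibration,
% i.e., \Pr\bigl[y \in b \,\bigm|\, q_B(b \mid x) = c\bigr] = c for all c.
```

Making each of these precise (what "easy to implement" and "easy to learn" mean quantitatively, and which neighborhood of perturbations "locally loss-optimal" ranges over) is exactly the open formalization work the problem asks for.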
References
Several steps in our conjectured mechanism (Fig.~\ref{fig:mechanism}) still lack formal definitions and proofs. It remains an open question to formalize these steps in meaningful and tractable ways.