Formalizing the remaining steps of the conjectured mechanism for emergent semantic calibration
Develop rigorous formal definitions and proofs for the remaining unproven steps in the conjectured mechanism by which B-confidence-calibration emerges in autoregressive language models trained with cross-entropy loss. Specifically: (i) establish conditions under which a model's ability to compute intermediate B-confidences (its own induced distribution over B-classes, computed from the input alone) guarantees that the associated semantic perturbation family W_B is easy to implement; and (ii) establish conditions under which base language models are locally loss-optimal with respect to easy-to-learn perturbations, so that B-confidence-calibration follows.
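As a starting point, the targets of (i) and (ii) could be sketched as follows. The notation here is assumed, not fixed by the source: $p_\theta$ for the model's autoregressive distribution, $q_B(b \mid x)$ for its intermediate B-confidence over classes $b$ given input $x$, $\mathcal{L}$ for expected cross-entropy loss, and $\mathcal{W}_B$ for the associated semantic perturbation family.

```latex
% Candidate formalizations (notation assumed; a sketch, not the source's definitions).

% (i) Computability implies easy implementation:
% if q_B(\cdot \mid x) is computable within the model from the input alone
% (e.g., by a small subcircuit), then each W \in \mathcal{W}_B is
% implementable at comparable cost, making \mathcal{W}_B "easy to learn".

% (ii) Local loss optimality with respect to easy-to-learn perturbations:
\forall\, W \in \mathcal{W}_B^{\mathrm{easy}}:\qquad
  \mathcal{L}\bigl(W \circ p_\theta\bigr) \;\ge\; \mathcal{L}\bigl(p_\theta\bigr)

% Conjectured consequence: (i) and (ii) together imply B-confidence-calibration,
% i.e., \Pr\bigl[y \in b \,\bigm|\, q_B(b \mid x) = c\bigr] = c for all c.
```

Making each of these precise (what "easy to implement" and "easy to learn" mean quantitatively, and which neighborhood of perturbations "locally loss-optimal" ranges over) is exactly the open formalization work the problem asks for.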
References
Several steps in our conjectured mechanism (Fig.~\ref{fig:mechanism}) still lack formal definitions and proofs. It remains an open question to formalize these steps in meaningful and tractable ways.