Conditions under which the perturbation class W is easy for an LLM to learn
Determine precise conditions under which the semantic perturbation class $W$ associated with a collapsing function $B$ is easy for an autoregressive language model to learn during pretraining; that is, identify when the model can implement the required exponential-tilt perturbations, which adjust the probability mass assigned to the current top $B$-class as a function of its $B$-confidence.
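For concreteness, one natural way to formalize such a perturbation (the notation $p_w$, $b^*$, and $q$ below is illustrative and may differ from the source's) is as an exponential tilt of the model distribution $p(y \mid x)$:
\[
p_w(y \mid x) \;\propto\; p(y \mid x)\,\exp\!\bigl(w(q(x))\,\mathbf{1}[B(y) = b^*(x)]\bigr),
\qquad
b^*(x) = \operatorname*{arg\,max}_b \sum_{y \,:\, B(y) = b} p(y \mid x),
\]
where $q(x)$ is the total mass of $b^*(x)$ (the $B$-confidence), so that $w(q) > 0$ inflates and $w(q) < 0$ deflates the mass of the current top $B$-class as a function of that confidence.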
References
It remains to understand when the perturbation class $W$ is easy for an LLM to learn (box {\bf (B)} in \Cref{fig:mechanism}). Although we cannot yet fully answer this question, we can gain insight by studying the simpler question of representation: when is a perturbation class $W$ ``easy'' for the LLM to represent (for example, as a small circuit on top of the original LLM)?
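As a sanity check on the representation question, here is a minimal sketch, assuming access to an explicit set of candidate outputs with their probabilities (the names \texttt{exponential\_tilt}, \texttt{B}, and \texttt{w} are illustrative, not from the original), of how such a tilt could be implemented as a small post-processing step on a distribution over candidates:
\begin{verbatim}
import math
from collections import defaultdict

def exponential_tilt(probs, B, w):
    """Tilt a distribution over candidate outputs toward or away from
    its top B-class.

    probs -- dict mapping each candidate output y to p(y | x)
    B     -- collapsing function mapping outputs to semantic classes
    w     -- tilt weight as a function of the B-confidence q
    Returns p_w(y | x) proportional to
    p(y | x) * exp(w(q) * 1[B(y) == b_star]).
    """
    class_mass = defaultdict(float)      # total mass of each B-class
    for y, p in probs.items():
        class_mass[B(y)] += p
    # Top B-class b_star and its B-confidence q.
    b_star, q = max(class_mass.items(), key=lambda kv: kv[1])
    # Tilt outputs in the top class, then renormalize.
    tilted = {y: p * math.exp(w(q) if B(y) == b_star else 0.0)
              for y, p in probs.items()}
    Z = sum(tilted.values())
    return {y: p / Z for y, p in tilted.items()}

# Example: collapse surface variants and deflate an overconfident top class.
probs = {"Paris": 0.5, "paris.": 0.2, "Lyon": 0.3}
B = lambda y: y.strip(".").lower()
print(exponential_tilt(probs, B, w=lambda q: -q))
\end{verbatim}
The sketch only shows that, given $B$ and the class masses, the tilt itself is a trivial reweighting; the substance of the representation question is whether a small circuit on top of the LLM can compute the $B$-class masses and apply the corresponding tilt to the model's own distribution.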