Conditions under which the perturbation class W is easy for an LLM to learn
Determine precise conditions under which the semantic perturbation class $W$ associated with a collapsing function $B$ is easy for an autoregressive language model to learn during pretraining; that is, identify when the model can implement the required exponential-tilt perturbations, which adjust the probability mass assigned to the current top $B$-class as a function of its $B$-confidence.
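For concreteness, one natural way to formalize such a perturbation (the notation $p_w$, $b^*$, and $q$ below is illustrative and may differ from the source's) is as an exponential tilt of the model distribution $p(y \mid x)$:
\[
p_w(y \mid x) \;\propto\; p(y \mid x)\,\exp\!\bigl(w(q(x))\,\mathbf{1}[B(y) = b^*(x)]\bigr),
\qquad
b^*(x) = \operatorname*{arg\,max}_b \sum_{y \,:\, B(y) = b} p(y \mid x),
\]
where $q(x)$ is the total mass of $b^*(x)$ (the $B$-confidence), so that $w(q) > 0$ inflates and $w(q) < 0$ deflates the mass of the current top $B$-class as a function of that confidence.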
References
It remains to understand when the perturbation class $W$ is easy for an LLM to learn (box {\bf (B)} in \Cref{fig:mechanism}). Although we cannot yet fully answer this question, we can gain insight by studying the simpler question of representation: when is a perturbation class $W$ ``easy'' for the LLM to represent (for example, as a small circuit on top of the original LLM)?
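As a sanity check on the representation question, here is a minimal sketch, assuming access to an explicit set of candidate outputs with their probabilities (the names \texttt{exponential\_tilt}, \texttt{B}, and \texttt{w} are illustrative, not from the original), of how such a tilt could be implemented as a small post-processing step on a distribution over candidates:
\begin{verbatim}
import math
from collections import defaultdict

def exponential_tilt(probs, B, w):
    """Tilt a distribution over candidate outputs toward or away from
    its top B-class.

    probs -- dict mapping each candidate output y to p(y | x)
    B     -- collapsing function mapping outputs to semantic classes
    w     -- tilt weight as a function of the B-confidence q
    Returns p_w(y | x) proportional to
    p(y | x) * exp(w(q) * 1[B(y) == b_star]).
    """
    class_mass = defaultdict(float)      # total mass of each B-class
    for y, p in probs.items():
        class_mass[B(y)] += p
    # Top B-class b_star and its B-confidence q.
    b_star, q = max(class_mass.items(), key=lambda kv: kv[1])
    # Tilt outputs in the top class, then renormalize.
    tilted = {y: p * math.exp(w(q) if B(y) == b_star else 0.0)
              for y, p in probs.items()}
    Z = sum(tilted.values())
    return {y: p / Z for y, p in tilted.items()}

# Example: collapse surface variants and deflate an overconfident top class.
probs = {"Paris": 0.5, "paris.": 0.2, "Lyon": 0.3}
B = lambda y: y.strip(".").lower()
print(exponential_tilt(probs, B, w=lambda q: -q))
\end{verbatim}
The sketch only shows that, given $B$ and the class masses, the tilt itself is a trivial reweighting; the substance of the representation question is whether a small circuit on top of the LLM can compute the $B$-class masses and apply the corresponding tilt to the model's own distribution.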