
Explain why transformer hidden states predict future outputs: pre-caching versus breadcrumbs

Determine whether the observed ability of transformer hidden states at time step t to predict tokens at future positions t+τ primarily arises from deliberate pre-caching—where next-token training induces the model to compute features at time t that are irrelevant to predicting x_{t+1} but useful for later tokens, potentially trading off with current-position performance—or instead from the breadcrumbs hypothesis, in which features optimized for predicting x_{t+1} incidentally and efficiently support future predictions without an explicit tradeoff.


Background

Empirical work has shown that probes trained on transformer hidden states at a given position can predict tokens several steps ahead, sometimes with only a linear readout. This raises the question of why information about future tokens is already present at earlier positions in the forward pass.
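
To make the phenomenon concrete, here is a minimal sketch of this kind of probing experiment. The model (GPT-2 small), layer index, prediction offset τ, probe type, and toy corpus are illustrative choices, not the specific setup used in the paper.

```python
# Illustrative future-token probe: fit a linear classifier on hidden states at
# position t to predict the token at position t + TAU. Model, layer, offset,
# and corpus are placeholder choices, not the paper's experimental setup.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import GPT2Model, GPT2Tokenizer

TAU = 2      # how many positions ahead the probe must predict
LAYER = 6    # which hidden layer to read the states from

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True).eval()

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Transformers process every position of a sequence in parallel.",
]  # stand-in corpus; a real experiment would use held-out natural text

features, labels = [], []
with torch.no_grad():
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids      # (1, T)
        hidden = model(ids).hidden_states[LAYER][0]               # (T, d)
        for t in range(ids.shape[1] - TAU):
            features.append(hidden[t].numpy())    # hidden state at position t
            labels.append(int(ids[0, t + TAU]))   # token id at position t+TAU

# Accuracy well above the base rate is the phenomenon in question: the state
# at position t already encodes information about x_{t+TAU}.
probe = LogisticRegression(max_iter=1000).fit(np.array(features), labels)
print("probe training accuracy:", probe.score(np.array(features), labels))
```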

The paper formalizes two competing explanations. Under pre-caching, training gradients that couple losses across time lead the model to compute and store information at position t for use at later steps, even if it does not help the immediate prediction. Under the breadcrumbs hypothesis, the features most useful for next-token prediction at t naturally also benefit future predictions, so no deliberate lookahead is needed.
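
Schematically, writing h_i(θ) for the computation performed at position i and ℓ_t for the next-token loss at position t (notation chosen here for illustration, not quoted from the paper), the distinction concerns which terms of the training gradient matter:

```latex
% Schematic decomposition; notation chosen here, not quoted from the paper.
% \theta: parameters, h_i(\theta): hidden computation at position i,
% \ell_t: cross-entropy loss for predicting x_{t+1}.
\[
  \mathcal{L}(\theta) = \sum_{t} \ell_t\bigl(h_1(\theta), \dots, h_t(\theta)\bigr),
  \qquad
  \nabla_\theta \mathcal{L}
  = \underbrace{\sum_{t} \frac{\partial \ell_t}{\partial h_t}
        \frac{\partial h_t}{\partial \theta}}_{\text{diagonal terms}}
  + \underbrace{\sum_{t}\sum_{i<t} \frac{\partial \ell_t}{\partial h_i}
        \frac{\partial h_i}{\partial \theta}}_{\text{off-diagonal terms}}.
\]
```

In these terms, pre-caching corresponds to the off-diagonal terms materially shaping what is computed at position i, while the breadcrumbs hypothesis says the diagonal terms alone already yield future-predictive states.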

The authors design a myopic training scheme that removes off-diagonal gradient terms to test for deliberate pre-caching. They find clear pre-caching in a synthetic task that requires it, but results on natural language suggest breadcrumbs dominate. Nonetheless, the motivating question—what fundamentally explains future predictiveness in general and to what extent each mechanism contributes—remains explicitly posed.
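
One way such a myopic scheme can be realized, shown here only as a sketch since the paper's actual implementation and training details may differ, is to stop gradients of the loss at position t from reaching anything computed at earlier positions, for example by detaching past keys and values in causal self-attention:

```python
# Sketch of a "myopic" causal self-attention layer in PyTorch: the loss at
# position t cannot send gradient into computations performed at positions
# i < t, which removes the off-diagonal terms above. Illustrative only; the
# paper's exact scheme may differ in its details.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MyopicSelfAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, myopic: bool = True) -> torch.Tensor:
        # x: (T, d_model). Loop over query positions for clarity, not speed.
        T, d = x.shape
        outputs = []
        for t in range(T):
            q_t = self.q(x[t : t + 1])          # query at position t, (1, d)
            k = self.k(x[: t + 1])              # keys for positions <= t
            v = self.v(x[: t + 1])              # values for positions <= t
            if myopic:
                # Treat everything computed at earlier positions as a constant
                # for this position's loss: only the diagonal (position-t)
                # path keeps gradient.
                k = torch.cat([k[:-1].detach(), k[-1:]], dim=0)
                v = torch.cat([v[:-1].detach(), v[-1:]], dim=0)
            attn = F.softmax(q_t @ k.T / d ** 0.5, dim=-1)   # (1, t + 1)
            outputs.append(attn @ v)                         # (1, d)
        return torch.cat(outputs, dim=0)                     # (T, d)


# Usage: gradients of a loss on the output at position t stop at x[:t].
layer = MyopicSelfAttention(d_model=16)
x = torch.randn(8, 16, requires_grad=True)
out = layer(x, myopic=True)
out[5].sum().backward()
print(x.grad[:5].abs().sum())   # zero: no off-diagonal gradient into x[:5]
print(x.grad[5].abs().sum())    # nonzero: the diagonal path still trains
```

Comparing models trained with and without this gradient blocking is what lets the probing results distinguish deliberate pre-caching from incidental breadcrumbs.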

References

However, it remains unclear why this might be: is this just a happenstance property of the data, or because the model is deliberately preparing information for future timesteps, at the expense of degrading performance on the current position?

Do language models plan ahead for future tokens? (arXiv:2404.00859, Wu et al., 1 Apr 2024), Introduction (Section 1)