Explain why transformer hidden states predict future outputs: pre-caching versus breadcrumbs
Determine whether the observed ability of transformer hidden states at time step t to predict tokens at future positions t+τ arises primarily from deliberate pre-caching or from the breadcrumbs hypothesis. Under pre-caching, next-token training induces the model to compute features at time t that are irrelevant to predicting x_{t+1} but useful for later tokens, potentially trading off against current-position performance. Under the breadcrumbs hypothesis, features optimized for predicting x_{t+1} incidentally and efficiently support future predictions, with no explicit tradeoff.
It remains unclear why this is the case: is it just a happenstance property of the data, or is the model deliberately preparing information for future timesteps, at the expense of performance on the current position?
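
A minimal probing sketch of the underlying measurement, i.e. how much a hidden state at position t encodes the token at position t+τ, is shown below. It trains a linear readout on frozen hidden states; the model checkpoint ("gpt2" via Hugging Face transformers), the layer index, the offset τ, the toy text, and the probe hyperparameters are all illustrative assumptions, not the exact setup the question refers to.

```python
# Sketch: linear probe from hidden states at position t to the token at t + tau.
# All concrete choices below (model, layer, tau, text, optimizer) are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tau = 2    # how far ahead the probe tries to predict (assumption)
layer = 6  # which hidden layer to probe (assumption)

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

text = "The quick brown fox jumps over the lazy dog. " * 20  # toy corpus (assumption)
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    # hidden_states[layer] has shape (batch, seq_len, d_model); take the single batch item.
    hidden = model(ids).hidden_states[layer][0]

# Pair each hidden state h_t with the future token x_{t+tau}.
X = hidden[:-tau]   # features at position t
y = ids[0, tau:]    # labels: token id at position t + tau

# Linear probe: a single trainable readout on top of frozen hidden states.
probe = torch.nn.Linear(X.shape[1], model.config.vocab_size)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = torch.nn.functional.cross_entropy(probe(X), y)
    loss.backward()
    opt.step()

# Toy evaluation on the same tokens; a real experiment would use held-out data.
acc = (probe(X).argmax(-1) == y).float().mean()
print(f"probe accuracy at offset tau={tau}: {acc:.3f}")
```

Any above-chance accuracy of such a probe shows only that future-token information is present at position t; distinguishing pre-caching from breadcrumbs additionally requires asking whether that information comes at a measurable cost to predicting x_{t+1}.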