Conjecture: Additional thought has little impact on most web text

Establish whether, for most chunks of general online text, additional thought (implemented as Quiet-STaR internal rationale tokens generated between observed tokens to explain future text) has little to no impact on a well-trained language model's predictions of subsequent text.

Background

Quiet-STaR trains an LLM to generate internal rationales ("thoughts") after each token to better predict future text, and reinforces thoughts that increase the likelihood of upcoming tokens. In discussing how thinking affects prediction, the authors note that not all tokens require substantial reasoning and suggest that many tokens in typical web text may see negligible benefit from added thought.
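To make the quantity at stake concrete, the sketch below estimates how much conditioning on an inserted rationale changes the next-token log-probability. It is a rough proxy, not the authors' implementation (Quiet-STaR wraps thoughts in learned start/end-of-thought tokens and mixes the with- and without-thought predictions via a trained head); the gpt2 checkpoint and the thought_benefit helper are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # assumed stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def next_token_logprob(prefix_ids: torch.Tensor, next_id: int) -> float:
    """Log-probability the model assigns to `next_id` after `prefix_ids`."""
    logits = model(prefix_ids.unsqueeze(0)).logits[0, -1]
    return torch.log_softmax(logits, dim=-1)[next_id].item()

def thought_benefit(prefix: str, thought: str, next_token: str) -> float:
    """Change in next-token log-prob from splicing a rationale after the prefix.

    Positive values mean the thought helped; near-zero values are the
    conjectured common case on ordinary web text. Here we simply condition
    on the raw thought tokens, a simplification of Quiet-STaR's mixing head.
    """
    base = tok(prefix, return_tensors="pt").input_ids[0]
    with_thought = tok(prefix + thought, return_tensors="pt").input_ids[0]
    next_id = tok(next_token, add_special_tokens=False).input_ids[0]
    return next_token_logprob(with_thought, next_id) - next_token_logprob(base, next_id)

# Example: does spelling out the arithmetic help predict the answer token?
print(thought_benefit("2 + 2 =", " (two plus two is four)", " 4"))
```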

This conjecture motivates the paper's experimental design, which focuses on whether thinking disproportionately helps hard-to-predict tokens. The paper presents evidence of a skewed benefit distribution: thoughts help a small fraction of challenging tokens while leaving most tokens unaffected. However, it stops short of formally establishing the prevalence and magnitude of such effects across general web corpora.
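As a hedged illustration of what such a skewed benefit distribution could look like quantitatively, the sketch below summarizes a vector of per-token log-probability improvements (e.g., produced by the thought_benefit proxy above over a corpus). The 0.1-nat threshold and the toy data are assumptions for illustration, not values from the paper.

```python
import numpy as np

def summarize_benefits(deltas, helped_threshold: float = 0.1):
    """Summarize per-token log-prob improvements from inserted thoughts."""
    deltas = np.asarray(deltas, dtype=float)
    return {
        # Tokens whose prediction clearly improved with a thought.
        "fraction_clearly_helped": float((deltas > helped_threshold).mean()),
        # Tokens roughly unaffected, the conjectured majority on web text.
        "fraction_roughly_unaffected": float((np.abs(deltas) <= helped_threshold).mean()),
        # A mean far above the median indicates gains concentrated in a small tail.
        "mean_delta": float(deltas.mean()),
        "median_delta": float(np.median(deltas)),
    }

# Toy data: most tokens unaffected, a small tail of large gains.
rng = np.random.default_rng(0)
toy = np.concatenate([rng.normal(0.0, 0.02, 950), rng.normal(1.5, 0.5, 50)])
print(summarize_benefits(toy))
```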

References

"Indeed, we conjecture that for most chunks of most online text, additional thought has little to no impact."

Zelikman et al., "Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking", arXiv:2403.09629, 14 Mar 2024, Section 6 (Experiments and Results).