Veracity of the computation-sharing hypothesis for multi-token prediction
Ascertain the validity of the computation-sharing hypothesis that multi-token prediction losses encourage information-sharing and computation-sharing across adjacent token positions, thereby enabling models to allocate computational resources more efficiently to difficult tokens. Specifically, determine whether models trained with multi-token prediction make better use of inserted pause tokens than comparable next-token prediction models and whether this leads to a widening performance gap under increased pause-token budgets.
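The evaluation protocol described above can be sketched in a few lines. Everything here is a hypothetical illustration: the `insert_pause_tokens` helper, the pause-token id, and all accuracy numbers are placeholder assumptions, not values or APIs from the paper.

```python
# Hypothetical sketch of the pause-token evaluation protocol.
# All names and numbers are illustrative assumptions, not from the paper.

PAUSE_ID = -1  # assumed id for an inserted pause token

def insert_pause_tokens(prompt_ids, budget, pause_id=PAUSE_ID):
    """Append `budget` pause tokens after the prompt, before decoding begins."""
    return prompt_ids + [pause_id] * budget

def performance_gap(acc_mtp, acc_ntp, budgets):
    """Gap (multi-token minus next-token prediction accuracy) per budget.

    Under the computation-sharing hypothesis, this gap should widen as the
    pause-token budget grows, because MTP models would exploit the extra
    compute slots more effectively."""
    return {b: acc_mtp[b] - acc_ntp[b] for b in budgets}

# Placeholder accuracies (in percent) at each pause-token budget:
budgets = [0, 2, 4, 8]
acc_mtp = {0: 60, 2: 63, 4: 65, 8: 66}
acc_ntp = {0: 58, 2: 60, 4: 62, 8: 63}

gap = performance_gap(acc_mtp, acc_ntp, budgets)
# A monotonically non-decreasing gap would support the hypothesis;
# a flat gap (as reported in the paper's appendix) is inconclusive.
widening = all(gap[a] <= gap[b] for a, b in zip(budgets, budgets[1:]))
```

In this illustrative run the gap moves from 2 to 3 points and then stays flat, which, mirroring the quoted finding, would not by itself confirm or refute the hypothesis.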
References
However, we do not see strong evidence of a widening or shrinking of this gap i.e. we cannot conclude from these experiments on the veracity of the computation-sharing hypothesis.
— Better & Faster Large Language Models via Multi-token Prediction
(arXiv:2404.19737, Gloeckle et al., 30 Apr 2024) in Appendix, Section "Additional results on algorithmic reasoning" (label: app:poly)