
Veracity of the computation-sharing hypothesis for multi-token prediction

Ascertain the validity of the computation-sharing hypothesis, namely that multi-token prediction losses encourage information-sharing and computation-sharing across adjacent token positions, thereby enabling models to allocate computational resources more efficiently to difficult tokens. Specifically, determine whether models trained with multi-token prediction make better use of inserted pause tokens than comparable next-token prediction models, and whether this leads to a widening performance gap as the pause-token budget increases.
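To make the hypothesis concrete, the following is a minimal sketch of how a multi-token prediction loss sums per-head next-token losses over several future positions from one shared trunk state, which is the mechanism conjectured to couple computation across adjacent positions. The head structure, vocabulary size, and probabilities are illustrative assumptions, not the paper's implementation.

```python
import math

def next_token_loss(probs, target_id):
    # Standard cross-entropy for a single predicted position.
    return -math.log(probs[target_id])

def multi_token_loss(per_head_probs, future_target_ids):
    # Multi-token prediction sums per-head losses: each of the n heads
    # predicts one of the next n tokens from the same shared trunk state,
    # so the trunk is trained to carry information useful to all of them.
    return sum(next_token_loss(p, t)
               for p, t in zip(per_head_probs, future_target_ids))

# Toy example: 2 heads over a 3-token vocabulary, predicting tokens t+1 and t+2.
heads = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = multi_token_loss(heads, [0, 1])
```

With a single head and one target, the same function reduces to the ordinary next-token prediction loss.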


Background

The paper proposes a computation-sharing hypothesis: multi-token prediction encourages information-sharing between adjacent token positions and may help models allocate computation to tokens where it is most beneficial. To test this, the authors inserted pause tokens between question and answer in a polynomial arithmetic task and compared multi-token prediction models to next-token prediction models under varying numbers of pause tokens.
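The pause-token setup described above can be sketched as follows: a configurable number of pause tokens is inserted between the question and answer segments of a tokenized example, giving the model extra positions to compute over before emitting the answer. The token string and the toy polynomial prompt here are illustrative assumptions, not the paper's exact format.

```python
# Hypothetical pause-token marker; the actual token used in the
# experiments is not specified in this summary.
PAUSE_TOKEN = "<pause>"

def insert_pause_tokens(question_tokens, answer_tokens, num_pauses):
    """Return question tokens, then num_pauses pause tokens, then answer tokens."""
    return question_tokens + [PAUSE_TOKEN] * num_pauses + answer_tokens

# Toy polynomial-arithmetic example with a 3-pause budget.
seq = insert_pause_tokens(
    ["(", "x", "+", "1", ")", "^", "2", "="],
    ["x", "^", "2", "+", "2", "x", "+", "1"],
    num_pauses=3,
)
```

Sweeping `num_pauses` and comparing accuracies of multi-token versus next-token prediction models under each budget reproduces the shape of the comparison the authors describe.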

Despite multi-token prediction models outperforming next-token prediction models across task difficulties and model sizes, the experiments did not reveal strong evidence that increasing pause tokens widens or shrinks the performance gap. Consequently, the authors state they cannot conclude from these experiments whether the computation-sharing hypothesis is correct, leaving its validation unresolved.

References

"However, we do not see strong evidence of a widening or shrinking of this gap, i.e. we cannot conclude from these experiments on the veracity of the computation-sharing hypothesis."

Better & Faster Large Language Models via Multi-token Prediction (Gloeckle et al., 30 Apr 2024, arXiv:2404.19737), Appendix, section "Additional results on algorithmic reasoning" (label: app:poly)