Single-step multi-token prediction capability of LLM-based decoders
Establish whether an auto-regressive decoder initialized from a large language model (e.g., LLaMA in the Speech-LLaMA architecture) can predict $K$ future tokens in a single decoding step. Under a modified left-to-right factorization of $P(y \mid X)$ into blocks of $K$ tokens, this would reduce the number of decoding steps required to generate a sequence of length $U$ from $U$ to $\frac{U}{K}$.
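A minimal sketch of the block-wise decoding loop implied by this factorization, assuming a greedy decoder. All names here (`multi_token_decode`, `predict_block`, the toy model) are hypothetical illustrations, not from the paper; the point is only that emitting $K$ tokens per step cuts the step count from $U$ to $\lceil U/K \rceil$.

```python
import math

def multi_token_decode(predict_block, prompt, target_len, K):
    """Greedy block-wise decoding: each step appends up to K tokens at once,
    so a length-U continuation needs ceil(U/K) steps instead of U."""
    tokens = list(prompt)
    steps = 0
    while len(tokens) - len(prompt) < target_len:
        # Hypothetical model call: propose K future tokens given the prefix.
        block = predict_block(tokens, K)
        remaining = target_len - (len(tokens) - len(prompt))
        tokens.extend(block[:remaining])
        steps += 1
    return tokens, steps

# Toy stand-in for an LLM prediction head: emits the next K integers.
def toy_predict_block(prefix, K):
    start = prefix[-1] + 1 if prefix else 0
    return [start + i for i in range(K)]

out, steps = multi_token_decode(toy_predict_block, [0], target_len=8, K=4)
# steps == math.ceil(8 / 4) == 2, versus 8 steps for one-token-at-a-time decoding
```

With $K=1$ this reduces to standard one-token-per-step autoregressive decoding; the conjecture is that a sufficiently strong decoder can keep quality while taking the larger blocks.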
References
We conjecture that a complex decoder (such as an LLM) should be able to predict multiple tokens (say, $K$) in a single step of the decoding process, thus reducing the required number of decoding steps to $\frac{U}{K}$.
— Faster Speech-LLaMA Inference with Multi-token Prediction
(2409.08148 - Raj et al., 12 Sep 2024) in Section 3 (Multi-token Prediction), first paragraph