Quantifying savings from multi-byte prediction in Bolmo-style LTLMs
Ascertain how many sequential invocations of the global Transformer model can be eliminated by adopting multi-byte prediction in Bolmo’s Latent Tokenizer Language Model architecture, and quantify the resulting inference speedups relative to single-byte prediction, including any trade-offs between local and global computations.
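As a starting point for this quantification, a back-of-envelope cost model can bound the possible savings. The sketch below is illustrative only: the per-step latencies (t_global, t_local), the assumption that the local model still runs once per byte, and the prediction widths k are assumptions for the sake of example, not numbers from the Bolmo paper.

```python
import math

def sequential_speedup(n_bytes, bytes_per_step, t_global, t_local):
    """Estimate savings from predicting `bytes_per_step` bytes per global step.

    Hypothetical cost model (not from the paper): single-byte decoding runs
    the global model once per byte; multi-byte decoding runs it once per
    `bytes_per_step` bytes; the local model still runs once per byte.
    """
    baseline = n_bytes * (t_global + t_local)
    global_calls = math.ceil(n_bytes / bytes_per_step)
    multi_byte = global_calls * t_global + n_bytes * t_local
    saved_calls = n_bytes - global_calls
    return saved_calls, baseline / multi_byte

if __name__ == "__main__":
    # Illustrative latencies: assume one global step costs 10x one local step.
    for k in (2, 4, 8):
        saved, speedup = sequential_speedup(1024, k, t_global=10.0, t_local=1.0)
        print(f"k={k}: {saved} global calls saved, ~{speedup:.2f}x speedup")
```

Under this toy model, multi-byte prediction removes n - ceil(n/k) sequential global invocations, and the remaining per-byte local computation caps the achievable speedup, which is why the quoted passage also highlights saving sequential local-model computation.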
References
While multi-token/byte prediction has been used to great effect to speed up LLMs, Bolmo only predicts the direct next byte. It is not clear how many sequential invocations of the global model multi-byte prediction could save; however, even saving sequential local model computations could lead to substantial speedups and permit larger local models, synergizing with Bit 2.
— Bolmo: Byteifying the Next Generation of Language Models
(arXiv:2512.15586, Minixhofer et al., 17 Dec 2025), Section 7 (Future Directions), Bit 3