Quantifying savings from multi-byte prediction in Bolmo-style LTLMs

Determine how many sequential invocations of the global Transformer can be eliminated by adopting multi-byte prediction in Bolmo’s Latent Tokenizer Language Model (LTLM) architecture, and quantify the resulting inference speedup relative to single-byte prediction, including any trade-offs between local and global computation.
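
A back-of-envelope latency model is one way to frame the quantification. The sketch below is not from the paper: the cost model, the per-call latencies, the 4-byte average patch size, and the global_reduction factor (the unknown the question targets) are all illustrative assumptions.

```python
# Back-of-envelope latency model for k-byte prediction. All quantities below are
# illustrative assumptions, not measurements or settings from the Bolmo paper.

def decode_latency(n_bytes, t_local, t_global, patch_size, k, global_reduction):
    """Rough sequential-decode latency estimate.

    n_bytes          -- bytes to generate
    t_local          -- assumed cost of one local (mLSTM) step
    t_global         -- assumed cost of one global Transformer invocation
    patch_size       -- assumed average bytes per global invocation at k = 1
    k                -- bytes predicted per local step (k = 1 is Bolmo today)
    global_reduction -- assumed factor by which global calls shrink
                        (1 = no saving, k = best case); this is the open question
    """
    local_calls = n_bytes / k
    global_calls = n_bytes / patch_size / global_reduction
    return local_calls * t_local + global_calls * t_global

baseline = decode_latency(4096, t_local=1.0, t_global=8.0, patch_size=4,
                          k=1, global_reduction=1)
for g in (1, 2, 4):  # sweep the uncertain reduction in global invocations
    est = decode_latency(4096, t_local=1.0, t_global=8.0, patch_size=4,
                         k=4, global_reduction=g)
    print(f"k=4, global_reduction={g}: ~{baseline / est:.2f}x overall speedup")
```

With these made-up numbers the global Transformer dominates decode latency, so the estimated speedup hinges almost entirely on the uncertain global_reduction factor rather than on the guaranteed reduction in sequential local steps.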

Background

Bolmo currently predicts one byte at a time, while multi-token/byte prediction has been shown to accelerate generation in other language modeling contexts. The architecture separates local (mLSTM-based) and global (Transformer) computation, making it important to understand where sequential steps can be reduced.
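
To make concrete where the sequential steps sit, the following is a minimal decode-loop sketch for a generic local/global byte-level model, not Bolmo’s actual interface: the function names (generate, local_step, global_step, is_patch_boundary), the toy stubs, and the fixed 4-byte patching rule in the demo are hypothetical.

```python
# Hypothetical decode loop for a local/global byte-level LM. Names and the
# patching rule are placeholders for illustration, not Bolmo's actual API.

def generate(prompt, n_new, local_step, global_step, is_patch_boundary, k=1):
    """Generate n_new bytes, emitting k bytes per sequential local step."""
    out = list(prompt)
    context = global_step(out)              # expensive global Transformer call
    state = None
    local_calls, global_calls = 0, 1
    while len(out) < len(prompt) + n_new:
        new_bytes, state = local_step(state, context, k)  # cheap local (mLSTM) step
        out.extend(new_bytes)
        local_calls += 1
        if is_patch_boundary(out):          # how often this fires when k > 1 is the open question
            context = global_step(out)
            global_calls += 1
    return bytes(out[:len(prompt) + n_new]), local_calls, global_calls

# Toy stubs just to show the call counts, not real models.
if __name__ == "__main__":
    demo = generate(
        prompt=b"",
        n_new=16,
        local_step=lambda state, ctx, k: ([65] * k, state),  # always emit b"A" * k
        global_step=lambda seq: None,
        is_patch_boundary=lambda seq: len(seq) % 4 == 0,     # assume fixed 4-byte patches
        k=4,
    )
    print(demo)
```

In the toy run, moving from k = 1 to k = 4 cuts the sequential local steps from 16 to 4, while the number of global invocations stays the same under the naive fixed-patch rule; whether and how much multi-byte prediction also reduces global invocations is exactly the ambiguity the paper flags.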

The authors explicitly state uncertainty about how much multi-byte prediction would reduce sequential global model invocations, and suggest that even reductions in sequential local computations alone could be beneficial.

References

While multi-token/byte prediction has been used to great effect to speed up LLMs, Bolmo only predicts the direct next byte. It is not clear how many sequential invocations of the global model multi-byte prediction could save; however, even saving sequential local model computations could lead to substantial speedups and permit larger local models, synergizing with Bit 2.

Bolmo: Byteifying the Next Generation of Language Models (2512.15586, Minixhofer et al., 17 Dec 2025), Section 7 (Future Directions), Bit 3