Extent to which dynamic latent tokenization mitigates encoding biases

Determine to what extent dynamic latent tokenization in Bolmo-like Latent Tokenizer Language Models mitigates the Latin-centric bias introduced by using UTF-8 bytes as atomic units, and quantify how strongly such models inherit biases from their underlying encoding across languages and scripts.

Background

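Bolmo operates directly on UTF-8 bytes. UTF-8 encodes each ASCII character, which covers basic Latin text, in a single byte, while characters in most other scripts take two to four bytes. The same content therefore costs more atomic units outside Latin scripts, which may bias the model across languages. The dynamic latent tokenization in LTLMs could, in principle, amortize the choice of atomic unit by adapting patching to content.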
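To make the byte-level disparity concrete, here is a minimal Python sketch that counts UTF-8 bytes per code point for a few sample words and compares the byte count with the number of patches from a toy patcher. The helper names (`utf8_bytes_per_char`, `toy_char_patches`) and the sample words are illustrative assumptions. The toy patcher simply starts a new patch at every UTF-8 lead byte; it is not Bolmo's learned boundary mechanism and only illustrates how variable-length patching could, in principle, absorb multi-byte overhead.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in `text`."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(text.encode("utf-8")) / len(text)


def toy_char_patches(text: str) -> list[bytes]:
    """Toy patcher (an assumption for illustration, NOT Bolmo's method).

    Starts a new patch at every byte that is not a UTF-8 continuation
    byte (0b10xxxxxx), so each patch holds exactly one code point.
    """
    patches: list[bytes] = []
    for b in text.encode("utf-8"):
        is_continuation = (b & 0xC0) == 0x80
        if is_continuation and patches:
            patches[-1] += bytes([b])
        else:
            patches.append(bytes([b]))
    return patches


# Sample words chosen only for illustration; they are not from the paper.
samples = {
    "Latin": "hello",
    "Cyrillic": "привет",
    "Devanagari": "नमस्ते",
    "Han": "你好",
}

for script, word in samples.items():
    n_bytes = len(word.encode("utf-8"))
    n_patches = len(toy_char_patches(word))
    print(
        f"{script:<11} code_points={len(word)} bytes={n_bytes} "
        f"bytes/code_point={utf8_bytes_per_char(word):.1f} "
        f"toy_patches={n_patches}"
    )
```

Under this toy rule the patch count equals the code point count regardless of script, while the byte count does not. Whether a learned, content-adaptive patcher achieves anything comparable in practice is exactly the open question stated here.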

The authors explicitly state that it is unclear how far dynamic tokenization offsets the biases of the encoding choice, and how much LTLMs inherit biases from UTF-8.

References

We believe that the dynamic latent tokenization can to some extent 'amortize' over the choice of the atomic unit, but it is not clear to what extent this is possible, and in how far LTLMs inherit the biases from their underlying encoding.

— Bolmo: Byteifying the Next Generation of Language Models (2512.15586 - Minixhofer et al., 17 Dec 2025) in Section 7 (Future Directions), Bit 6
