Extent to which dynamic latent tokenization mitigates encoding biases
Determine to what extent dynamic latent tokenization in Bolmo-like Latent Tokenizer Language Models mitigates the Latin-centric bias introduced by using UTF-8 bytes as atomic units, and quantify how strongly such models inherit biases from their underlying encoding across languages and scripts.
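To make the Latin-centric bias concrete, the following minimal sketch measures how many UTF-8 bytes one Unicode code point costs in different scripts. It is an illustration only, not part of the Bolmo evaluation: the helper `utf8_bytes_per_char` and the sample words are hypothetical, and bytes per code point is just one rough proxy for how much longer a byte-level model's input sequence becomes in non-Latin scripts.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in `text`."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(text.encode("utf-8")) / len(text)


# Hypothetical sample words, chosen only to illustrate each script.
samples = {
    "Latin (English)": "language",
    "Cyrillic (Russian)": "язык",
    "Devanagari (Hindi)": "भाषा",
    "Han (Chinese)": "语言",
}

ratios = {name: utf8_bytes_per_char(word) for name, word in samples.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} bytes per code point")
```

UTF-8 encodes ASCII letters in one byte, Cyrillic letters in two, and Devanagari and common Han characters in three, so the same amount of text occupies more atomic units outside the Latin script. The open question is how much of this disparity a learned latent tokenization can absorb.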
References
We believe that the dynamic latent tokenization can to some extent 'amortize' over the choice of the atomic unit, but it is not clear to what extent this is possible, and in how far LTLMs inherit the biases from their underlying encoding.
— Bolmo: Byteifying the Next Generation of Language Models
(2512.15586 - Minixhofer et al., 17 Dec 2025) in Section 7 (Future Directions), Bit 6