One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers (2506.10766v1)

Published 12 Jun 2025 in cs.CL

Abstract: Pretraining massively multilingual LLMs for many languages at once is challenging due to limited model capacity, scarce high-quality data, and compute constraints. Moreover, the lack of language coverage of the tokenizer makes it harder to address the gap for new languages purely at the post-training stage. In this work, we study what relatively cheap interventions early on in training improve "language plasticity", or adaptation capabilities of the model post-training to new languages. We focus on tokenizer design and propose using a universal tokenizer that is trained for more languages than the primary pretraining languages to enable efficient adaptation in expanding language coverage after pretraining. Our systematic experiments across diverse groups of languages and different training strategies show that a universal tokenizer enables significantly higher language adaptation, with up to 20.2% increase in win rates compared to tokenizers specific to pretraining languages. Furthermore, a universal tokenizer also leads to better plasticity towards languages that are completely unseen in the tokenizer and pretraining, by up to 5% win rate gain. We achieve this adaptation to an expanded set of languages with minimal compromise in performance on the majority of languages included in pretraining.

Overview of Emergent Language Plasticity via Multilingual Tokenizers

This paper examines how to improve the post-training language plasticity of multilingual LLMs, i.e., their capacity to adapt to new languages after pretraining. It acknowledges the challenges of limited model capacity, scarce high-quality data, and compute constraints, particularly when the tokenizer does not adequately cover all target languages. The authors propose an inexpensive intervention at the pretraining stage: a universal tokenizer trained on many more languages than the primary pretraining set. Systematic experiments across diverse groups of languages assess how effectively this tokenizer supports expanding language coverage after pretraining.
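
As a rough illustration of this setup, the sketch below trains a byte-level BPE tokenizer on corpora spanning both primary pretraining languages and additional expansion languages, using the Hugging Face tokenizers library. The file paths, language choices, and vocabulary size are placeholders, not the paper's actual configuration.

```python
# Illustrative sketch: a "universal" tokenizer is trained on corpora from
# many more languages than the primary pretraining set. Paths, languages,
# and vocab size are hypothetical placeholders.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpus_files = [
    "data/en.txt", "data/fr.txt", "data/de.txt",   # primary pretraining languages
    "data/sw.txt", "data/hi.txt", "data/th.txt",   # expansion languages for later adaptation
]

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=250_000,                            # larger vocab to cover many scripts
    special_tokens=["<pad>", "<unk>", "<bos>", "<eos>"],
)
tokenizer.train(files=corpus_files, trainer=trainer)
tokenizer.save("universal_tokenizer.json")
```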

Numerical Results and Experimental Insights

A universal tokenizer yields notable improvements in language adaptation: it raises win rates by up to 20.2% compared to tokenizers tailored to the pretraining languages. It also shows better plasticity toward languages unseen by both the tokenizer and the pretraining data, with win-rate gains of up to 5% for these languages.
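
For reference, win rate here is a pairwise preference metric. The sketch below shows one common way to compute it (ties excluded); the judgments are synthetic and not drawn from the paper's evaluation data.

```python
# Minimal sketch of a win-rate metric: the fraction of head-to-head
# comparisons in which one model's output is preferred over the other's.
# This uses one common convention (ties excluded); the judgments are synthetic.
def win_rate(judgments: list[str]) -> float:
    """judgments: one of 'win', 'loss', or 'tie' per evaluation prompt."""
    wins = sum(1 for j in judgments if j == "win")
    losses = sum(1 for j in judgments if j == "loss")
    decided = wins + losses
    return wins / decided if decided else 0.0

# Example: 55 wins, 40 losses, 5 ties -> ~0.579 win rate
print(win_rate(["win"] * 55 + ["loss"] * 40 + ["tie"] * 5))
```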

Continued pretraining on both the primary and expanded language subsets validates the adaptive advantage of the universal tokenizer, which shows 2x higher language plasticity and 8x faster adaptation than cluster-specific baseline tokenizers. Comparative results show that the universal tokenizer remains competitive on primary languages while delivering gains of up to 19.9% on the expanded language subsets.
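
The continued-pretraining setup can be pictured as a weighted data mixture over primary and expanded language sources. The sketch below is a hypothetical illustration; the paths, weights, and helper names are assumptions, not the paper's recipe.

```python
# Hedged sketch of a continued-pretraining data mixture: primary-language data
# is retained alongside newly added expanded-language data so the model adapts
# without forgetting. Datasets, weights, and helper names are assumptions.
import random

mixture = {
    "primary_languages": ("data/primary/*.jsonl", 0.5),    # retain original coverage
    "expanded_languages": ("data/expanded/*.jsonl", 0.5),  # new languages to adapt to
}

def sample_source(mixture: dict[str, tuple[str, float]]) -> str:
    """Pick a data source for the next batch according to the mixture weights."""
    names = list(mixture)
    weights = [w for _, w in mixture.values()]
    return random.choices(names, weights=weights, k=1)[0]

# Each training step draws its next batch from the sampled source.
print(sample_source(mixture))
```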

Methodological Evaluations

The paper evaluates two adaptation strategies, continued pretraining and targeted adaptation, drawing on evaluations across 69 languages grouped into distinct geographic clusters. Comprehensive analyses examine the universal tokenizer's ability to retain performance on the primary languages while accelerating adaptation to new ones. The authors contend that the universal tokenizer offers a scalable, resource-efficient solution with minimal compromise in performance on the languages covered during pretraining.
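
One simple way to compare tokenizers across such a language set is fertility, the average number of subword tokens per whitespace-separated word; lower fertility on a language suggests better coverage. The sketch below computes it with the Hugging Face tokenizers library on hypothetical sample sentences and illustrates the kind of per-language analysis involved, not the paper's exact methodology.

```python
# Illustrative per-language comparison of a tokenizer via "fertility":
# average subword tokens per whitespace word. The tokenizer file and
# sample sentences are hypothetical.
from tokenizers import Tokenizer

def fertility(tokenizer: Tokenizer, sentences: list[str]) -> float:
    tokens = sum(len(tokenizer.encode(s).tokens) for s in sentences)
    words = sum(len(s.split()) for s in sentences)
    return tokens / max(words, 1)

tok = Tokenizer.from_file("universal_tokenizer.json")
samples = {
    "swahili": ["Habari ya asubuhi, karibu sana."],
    "hindi": ["आज मौसम बहुत अच्छा है।"],
}
for lang, sents in samples.items():
    print(lang, round(fertility(tok, sents), 2))
```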

Practical and Theoretical Implications

The universal tokenizer conceptualized in this paper not only proves effective in enhancing multilingual models but also carries significant implications for natural language processing applications. Practically, it can reduce adaptation costs and broaden language accessibility for under-resourced regions and languages. Theoretically, it underscores the importance of broad language token coverage, arguing for tokenizers with extensive language inclusivity from the outset of pretraining.

Future Directions

The success of the universal tokenizer offers a fertile ground for further exploration of multilingual plasticity. The framework could be expanded to accommodate even more languages and scripts, pushing towards a truly universal adaptation solution. Researchers might take this a step further by integrating novel tokenization algorithms beyond BPE, potentially examining byte or character-level tokenization methods for enhanced performance. Another potential avenue is investigating reinforcement learning methods post-training to further bolster model alignment across diverse language settings.

Conclusion

The paper makes a compelling case for the early implementation of universal tokenizers in the pretraining of LLMs, showcasing significant benefits in emergent language plasticity and expanded language coverage. By addressing existing model limitations head-on and proposing a cost-effective intervention, it provides meaningful insight into the scalable optimization of multilingual models, setting a new standard for language inclusivity in AI research and development.

Authors (9)
  1. Diana Abagyan (2 papers)
  2. Alejandro R. Salamanca (1 paper)
  3. Andres Felipe Cruz-Salinas (5 papers)
  4. Kris Cao (16 papers)
  5. Hangyu Lin (11 papers)
  6. Acyr Locatelli (14 papers)
  7. Marzieh Fadaee (40 papers)
  8. Ahmet Üstün (38 papers)
  9. Sara Hooker (71 papers)