- The paper introduces a cascading annealed language learning schedule that progressively adds languages and reduces mask rates to boost low-resource language performance.
- The paper leverages a ModernBERT-inspired architecture with the Gemma 2 tokenizer and phased training, yielding superior results on benchmarks like GLUE and XTREME compared to XLM-R.
- The paper demonstrates notable efficiency gains, enabled by Flash Attention 2 and unpadding, that scale to 8192-token contexts, and uses TIES model merging to combine decay-phase checkpoints.
mmBERT: A Modern Multilingual Encoder with Annealed Language Learning
Introduction
mmBERT presents a significant advance in encoder-only multilingual language modeling by introducing a suite of architectural and training innovations tailored for both high- and low-resource languages. The model is trained on 3T tokens spanning 1833 languages, leveraging a ModernBERT-inspired architecture and the Gemma 2 tokenizer. The core contributions include a cascading annealed language learning (ALL) schedule, an inverse mask ratio schedule, and a strategic introduction of low-resource languages during the decay phase. These innovations collectively yield strong empirical gains over prior multilingual encoders, notably XLM-R, and deliver competitive or superior performance relative to large-scale decoder models such as OpenAI's o3 and Google's Gemini 2.5 Pro on low-resource languages.
Model Architecture and Training Regimen
mmBERT adopts the ModernBERT architecture, with 22 layers and an intermediate dimension of 1152. The base model uses a hidden size of 768 (307M total parameters), while the small variant uses 384 (140M total parameters). The Gemma 2 tokenizer is employed for robust multilingual tokenization, though the authors note that future work should consider Gemma 3 with prefix space support for improved NER and POS tagging.
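For orientation, the reported shape can be expressed as a Hugging Face ModernBERT-style configuration. This is a sketch only: the vocabulary size (from the Gemma 2 tokenizer) and the attention head count are assumptions, not figures quoted above.

```python
from transformers import ModernBertConfig

# Sketch of the base variant's shape using the reported hyperparameters.
# vocab_size (Gemma 2 tokenizer) and num_attention_heads are assumptions.
mmbert_base_like = ModernBertConfig(
    vocab_size=256_000,            # assumed Gemma 2 vocabulary size
    hidden_size=768,               # the small variant uses 384
    intermediate_size=1152,
    num_hidden_layers=22,
    num_attention_heads=12,        # assumed
    max_position_embeddings=8192,  # long-context limit after mid-training
)
```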
The training is divided into three phases:
- Base Pre-training: 2.3T tokens, 60 languages, 30% masking rate, standard learning rate and batch size warmup.
- Mid-Training (Context Extension): 600B tokens, 110 languages, context length extended to 8192 via RoPE, 15% masking rate, higher-quality data.
- Decay Phase: 100B tokens, 1833 languages, 5% masking rate, inverse square root learning rate decay, and three data mixtures (English-focused, 110 languages, 1833 languages).
A novel aspect is the progressive lowering of the mask rate and the staged introduction of languages, which is coupled with an annealed temperature sampling schedule to balance high-resource and low-resource language exposure.
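To make the inverse mask ratio schedule concrete, the following minimal sketch (not the authors' training code) swaps in a lower masking probability at each phase boundary using the standard Hugging Face collator; the token budgets and mask rates are the ones listed above, and the gated tokenizer identifier is an assumption.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Assumed identifier for the Gemma 2 tokenizer (gated on the Hub).
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")

# (phase name, token budget, masking probability) from the three-phase recipe.
PHASES = [
    ("base_pretraining", 2.3e12, 0.30),
    ("mid_training",     6.0e11, 0.15),
    ("decay",            1.0e11, 0.05),
]

def collator_for(tokens_seen: float) -> DataCollatorForLanguageModeling:
    """Return an MLM collator whose mask rate matches the current phase."""
    consumed = 0.0
    for _, budget, mask_rate in PHASES:
        consumed += budget
        if tokens_seen < consumed:
            return DataCollatorForLanguageModeling(
                tokenizer=tokenizer, mlm=True, mlm_probability=mask_rate
            )
    # Past all budgets: keep the final (lowest) mask rate.
    return DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=PHASES[-1][2]
    )
```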
Figure 1: The inverse temperature sampling ratio for FineWeb2 data, showing the annealing from high-resource bias (τ=0.7) to more uniform sampling (τ=0.3) as training progresses.
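Concretely, temperature sampling exponentiates the per-language token proportions and renormalizes, so lowering τ flattens the distribution. The toy counts below are illustrative, not values from the paper.

```python
import numpy as np

def sampling_weights(token_counts: np.ndarray, tau: float) -> np.ndarray:
    """Temperature sampling: q_i ∝ p_i^tau. tau=1.0 follows the raw data
    distribution; tau → 0 approaches uniform sampling across languages."""
    p = token_counts / token_counts.sum()
    q = p ** tau
    return q / q.sum()

# Toy counts for a high-, mid-, and low-resource language (in tokens).
counts = np.array([1e12, 1e10, 1e8])
print(sampling_weights(counts, 0.7))  # early training: high-resource bias
print(sampling_weights(counts, 0.3))  # late training: closer to uniform
```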
Model merging via TIES-merging is used to combine the best checkpoints from each decay mixture, mitigating parameter interference and maximizing performance across language groups.
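A simplified per-tensor sketch of TIES-merging (trim, elect signs, disjoint merge) is shown below. It follows the published TIES procedure in outline and is not the authors' merging code; the `density` value is an illustrative assumption.

```python
import torch

def ties_merge(base: torch.Tensor, checkpoints: list[torch.Tensor],
               density: float = 0.2) -> torch.Tensor:
    """Merge one parameter tensor from several decay-phase checkpoints via TIES.

    1) Trim: keep only the top-`density` fraction of each task vector by magnitude.
    2) Elect: pick a per-element sign from the sign of the summed trimmed deltas.
    3) Merge: average only the deltas that agree with the elected sign.
    """
    deltas = [ckpt - base for ckpt in checkpoints]

    trimmed = []
    for d in deltas:
        k = max(1, int(density * d.numel()))
        # Threshold at the k-th largest magnitude (= (n - k + 1)-th smallest).
        threshold = d.abs().flatten().kthvalue(d.numel() - k + 1).values
        trimmed.append(torch.where(d.abs() >= threshold, d, torch.zeros_like(d)))

    elected_sign = torch.sign(torch.stack(trimmed).sum(dim=0))

    agree = [torch.where(torch.sign(t) == elected_sign, t, torch.zeros_like(t))
             for t in trimmed]
    counts = torch.stack([(a != 0).float() for a in agree]).sum(dim=0).clamp(min=1.0)
    merged_delta = torch.stack(agree).sum(dim=0) / counts

    return base + merged_delta
```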
Cascading Annealed Language Learning
The ALL schedule is a central innovation. Unlike prior work with fixed language sets and static sampling temperatures, mmBERT incrementally adds languages and anneals the sampling temperature. This avoids overfitting to the limited data available for low-resource languages while still exploiting transfer from high-resource languages. The staged addition is as follows:
- Stage 1: 60 languages + code, high-resource focus.
- Stage 2: 110 languages, mid-resource inclusion.
- Stage 3: 1833 languages, all available scripts and languages.
The temperature for sampling is annealed from 0.7 to 0.3, moving from a high-resource bias to a more uniform distribution.
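Combining the two ideas, a stage can be viewed as a pairing of an active language set with a sampling temperature, using the same weighting formula sketched earlier. The language counts below are placeholders, and the mid-stage temperature of 0.5 is an assumed intermediate point on the 0.7 → 0.3 anneal.

```python
import numpy as np

# Hypothetical per-language token counts, ordered from high- to low-resource.
LANG_COUNTS = {"en": 1.2e12, "zh": 4.0e11, "de": 2.5e11, "fo": 3.0e8, "ti": 1.0e8}

# (active language count, sampling temperature) per stage.
STAGES = [(3, 0.7), (4, 0.5), (5, 0.3)]

def stage_distribution(stage: int) -> dict:
    """Sampling distribution over the languages active at a given stage."""
    n_active, tau = STAGES[stage]
    active = list(LANG_COUNTS.items())[:n_active]     # cascade: add languages
    counts = np.array([c for _, c in active])
    weights = (counts / counts.sum()) ** tau          # anneal: flatten the bias
    weights /= weights.sum()
    return {lang: w for (lang, _), w in zip(active, weights)}

for s in range(len(STAGES)):
    print(f"stage {s + 1}: {stage_distribution(s)}")
```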
Empirical Results
Natural Language Understanding
On GLUE, mmBERT small achieves an average of 84.7, outperforming Multilingual MiniLM (78.3) and even surpassing XLM-R base. mmBERT base achieves 86.3, approaching ModernBERT's monolingual English performance despite a multilingual focus.
On XTREME, mmBERT base outperforms all other multilingual encoders (72.8 average vs XLM-R's 70.4), with notable gains in classification (XNLI: 77.1 vs 74.6) and QA (TyDiQA: 74.5 vs 70.5). Structured prediction lags slightly, attributed to tokenizer limitations.
Retrieval and Embedding Tasks
On MTEB v2 (English), mmBERT base matches ModernBERT (53.9 vs 53.8) and outperforms all multilingual baselines. On MTEB v2 (multilingual), mmBERT base leads with 54.1, a 1.5-point gain over XLM-R.
On code retrieval (CoIR), mmBERT base achieves 42.2, outperforming all massively multilingual models except EuroBERT-210m, which benefits from proprietary data.
Low-Resource Language Learning
A key result is the rapid performance gain on languages introduced only in the decay phase. For Tigrinya and Faroese, mmBERT base shows a 68% and 26% increase in F1, respectively, when these languages are included in the decay phase. On FoQA (Faroese), mmBERT base surpasses Gemini 2.5 Pro and o3 by 6 and 8.3 points, respectively.
Figure 2: Performance improvements on Tigrinya and Faroese when these languages are introduced in the decay phase, demonstrating the efficacy of the annealed language learning schedule.
Efficiency
mmBERT models are significantly more efficient than prior multilingual encoders, enabled by Flash Attention 2 and unpadding inherited from ModernBERT. mmBERT base is over 2x faster on variable sequence lengths and up to 4x faster on long contexts compared to XLM-R and MiniLM, which are limited to 512 tokens.
Figure 3: Throughput efficiency for various sequence lengths, with mmBERT demonstrating superior speed and scalability to 8192 tokens.
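For long-context inference, the Flash Attention 2 kernels can be requested at load time through the standard Hugging Face `attn_implementation` argument; this is a usage sketch, and the Hub identifier below is an assumption.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed Hub identifier for the released base checkpoint.
model_id = "jhu-clsp/mmBERT-base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # FA2 requires fp16/bf16
    attn_implementation="flash_attention_2",  # requires the flash-attn package
).to("cuda").eval()

# Encode a long document; the architecture supports up to 8192 tokens.
inputs = tokenizer(
    "example long document text", return_tensors="pt",
    truncation=True, max_length=8192,
).to("cuda")
with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state
```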
Comparative Analysis
mmBERT outperforms EuroBERT-210m on EuroBERT's in-distribution languages for both XNLI and PAWS-X. Compared to decoder models of similar size (e.g., Gemma 3 270M), mmBERT small achieves higher scores on XNLI (73.6 vs 69.0) and GLUE (84.7 vs 82.9), reinforcing the advantage of encoder-only architectures for NLU and retrieval.
Practical and Theoretical Implications
The staged, annealed approach to language inclusion and sampling provides a scalable recipe for massively multilingual pretraining, especially when high-quality data is unevenly distributed. The results suggest that late-phase introduction of low-resource languages, combined with aggressive mask rate reduction, can yield rapid and substantial gains with minimal data. The efficiency improvements make mmBERT suitable for real-world deployment in multilingual retrieval and classification pipelines, especially where inference speed and context length are critical.
Theoretically, the findings support the hypothesis that transfer from high-resource to low-resource languages is maximized when the latter are introduced after a strong multilingual base is established. The annealed temperature schedule further mitigates the risk of overfitting and catastrophic forgetting.
Future Directions
Potential avenues for further research include:
- Adopting improved tokenizers (e.g., Gemma 3 with prefix space) to address structured prediction deficits.
- Exploring more sophisticated model merging strategies for small models.
- Extending the ALL schedule to decoder-only and encoder-decoder architectures.
- Incorporating higher-quality or synthetic data for extremely low-resource languages.
- Investigating the impact of parallel data and cross-lingual transfer in the ALL framework.
Conclusion
mmBERT establishes a new standard for massively multilingual encoder-only models, combining architectural efficiency with a principled, staged training regimen. The cascading annealed language learning schedule and inverse mask rate schedule are empirically validated to yield strong gains, particularly for low-resource languages. mmBERT is positioned as a robust, open-source alternative to XLM-R and demonstrates that encoder-only models remain highly competitive for NLU and retrieval tasks across the multilingual spectrum.