mT5: A massively multilingual pre-trained text-to-text transformer (2010.11934v3)

Published 22 Oct 2020 in cs.CL

Abstract: The recent "Text-to-Text Transfer Transformer" (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on a wide variety of English-language NLP tasks. In this paper, we introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages. We detail the design and modified training of mT5 and demonstrate its state-of-the-art performance on many multilingual benchmarks. We also describe a simple technique to prevent "accidental translation" in the zero-shot setting, where a generative model chooses to (partially) translate its prediction into the wrong language. All of the code and model checkpoints used in this work are publicly available.

mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

The paper presents mT5, a multilingual extension of the T5 (Text-to-Text Transfer Transformer) model. mT5 is pre-trained on 101 languages using a new dataset, mC4, derived from the Common Crawl web corpus. This paper meticulously describes the mT5 model's architecture, training procedures, and performance on an array of multilingual benchmarks, positioning mT5 as a state-of-the-art model in multilingual NLP.

Introduction to mT5

The T5 model's distinctive feature is its unified text-to-text format, wherein all tasks are cast as text generation problems. Building on T5's architecture, mT5 aims to extend its capabilities to multilingual contexts, leveraging the massive linguistic diversity encapsulated in mC4, which consists of natural text in 101 languages. Importantly, mT5 adheres to T5's design principles to maintain consistency and maximize the benefits derived from T5's empirical foundations.
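
To make the text-to-text framing concrete, the sketch below shows how a classification example (here, XNLI-style natural language inference) can be serialized into an (input string, target string) pair. The task prefix and label strings are illustrative assumptions, not the authors' exact preprocessing.

```python
# Illustrative sketch (not the authors' exact preprocessing): casting an
# XNLI-style classification example into the text-to-text format used by
# T5/mT5, where both inputs and targets are plain strings.

def xnli_to_text_to_text(premise: str, hypothesis: str, label: str) -> dict:
    """Format a natural language inference example as a text-to-text pair.

    The "xnli" task prefix and the string labels are illustrative assumptions;
    T5-style pipelines only require that every task be expressed as
    (input text, target text).
    """
    input_text = f"xnli hypothesis: {hypothesis} premise: {premise}"
    target_text = label  # e.g. "entailment", "neutral", or "contradiction"
    return {"inputs": input_text, "targets": target_text}


example = xnli_to_text_to_text(
    premise="The cat sat on the mat.",
    hypothesis="An animal is resting.",
    label="entailment",
)
print(example["inputs"])
print(example["targets"])
```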

mC4: Dataset and Preparation

The mC4 dataset extends T5's C4 dataset to cover many languages. Language identification is performed with the cld3 tool, yielding 107 language tags, six of which are merely script variants of other languages in the set. The corpus is filtered extensively for quality: pages with too little substantial text (fewer than three lines of at least 200 characters) are dropped, as are pages whose language-identification confidence falls below 70%.
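
As a rough illustration of this kind of page-level filtering, the sketch below keeps a page only if it has at least three lines of 200+ characters and its language is identified with at least 70% confidence. The `detect_language` callable stands in for a cld3-style identifier and is an assumption of this sketch, not the paper's exact pipeline.

```python
# Minimal sketch of mC4-style page filtering under the thresholds described
# above: a line-length filter plus a language-ID confidence filter.

from typing import Callable, Optional, Tuple

MIN_LONG_LINES = 3       # lines of substantial text required per page
MIN_LINE_CHARS = 200     # characters required for a line to count
MIN_LANG_CONFIDENCE = 0.70

def keep_page(
    page_text: str,
    detect_language: Callable[[str], Tuple[str, float]],
) -> Optional[str]:
    """Return the page's language code if it passes the filters, else None."""
    long_lines = [ln for ln in page_text.splitlines() if len(ln) >= MIN_LINE_CHARS]
    if len(long_lines) < MIN_LONG_LINES:
        return None  # too little substantial text on the page

    language, confidence = detect_language(page_text)
    if confidence < MIN_LANG_CONFIDENCE:
        return None  # language identification too uncertain to trust

    return language


# Dummy detector for demonstration only; a real pipeline would call cld3.
dummy_detect = lambda text: ("en", 0.99)
page = "\n".join(["x" * 250, "y" * 250, "z" * 250])
print(keep_page(page, dummy_detect))  # -> "en"
```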

mT5 Model Architecture

mT5 retains the encoder-decoder structure of T5 and uses the same pre-training objective: span corruption, in which spans of text are masked and the model is trained to reconstruct them. The model employs a SentencePiece vocabulary enlarged to 250,000 tokens to handle the diverse scripts of the 101 languages. Additionally, a language sampling exponent, α, controls how often each language is drawn during pre-training, balancing between sampling low-resource languages too often (risking overfitting) and too rarely (risking underfitting).
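
The sampling rule itself is simple: examples from a language L are drawn with probability proportional to |L|^α, where |L| is the amount of data in that language; the paper settles on α = 0.3. The sketch below computes these probabilities for made-up corpus sizes to show how α < 1 boosts low-resource languages; the sizes are purely illustrative.

```python
# Sketch of exponent-smoothed language sampling: p(L) is proportional to
# |L| ** alpha. The corpus sizes below are made-up illustrations.

def sampling_probabilities(corpus_sizes: dict, alpha: float = 0.3) -> dict:
    """Map language -> sampling probability under exponent-alpha smoothing."""
    weights = {lang: size ** alpha for lang, size in corpus_sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


# Hypothetical sizes: one high-resource and one low-resource language.
sizes = {"en": 3_000_000_000, "sw": 1_000_000}
print(sampling_probabilities(sizes, alpha=1.0))  # raw proportions
print(sampling_probabilities(sizes, alpha=0.3))  # low-resource share boosted
```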

Experimental Evaluation

The performance of mT5 is rigorously evaluated on several multilingual benchmarks, including XNLI, XQuAD, MLQA, TyDi QA, WikiAnn NER, and PAWS-X. Across these tasks, mT5 demonstrates superior performance, particularly when scaled to larger model sizes (up to 13 billion parameters). The experiments show that larger models can generalize better across languages, reducing the gap between zero-shot and translate-train performance.

Handling Zero-Shot Generation Challenges

One notable challenge in zero-shot multilingual settings is "accidental translation," where the model inadvertently (partially) translates its prediction into the wrong language, typically English, because the fine-tuning data is English-only. The authors address this by mixing a small amount of the unsupervised multilingual pre-training task into the fine-tuning data, keeping the model exposed to all languages throughout fine-tuning. This simple yet effective technique substantially reduces incorrect-language generation and improves the reliability of zero-shot predictions.
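
A minimal sketch of this mitigation appears below: during fine-tuning, a small fraction of examples is drawn from the unsupervised multilingual span-corruption stream instead of the labeled task. The 1% default mixing rate and the data interfaces are illustrative assumptions, not the exact ratio or setup reported in the paper.

```python
# Sketch of mixing unsupervised multilingual examples into fine-tuning.
# Both streams are assumed to yield text-to-text dicts with "inputs"/"targets".

import itertools
import random
from typing import Dict, Iterable, Iterator

def mixed_finetuning_stream(
    finetune_examples: Iterable[Dict[str, str]],
    pretrain_examples: Iterable[Dict[str, str]],
    unsup_rate: float = 0.01,
    seed: int = 0,
) -> Iterator[Dict[str, str]]:
    """Yield fine-tuning examples, occasionally substituting an unsupervised one."""
    rng = random.Random(seed)
    # Cycle the unsupervised stream so it never runs out during fine-tuning.
    unsup = itertools.cycle(pretrain_examples)
    for example in finetune_examples:
        if rng.random() < unsup_rate:
            yield next(unsup)  # keeps all languages "in domain" during fine-tuning
        else:
            yield example


labeled = [{"inputs": f"xnli example {i}", "targets": "entailment"} for i in range(5)]
unsupervised = [{"inputs": "span-corruption input", "targets": "span-corruption target"}]
for ex in mixed_finetuning_stream(labeled, unsupervised, unsup_rate=0.5, seed=1):
    print(ex["targets"])
```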

Results and Discussion

The results affirm mT5's strengths in multilingual understanding and generation tasks. The model achieves state-of-the-art results across diverse NLP tasks, highlighting its robustness and effectiveness. The insights gathered from various ablation studies further underscore critical design choices that contributed to mT5's success. Additionally, the paper discusses the implications of model size on performance, indicating that capacity scaling is crucial for handling diverse multilingual data effectively.

Implications and Future Directions

From a practical perspective, mT5's ability to perform well across numerous languages without significant performance degradation opens new avenues for deploying NLP systems in multilingual and low-resource settings. On the theoretical side, the work sheds light on how large language models balance the trade-offs involved in multilingual pre-training and data-distribution strategies.

Future work could explore more sophisticated data sampling techniques, investigate the impacts of even larger model scales, or explore the nuances of cross-lingual transfer learning. Furthermore, extending this research to include more complex multilingual generative tasks like summarization or dialogue might yield further insights.

Conclusion

The paper contributes significantly to the field of multilingual NLP by presenting mT5, a model that successfully extends the T5 framework to a massive multilingual context, supported by the mC4 dataset. With state-of-the-art results across various benchmarks and improved methodologies for managing language generation challenges, mT5 stands as a vital resource for the NLP community. The public release of code and pre-trained models ensures that this work can serve as a foundational tool for future research in multilingual language understanding and generation.

Authors (8)
  1. Linting Xue
  2. Noah Constant
  3. Adam Roberts
  4. Mihir Kale
  5. Rami Al-Rfou
  6. Aditya Siddhant
  7. Aditya Barua
  8. Colin Raffel
Citations (2,237)