- The paper presents XTREME, a benchmark assessing multilingual NLP models across 40 languages and nine diverse tasks.
- It evaluates models on tasks such as NLI, POS tagging, QA, and sentence retrieval, revealing a sizable performance gap between English and other languages.
- Key findings show that translate-train and translate-test strategies boost performance, and that XLM-R excels at zero-shot cross-lingual transfer.
XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization
XTREME introduces a comprehensive benchmark designed to evaluate the cross-lingual generalization capabilities of multilingual NLP models. It addresses a gap in the evaluation of multilingual models by covering 40 languages and nine diverse NLP tasks. The authors highlight that while machine learning models reach human-level performance on many English tasks, a significant gap remains when the same models are evaluated on other languages, especially on syntactic and sentence-retrieval tasks.
Benchmark Design and Tasks
XTREME follows a set of guiding principles that include task difficulty, task diversity, training efficiency, multilinguality, availability of sufficient monolingual data, and data accessibility. The selected tasks span multiple NLP problems:
- Classification: Includes XNLI and PAWS-X for natural language inference (NLI) and paraphrase detection.
- Structured Prediction: Features POS tagging from Universal Dependencies and named entity recognition (NER) from WikiAnn.
- Question Answering (QA): Incorporates XQuAD, MLQA, and TyDiQA-GoldP for span-extraction QA over various languages.
- Sentence Retrieval: Encompasses BUCC and Tatoeba, which test a model's ability to find parallel sentences across languages.
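Retrieval tasks like Tatoeba are commonly scored without any fine-tuning: each sentence is embedded (e.g., by mean-pooling encoder states), and for every source sentence the most cosine-similar target sentence is retrieved. A minimal sketch of that scoring scheme, with toy embedding vectors standing in for real encoder outputs:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve_nearest(src_embs, tgt_embs):
    """For each source embedding, index of the most similar target."""
    return [max(range(len(tgt_embs)), key=lambda j: cosine(s, tgt_embs[j]))
            for s in src_embs]

def retrieval_accuracy(src_embs, tgt_embs):
    """Fraction of sources whose nearest target is the gold-aligned
    one (gold pairs are assumed to share the same list index)."""
    preds = retrieve_nearest(src_embs, tgt_embs)
    return sum(p == i for i, p in enumerate(preds)) / len(preds)

# Toy example: two aligned sentence pairs in embedding space.
src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[0.9, 0.1], [0.1, 0.9]]
print(retrieval_accuracy(src, tgt))  # 1.0
```

In practice the embeddings come from a pretrained multilingual encoder; the accuracy computation itself is this simple nearest-neighbor check.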
Evaluation Protocol
XTREME can be used to assess models' zero-shot cross-lingual transfer capabilities. In this setup, models are fine-tuned on English training data and evaluated on test sets in other languages. This standardization facilitates benchmarking and comparison across models trained on different datasets.
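Results under this protocol are reported per language and then summarized by averaging: each task's metric is averaged over its evaluation languages, and task means can be macro-averaged into one headline number. A small sketch of that aggregation (the score values below are made up for illustration):

```python
def cross_lingual_average(scores):
    """Average each task's metric over its evaluation languages.
    `scores` maps task name -> {language code: metric value}."""
    return {task: sum(by_lang.values()) / len(by_lang)
            for task, by_lang in scores.items()}

def overall_score(scores):
    """Macro-average the per-task means into a single summary value."""
    per_task = cross_lingual_average(scores)
    return sum(per_task.values()) / len(per_task)

# Illustrative (not real) per-language scores for two tasks.
scores = {
    "xnli": {"en": 0.9, "de": 0.8},
    "ner":  {"en": 0.8, "de": 0.6},
}
print(cross_lingual_average(scores))
print(overall_score(scores))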
Baselines and State-of-the-Art Models
The paper evaluates several strong baselines, including:
- mBERT (Multilingual BERT): Pretrained on Wikipedias of 104 languages.
- XLM and XLM-R: XLM adds a translation language modeling objective; XLM-R scales up with a much larger CommonCrawl pretraining corpus.
- MMTE (Massively Multilingual Translation Encoder): Part of a neural machine translation model trained on in-house parallel data.
Additionally, the authors examine:
- Translate-train: Translating English training data into target languages.
- Translate-test: Translating target language test data into English for evaluation with English-trained models.
- In-language training: Fine-tuning directly on gold-standard target-language data for a task, where such data exists.
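The two translation-based baselines differ only in where the translation step sits in the pipeline: before training or before evaluation. The sketch below makes that explicit; the `translate` function is a hypothetical stand-in for a real machine translation system.

```python
def translate(text, src, tgt):
    """Hypothetical stand-in for an MT system; a real pipeline would
    call an actual translation model here."""
    return f"[{src}->{tgt}] {text}"

def translate_train(en_train, target_lang):
    """Translate-train: translate English (input, label) training pairs
    into the target language, then fine-tune on the translated data."""
    return [(translate(x, "en", target_lang), y) for x, y in en_train]

def translate_test(test_set, source_lang):
    """Translate-test: translate target-language test inputs into
    English and score them with the English-trained model."""
    return [(translate(x, source_lang, "en"), y) for x, y in test_set]

train_de = translate_train([("a cat sits", "entailment")], "de")
test_en = translate_test([("ein Hund bellt", "neutral")], "de")
print(train_de)
print(test_en)
```

Translate-train needs one model (and one translation pass) per target language, while translate-test keeps a single English model and pays the translation cost at inference time.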
Findings and Implications
Key insights from the results include:
- XLM-R consistently outperforms other models in zero-shot transfer scenarios, though syntactic tasks remain challenging across the board.
- Translate-train and translate-test strategies provide significant performance boosts over zero-shot models, particularly for complex QA tasks.
- For structured prediction tasks like POS tagging and NER, having a small set of in-language examples yields performance gains, indicating that even limited in-language data can be highly beneficial.
Language Coverage and Challenges
XTREME’s multi-language evaluations reveal that Indo-European languages, especially those with larger Wikipedia corpora, generally perform better. Conversely, languages from Sino-Tibetan, Niger-Congo, and other families typically display lower performance. The paper also shows that multilingual models struggle with syntactic transfer between languages, particularly when faced with unseen tag sequences or entities.
Future Directions
The findings encourage further research in several areas:
- Development of models that can better generalize across diverse languages and linguistic features.
- Leveraging unsupervised methods and augmenting datasets in low-resource languages.
- Improving tokenization techniques to better handle typologically diverse languages and complex scripts.
In summary, XTREME establishes a benchmark that exposes the strengths and limitations of current multilingual models while driving future work in cross-lingual and multilingual NLP. By extending evaluation to 40 languages and multiple task types, it substantially enriches our understanding of linguistic generalization in NLP models, pushing the field toward more inclusive and universally effective solutions.