- The paper presents XTREME, a benchmark assessing multilingual NLP models across 40 languages and nine diverse tasks.
- It evaluates models on tasks such as NLI, POS tagging, QA, and sentence retrieval, revealing a sizable performance gap between English and other languages.
- Key findings show that translate-train and translate-test strategies boost performance, and that XLM-R excels at zero-shot cross-lingual transfer.
XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization
XTREME introduces a comprehensive benchmark designed to evaluate the cross-lingual generalization capabilities of multilingual NLP models. It addresses a gap in the evaluation of multilingual models by covering 40 languages and nine diverse NLP tasks. The authors highlight that while machine learning models reach human-level performance on many English tasks, a significant gap remains when the same models are evaluated on other languages, especially on syntactic and sentence-retrieval tasks.
Benchmark Design and Tasks
XTREME follows a set of guiding principles that include task difficulty, task diversity, training efficiency, multilinguality, availability of sufficient monolingual data, and data accessibility. The selected tasks span multiple NLP problems:
- Classification: Includes XNLI and PAWS-X for natural language inference (NLI) and paraphrase detection.
- Structured Prediction: Features POS tagging from Universal Dependencies and named entity recognition (NER) from WikiAnn.
- Question Answering (QA): Incorporates XQuAD, MLQA, and TyDiQA-GoldP for span-extraction QA over various languages.
- Sentence Retrieval: Encompasses BUCC and Tatoeba, which test a model's ability to find parallel sentences across languages.
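Retrieval tasks like Tatoeba are commonly scored without any fine-tuning: each sentence is embedded (e.g., by mean-pooling encoder states), and for every source sentence the most cosine-similar target sentence is retrieved. A minimal sketch of that scoring scheme, with toy embedding vectors standing in for real encoder outputs:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve_nearest(src_embs, tgt_embs):
    """For each source embedding, index of the most similar target."""
    return [max(range(len(tgt_embs)), key=lambda j: cosine(s, tgt_embs[j]))
            for s in src_embs]

def retrieval_accuracy(src_embs, tgt_embs):
    """Fraction of sources whose nearest target is the gold-aligned
    one (gold pairs are assumed to share the same list index)."""
    preds = retrieve_nearest(src_embs, tgt_embs)
    return sum(p == i for i, p in enumerate(preds)) / len(preds)

# Toy example: two aligned sentence pairs in embedding space.
src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[0.9, 0.1], [0.1, 0.9]]
print(retrieval_accuracy(src, tgt))  # 1.0
```

In practice the embeddings come from a pretrained multilingual encoder; the accuracy computation itself is this simple nearest-neighbor check.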
Evaluation Protocol
XTREME can be used to assess models' zero-shot cross-lingual transfer capabilities. In this setup, models are fine-tuned on English training data and evaluated on test sets in other languages. This standardization facilitates benchmarking and comparison across models trained on different datasets.
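Results under this protocol are reported per language and then summarized by averaging: each task's metric is averaged over its evaluation languages, and task means can be macro-averaged into one headline number. A small sketch of that aggregation (the score values below are made up for illustration):

```python
def cross_lingual_average(scores):
    """Average each task's metric over its evaluation languages.
    `scores` maps task name -> {language code: metric value}."""
    return {task: sum(by_lang.values()) / len(by_lang)
            for task, by_lang in scores.items()}

def overall_score(scores):
    """Macro-average the per-task means into a single summary value."""
    per_task = cross_lingual_average(scores)
    return sum(per_task.values()) / len(per_task)

# Illustrative (not real) per-language scores for two tasks.
scores = {
    "xnli": {"en": 0.9, "de": 0.8},
    "ner":  {"en": 0.8, "de": 0.6},
}
print(cross_lingual_average(scores))
print(overall_score(scores))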
Baselines and State-of-the-Art Models
The paper evaluates several strong baselines, including:
- mBERT (Multilingual BERT): Pretrained on Wikipedias of 104 languages.
- XLM and XLM-R: XLM adds a translation language modeling objective; XLM-R scales up with a much larger CommonCrawl pretraining corpus.
- MMTE (Massively Multilingual Translation Encoder): Part of a neural machine translation model trained on in-house parallel data.
Additionally, the authors examine:
- Translate-train: Translating English training data into target languages.
- Translate-test: Translating target language test data into English for evaluation with English-trained models.
- In-language training: Fine-tuning directly on gold-standard target-language data for a task, where such data exists.
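The two translation-based baselines differ only in where the translation step sits in the pipeline: before training or before evaluation. The sketch below makes that explicit; the `translate` function is a hypothetical stand-in for a real machine translation system.

```python
def translate(text, src, tgt):
    """Hypothetical stand-in for an MT system; a real pipeline would
    call an actual translation model here."""
    return f"[{src}->{tgt}] {text}"

def translate_train(en_train, target_lang):
    """Translate-train: translate English (input, label) training pairs
    into the target language, then fine-tune on the translated data."""
    return [(translate(x, "en", target_lang), y) for x, y in en_train]

def translate_test(test_set, source_lang):
    """Translate-test: translate target-language test inputs into
    English and score them with the English-trained model."""
    return [(translate(x, source_lang, "en"), y) for x, y in test_set]

train_de = translate_train([("a cat sits", "entailment")], "de")
test_en = translate_test([("ein Hund bellt", "neutral")], "de")
print(train_de)
print(test_en)
```

Translate-train needs one model (and one translation pass) per target language, while translate-test keeps a single English model and pays the translation cost at inference time.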
Findings and Implications
Key insights from the results include:
- XLM-R consistently outperforms other models in zero-shot transfer scenarios, though syntactic tasks remain challenging across the board.
- Translate-train and translate-test strategies provide significant performance boosts over zero-shot models, particularly for complex QA tasks.
- For structured prediction tasks like POS tagging and NER, having a small set of in-language examples yields performance gains, indicating that even limited in-language data can be highly beneficial.
Language Coverage and Challenges
XTREME’s multi-language evaluations reveal that Indo-European languages, especially those with larger Wikipedia corpora, generally perform better. Conversely, languages from Sino-Tibetan, Niger-Congo, and other families typically display lower performance. The paper also shows that multilingual models struggle with syntactic transfer between languages, particularly when faced with unseen tag sequences or entities.
Future Directions
The findings encourage further research in several areas:
- Development of models that can better generalize across diverse languages and linguistic features.
- Leveraging unsupervised methods and augmenting datasets in low-resource languages.
- Improving tokenization techniques to better handle typologically diverse languages and complex scripts.
In summary, XTREME establishes a benchmark that exposes the strengths and limitations of current multilingual models while driving future work in cross-lingual and multilingual NLP. By extending evaluation to 40 languages and multiple task types, it substantially enriches our understanding of linguistic generalization in NLP models, pushing the field toward more inclusive and universally effective solutions.