MLQA: A Multilingual Benchmark for Question Answering
The paper introduces MLQA, a benchmark dataset designed to evaluate cross-lingual extractive question answering (QA) across a diverse set of languages. The authors address a significant gap in resources for non-English QA by providing parallel QA data that allows directly comparable evaluation across seven languages: English, Arabic, German, Spanish, Hindi, Vietnamese, and Simplified Chinese.
Motivation and Context
While QA systems in English have advanced rapidly, progress in non-English languages lags due to the scarcity of high-quality training and evaluation datasets. This scarcity prevents the measurement of true multilingual performance and limits the ability to train effective non-English QA models. To tackle these issues, MLQA offers a substantial aligned QA dataset that aims to promote cross-lingual research.
Dataset Construction
MLQA is built from naturally occurring parallel sentences mined from Wikipedia, so the context passages are authentic text in each language rather than translations. Questions are written in English and then professionally translated into the target languages, with answer spans annotated directly in the target-language contexts. The result is a dataset with more than 46,000 QA annotations, where each instance is parallel between four languages on average.
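Concretely, MLQA is distributed in the SQuAD-style JSON layout, so existing extractive-QA tooling can consume it directly. The snippet below is a minimal sketch of that record format and of recovering an answer from its character offset; the passage, question, and id are invented for illustration rather than taken from the released data.

```python
# Minimal sketch of a SQuAD-style record; MLQA ships its data in this layout.
# The passage, question, id, and offset below are invented for illustration.
record = {
    "context": "Bern is the de facto capital of Switzerland.",
    "qas": [
        {
            "id": "example-0001",
            "question": "What is the de facto capital of Switzerland?",
            "answers": [{"text": "Bern", "answer_start": 0}],
        }
    ],
}

def extract_answer(record: dict, qa_index: int = 0) -> str:
    """Recover the gold answer span from its character offset."""
    answer = record["qas"][qa_index]["answers"][0]
    start = answer["answer_start"]
    return record["context"][start : start + len(answer["text"])]

assert extract_answer(record) == "Bern"
```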
Tasks and Evaluation
The authors define two primary tasks for evaluating multilingual QA performance:
- Cross-lingual Transfer (XLT): the question and context are in the same target language; models trained on English data are evaluated directly in that language.
- Generalized Cross-lingual Transfer (G-XLT): the question and context are in different languages, a setting made possible by MLQA's parallel construction.
For both tasks, zero-shot evaluation forms the core of their approach, where models trained on English datasets like SQuAD are directly applied to MLQA’s multilingual data.
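MLQA is scored with the familiar exact-match (EM) and token-overlap F1 metrics; the official evaluation script adapts SQuAD's scorer with language-specific answer normalization. The sketch below reproduces only the standard English-style EM/F1 computation as a reference point, not the official multilingual scorer.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """English-style SQuAD normalization: lowercase, drop punctuation,
    articles, and extra whitespace. (The official MLQA scorer generalizes
    this step per language.)"""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def f1(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Bern", "Bern"))            # 1.0 after normalization
print(round(f1("in Bern, Switzerland", "Bern"), 2))  # 0.5
```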
Baseline Models and Results
The paper evaluates several baseline models:
- Translate-Train: machine-translates the English SQuAD training data into the target language and fine-tunes on the result.
- Translate-Test: machine-translates MLQA test contexts and questions into English, answers with an English model, and maps the predicted span back into the target language.
- Cross-lingual Representation Models: fine-tune pre-trained multilingual encoders such as multilingual BERT and XLM on English SQuAD and apply them zero-shot (sketched below).
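In practice, the third baseline family reduces to fine-tuning a multilingual encoder on English SQuAD and running it unchanged on MLQA's test data. Below is a minimal sketch of that zero-shot inference step using the Hugging Face transformers question-answering pipeline; the checkpoint name is a placeholder for any multilingual model fine-tuned on English SQuAD, not a model released with the paper.

```python
# Minimal sketch of zero-shot cross-lingual inference with a multilingual
# extractive-QA model. Requires `pip install transformers`.
# The checkpoint name is a placeholder: substitute any multilingual encoder
# (e.g. mBERT- or XLM-based) fine-tuned on English SQuAD.
from transformers import pipeline

qa = pipeline("question-answering", model="your-org/multilingual-qa-squad-en")

# XLT: question and context share a target language (here German).
xlt = qa(
    question="Was ist die Hauptstadt der Schweiz?",
    context="Bern ist die De-facto-Hauptstadt der Schweiz.",
)

# G-XLT: question and context languages differ (English question, German context).
gxlt = qa(
    question="What is the capital of Switzerland?",
    context="Bern ist die De-facto-Hauptstadt der Schweiz.",
)

print(xlt["answer"], xlt["score"])
print(gxlt["answer"], gxlt["score"])
```

The same call covers both XLT and G-XLT, since only the languages of the inputs change; translate-test would instead run an English model on machine-translated inputs and map the predicted span back afterwards.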
Among these baselines, XLM emerged as the most effective, performing robustly across several languages, though a significant performance gap remains relative to English.
Analysis and Findings
The analysis highlights the difficulties multilingual models face, particularly for Arabic and Hindi, which differ most from English linguistically. Broken down by question type, "When" and "Who" questions are generally easier for models, while "Where" and "How" questions are harder. Intriguingly, questions that are difficult in English are not necessarily the ones that are difficult in the target languages, hinting at the nuanced challenges of cross-lingual transfer.
Implications and Future Work
MLQA is an important step toward putting multilingual QA evaluation on the same footing as English. By providing a consistent evaluation framework, it supports model development for under-served languages. Future directions might include leveraging MLQA for few-shot learning scenarios or exploring training datasets beyond SQuAD.
In conclusion, MLQA offers significant contributions to cross-lingual QA research, emphasizing the need for consistent and comprehensive multilingual benchmarks to drive future advances in the field.