MLQA: A Multilingual Benchmark for Question Answering
The paper introduces MLQA, a benchmark dataset designed to evaluate cross-lingual extractive question answering (QA) across a diverse set of languages. The authors address a significant gap in resources for non-English QA by providing parallel QA data that allows directly comparable evaluation across seven languages: English, Arabic, German, Spanish, Hindi, Vietnamese, and Simplified Chinese.
Motivation and Context
While QA systems in English have advanced rapidly, progress in non-English languages lags due to the scarcity of high-quality training and evaluation datasets. This scarcity prevents the measurement of true multilingual performance and limits the ability to train effective non-English QA models. To tackle these issues, MLQA offers a substantial aligned QA dataset that aims to promote cross-lingual research.
Dataset Construction
MLQA is built from naturally occurring parallel sentences mined from Wikipedia, so the context passages are authentic text in each language rather than translations. Questions are written in English and then professionally translated into the target languages, with answer spans annotated directly in the target-language contexts. The result is a dataset with more than 46,000 QA annotations, where each instance is parallel between four languages on average.
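Concretely, MLQA is distributed in the SQuAD-style JSON layout, so existing extractive-QA tooling can consume it directly. The snippet below is a minimal sketch of that record format and of recovering an answer from its character offset; the passage, question, and id are invented for illustration rather than taken from the released data.

```python
# Minimal sketch of a SQuAD-style record; MLQA ships its data in this layout.
# The passage, question, id, and offset below are invented for illustration.
record = {
    "context": "Bern is the de facto capital of Switzerland.",
    "qas": [
        {
            "id": "example-0001",
            "question": "What is the de facto capital of Switzerland?",
            "answers": [{"text": "Bern", "answer_start": 0}],
        }
    ],
}

def extract_answer(record: dict, qa_index: int = 0) -> str:
    """Recover the gold answer span from its character offset."""
    answer = record["qas"][qa_index]["answers"][0]
    start = answer["answer_start"]
    return record["context"][start : start + len(answer["text"])]

assert extract_answer(record) == "Bern"
```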
Tasks and Evaluation
The authors define two primary tasks for evaluating multilingual QA performance:
- Cross-lingual Transfer (XLT): the question and context are in the same target language; models trained on English data are evaluated directly in that language.
- Generalized Cross-lingual Transfer (G-XLT): the question and context are in different languages, a setting made possible by MLQA's parallel construction.
For both tasks, zero-shot evaluation forms the core of their approach, where models trained on English datasets like SQuAD are directly applied to MLQA’s multilingual data.
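MLQA is scored with the familiar exact-match (EM) and token-overlap F1 metrics; the official evaluation script adapts SQuAD's scorer with language-specific answer normalization. The sketch below reproduces only the standard English-style EM/F1 computation as a reference point, not the official multilingual scorer.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """English-style SQuAD normalization: lowercase, drop punctuation,
    articles, and extra whitespace. (The official MLQA scorer generalizes
    this step per language.)"""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def f1(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Bern", "Bern"))            # 1.0 after normalization
print(round(f1("in Bern, Switzerland", "Bern"), 2))  # 0.5
```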
Baseline Models and Results
The paper evaluates several baseline models:
- Translate-Train: machine-translates the English SQuAD training data into the target language and fine-tunes on the result.
- Translate-Test: machine-translates MLQA test contexts and questions into English, answers with an English model, and maps the predicted span back into the target language.
- Cross-lingual Representation Models: fine-tune pre-trained multilingual encoders such as multilingual BERT and XLM on English SQuAD and apply them zero-shot (sketched below).
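In practice, the third baseline family reduces to fine-tuning a multilingual encoder on English SQuAD and running it unchanged on MLQA's test data. Below is a minimal sketch of that zero-shot inference step using the Hugging Face transformers question-answering pipeline; the checkpoint name is a placeholder for any multilingual model fine-tuned on English SQuAD, not a model released with the paper.

```python
# Minimal sketch of zero-shot cross-lingual inference with a multilingual
# extractive-QA model. Requires `pip install transformers`.
# The checkpoint name is a placeholder: substitute any multilingual encoder
# (e.g. mBERT- or XLM-based) fine-tuned on English SQuAD.
from transformers import pipeline

qa = pipeline("question-answering", model="your-org/multilingual-qa-squad-en")

# XLT: question and context share a target language (here German).
xlt = qa(
    question="Was ist die Hauptstadt der Schweiz?",
    context="Bern ist die De-facto-Hauptstadt der Schweiz.",
)

# G-XLT: question and context languages differ (English question, German context).
gxlt = qa(
    question="What is the capital of Switzerland?",
    context="Bern ist die De-facto-Hauptstadt der Schweiz.",
)

print(xlt["answer"], xlt["score"])
print(gxlt["answer"], gxlt["score"])
```

The same call covers both XLT and G-XLT, since only the languages of the inputs change; translate-test would instead run an English model on machine-translated inputs and map the predicted span back afterwards.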
Among these baselines, XLM emerged as the most effective, performing robustly across several languages, though a significant performance gap remains relative to English.
Analysis and Findings
The analysis highlights the difficulties multilingual models face, particularly for Arabic and Hindi, which differ most from English linguistically. Broken down by question type, "When" and "Who" questions are generally easier for models, while "Where" and "How" questions are harder. Intriguingly, questions that are difficult in English are not necessarily the ones that are difficult in the target languages, hinting at the nuanced challenges of cross-lingual transfer.
Implications and Future Work
MLQA is an important step toward putting multilingual QA evaluation on the same footing as English. By providing a consistent evaluation framework, it supports model development for under-served languages. Future directions might include leveraging MLQA for few-shot learning scenarios or exploring training datasets beyond SQuAD.
In conclusion, MLQA offers significant contributions to cross-lingual QA research, emphasizing the need for consistent and comprehensive multilingual benchmarks to drive future advances in the field.