XQuAD: Cross-lingual QA Dataset
- XQuAD is a cross-lingual extractive QA benchmark derived from the SQuAD v1.1 development set, enabling zero-shot evaluation across ten target languages.
- The dataset relies on professional translation and unique answer-span placeholders to keep answer spans consistently aligned across all language versions.
- It employs SQuAD evaluation metrics (EM and F₁) and highlights challenges such as vocabulary capacity and embedding learning instability in diverse languages.
The Cross-lingual Question Answering Dataset (XQuAD) is a rigorously constructed evaluation resource designed to benchmark the zero-shot cross-lingual extractive question answering capabilities of pretrained language models. Built by professionally translating a subset of the SQuAD v1.1 development set into ten target languages, XQuAD enables controlled measurement of transfer performance in settings where models are trained on English and tested on diverse, typologically heterogeneous languages. The dataset preserves direct span alignment through a system of explicit answer-span placeholder tokens and standardized translation protocols, providing an unambiguous, gold-standard resource for evaluating multilingual question answering systems (Artetxe et al., 2019).
1. Construction Methodology
XQuAD derives from 48 SQuAD v1.1 documents, each contributing five sampled paragraphs, for 240 context passages in total. Each passage, along with its associated English questions and answers (1,190 question–answer pairs overall), is translated by professional linguists into ten target languages: Spanish (es), German (de), Greek (el), Russian (ru), Turkish (tr), Arabic (ar), Vietnamese (vi), Thai (th), Chinese (zh), and Hindi (hi). Named-entity transliteration follows per-language Wikipedia conventions.
To ensure cross-lingual answer span consistency, answer spans in English contexts are delimited with unique markers (e.g., *0*answer text#0#). Translators are instructed to preserve these markers exactly in the translated contexts. An online validator checks and enforces the placeholder protocol, eliminating annotation drift and ensuring precise answer alignment across all language versions (Artetxe et al., 2019).
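The placeholder check that such a validator performs can be sketched with a short regex test. The function names below are illustrative, not the project's actual tooling; the marker grammar follows the *k*…#k# convention described above:

```python
import re

# Matches one placeholder-delimited span: *k* ... #k#, where the
# closing index must equal the opening index (backreference \1).
MARKER = re.compile(r"\*(\d+)\*.*?#\1#")

def marker_indices(context: str) -> list[int]:
    """Return the sorted marker indices present in a context string."""
    return sorted(int(m.group(1)) for m in MARKER.finditer(context))

def placeholders_preserved(source: str, translation: str) -> bool:
    """True if the translation carries exactly the same marker
    indices as the English source context."""
    return marker_indices(source) == marker_indices(translation)
```

A translation that drops or renumbers a marker fails the check, which is how annotation drift between language versions can be caught automatically.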
2. Dataset Organization and Statistics
Every translated language retains the original 240 paragraphs and 1,190 question–answer pairs, so the per-language breakdown is strictly parallel. Counting English together with the ten translations, XQuAD therefore comprises 2,640 context passages and 13,090 question–answer pairs in total.
Average token counts (tokenized by Moses for all but Chinese, which uses Jieba) vary substantially by language and text type. For example, average paragraph tokens: English 142.4, Spanish 160.7, German 139.5, Hindi 232.4. Average question tokens: English 11.5, Hindi 18.7. Average answer tokens: English 3.1, Hindi 5.6. This diversity in linguistic structure and tokenization granularity is representative of the cross-lingual transfer challenge (Artetxe et al., 2019).
| Language | Avg. Paragraph Tokens | Avg. Question Tokens | Avg. Answer Tokens |
|---|---|---|---|
| English | 142.4 | 11.5 | 3.1 |
| Spanish | 160.7 | 13.4 | 3.6 |
| German | 139.5 | – | – |
| Hindi | 232.4 | 18.7 | 5.6 |
3. File Structure and Data Format
XQuAD adopts an extractive-QA data structure similar to SQuAD v1.1, with the main distinction being the explicit answer span placeholders. Each context is accompanied by its questions and corresponding ground truth answer spans, where answers are unique substrings of the context text delimited by the markers. The data is provided per-language, with the same set of question–answer pairs for each language.
A minimal JSON instance is as follows:
```json
{
  "id": "<unique-qa-id>",
  "context": "… text before *0*answer text#0# text after …",
  "qas": [
    {
      "question": "Translated question text …",
      "id": "<same-id-as-above>",
      "answers": [
        {
          "text": "*0*answer text#0#",
          "answer_start": <character_index_of_*0*>
        }
      ]
    }
  ]
}
```
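A minimal loader for this format can be sketched as follows; `load_instance` is a hypothetical helper, and the marker grammar is the *k*…#k# convention used throughout:

```python
import json
import re

# Captures the marker index (group 1) and the answer text between
# the delimiters (group 2).
MARKER = re.compile(r"\*(\d+)\*(.*?)#\1#")

def load_instance(raw: str):
    """Parse one JSON instance and recover each delimited answer span:
    marker index -> (answer text, character offset of the opening '*k*')."""
    item = json.loads(raw)
    spans = {int(m.group(1)): (m.group(2), m.start())
             for m in MARKER.finditer(item["context"])}
    return item, spans
```

The recovered offset points at the opening delimiter, matching the `answer_start` convention in the template above.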
4. Evaluation Protocol and Metrics
Model evaluation on XQuAD mirrors that of SQuAD v1.1, using:
- Exact Match (EM): A score of 1 if the model’s prediction, after standard answer normalization, exactly matches any ground-truth answer string (placeholders included); 0 otherwise.
- F₁ Score: Computed as the harmonic mean of token-level precision and recall between the predicted answer and gold answer, tokenized on whitespace or subword boundaries.
Let P be the set of predicted tokens and G the set of ground-truth tokens. Then precision = |P ∩ G| / |P|, recall = |P ∩ G| / |G|, and F₁ = 2 · precision · recall / (precision + recall). Evaluation operates over subword tokenizations (SentencePiece unigram model) with all placeholder tokens preserved. No additional language- or script-specific normalization is performed beyond this scheme (Artetxe et al., 2019).
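The metric definitions above can be sketched in Python, mirroring the logic of the official SQuAD scoring script (which uses a multiset token overlap). Note that the article-stripping step is English-specific and is typically skipped for other languages:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """SQuAD-style answer normalization: lowercase, drop punctuation
    and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # English-only step
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> int:
    """1 if normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 over the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # multiset overlap
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

In practice the per-question score is the maximum over all gold answers; XQuAD, like the SQuAD dev set, may list several acceptable spans.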
5. Experimental Practices and Recommended Splits
XQuAD is strictly an evaluation benchmark; it contains only translated versions of a subset of SQuAD’s development set, with no independent train or test splits. Models are trained on the English SQuAD training set and evaluated zero-shot on each XQuAD language; no fine-tuning is performed on target-language data.
For experimental consistency, input texts should be pre-tokenized using SentencePiece (unigram) for model use, and Moses or Jieba only for reporting corpus statistics. The SQuAD EM/F₁ scoring scripts are recommended for uniformity. Results are reported both per-language and as a macro-average across all ten target languages (Artetxe et al., 2019).
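Given per-language scores, the macro-average is a plain unweighted mean over the ten target languages. A minimal sketch (the `scores` mapping is hypothetical):

```python
def macro_average(scores: dict[str, float]) -> float:
    """Unweighted mean over languages: each language counts equally,
    regardless of its paragraph or token counts."""
    return sum(scores.values()) / len(scores)

# Hypothetical per-language F1 results (illustrative values only).
scores = {"es": 74.3, "de": 70.1, "ru": 68.9}
overall = macro_average(scores)
```

The unweighted mean keeps low-resource languages from being drowned out by languages with longer contexts or more subword tokens.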
6. Selected Examples and Linguistic Alignment
A canonical XQuAD sample illustrates gold-standard span alignment and translation methodology. In English, the context “… derive from various sources, most commonly from *1*burning combustible materials#1# … (called … *2*combustion chamber#2# …). *3*solar#3# energy … toy steam engines … *4*electric#4# heating element.” is mirrored by professional translations into Spanish and Chinese, each preserving the explicit answer delimiters so spans can be extracted robustly. All translated questions are full, semantically faithful renderings of the source English questions.
7. Notable Insights, Challenges, and Best Practices
XQuAD imposes a more demanding evaluation scenario than cross-lingual text classification (e.g., XNLI) by requiring precise span extraction, amplifying zero-shot generalization gaps. Critical design findings include:
- Vocabulary Capacity: Models with large, disjoint subword vocabularies per language demonstrate stronger transfer performance than those with a limited shared vocabulary.
- Embedding Learning Instability: Learning language-specific position embeddings can fail to converge for languages such as Turkish and Hindi, sometimes requiring repeated optimization runs.
- Adapters Effect: Integrating residual adapter modules in transformer layers significantly reduces the transfer gap versus joint multilingual models.
- Noise Injection: Embedding-level Gaussian noise during fine-tuning on English improves robustness and transfer generalization.
- Translation Rigor: The enforced placeholder protocol plus high-quality translation is vital for answer span consistency, establishing a best-practice framework for future multilingual QA dataset development.
- Evaluation Practices: Standardized, span-extracting evaluation scripts, explicit answer delimiters, and per-language breakdowns are essential for reproducible, meaningful benchmarking (Artetxe et al., 2019).
XQuAD constitutes a rigorously aligned, multi-language extension of SQuAD v1.1, setting a robust standard for zero-shot cross-lingual extractive question answering evaluation and underpinning methodological advances in cross-lingual representation learning.