
TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages (2003.05002v1)

Published 10 Mar 2020 in cs.CL and cs.LG

Abstract: Confidently making progress on multilingual modeling requires challenging, trustworthy evaluations. We present TyDi QA---a question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs. The languages of TyDi QA are diverse with regard to their typology---the set of linguistic features each language expresses---such that we expect models performing well on this set to generalize across a large number of the world's languages. We present a quantitative analysis of the data quality and example-level qualitative linguistic analyses of observed language phenomena that would not be found in English-only corpora. To provide a realistic information-seeking task and avoid priming effects, questions are written by people who want to know the answer, but don't know the answer yet, and the data is collected directly in each language without the use of translation.

Citations (543)

Summary

  • The paper presents a robust benchmark dataset with 204K question-answer pairs to evaluate multilingual QA systems across 11 diverse languages.
  • It collects native language data without translation, ensuring authentic challenges and complex relationships between questions and answers.
  • Baseline evaluations using mBERT reveal substantial performance gaps, highlighting the need for more advanced modeling techniques.

TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages

TyDi QA represents a significant effort to advance multilingual question answering (QA) by introducing a dataset that spans 11 typologically diverse languages and 204,000 question-answer pairs. The primary aim is a robust evaluation benchmark that both challenges existing models and tests their ability to generalize across a wide spectrum of the world's languages.

Key Objectives and Methodology

The authors delineate two principal objectives: developing QA systems that perform well across as many as 100 languages, and encouraging research into models that handle diverse linguistic phenomena. The dataset is collected directly in each language, without translation, to avoid the biases introduced by translationese and to offer a more authentic representation of native language data.

In TyDi QA, questions are written by people seeking information they do not yet know, which avoids the high lexical overlap between question and answer passage typical of datasets like SQuAD. This design yields more complex relationships between questions and answers, making the dataset inherently more challenging.

Features and Linguistic Diversity

TyDi QA's languages were selected for their typological distinctions and varying data availability, allowing for a comprehensive evaluation scenario. They include Arabic, Bengali, Finnish, Japanese, Kiswahili, and Korean, among others, each contributing distinct linguistic features, such as agglutination in Korean or complex inflectional morphology in Kiswahili.

Notably, the inclusion of languages with limited parallel data (e.g., Kiswahili, Bengali) is particularly important for assessing model generalization capabilities in real-world scenarios, where translation data may not be readily available.

Evaluation and Results

The dataset introduces two primary tasks, Passage Selection and Minimal Answer Span, with F1 as the main evaluation metric. A secondary Gold Passage task simplifies evaluation by focusing on pre-selected passage content. Baseline results using mBERT show substantial gaps relative to an estimate of human performance, underlining the dataset's difficulty and the need for more sophisticated modeling approaches.
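To make the span scoring concrete, here is a minimal sketch of SQuAD-style token-level F1 between a predicted and a gold answer span. This is a simplification for illustration: the official TyDi QA evaluation operates on byte offsets and handles each language's segmentation differently, so `span_f1` below is a hypothetical helper, not the official scorer.

```python
from collections import Counter

def span_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer span.

    Hypothetical simplification of minimal-answer scoring:
    whitespace tokenization instead of byte offsets.
    """
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    # If either span is empty, F1 is 1.0 only when both are empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most
    # as often as it appears in both spans.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(round(span_f1("the red fox", "a red fox"), 3))  # 0.667
```

Precision rewards predictions that contain no extra tokens; recall rewards covering the full gold span; F1 balances the two.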

Lexical-overlap analysis across TyDi QA and other datasets, such as MLQA and XQuAD, reveals markedly lower overlap in TyDi QA, indicative of its complexity and the more nuanced relationship between questions and answers.
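As an illustration of what such an analysis measures, the sketch below computes the fraction of question tokens that also appear in the answer context. This is a rough proxy assumed for illustration (the paper's actual methodology differs in detail); a lower value means a model cannot rely on simple string matching to locate the answer.

```python
def lexical_overlap(question: str, answer_context: str) -> float:
    """Fraction of distinct question tokens that also appear in the
    answer context (hypothetical proxy for question-answer overlap)."""
    q_tokens = set(question.lower().split())
    ctx_tokens = set(answer_context.lower().split())
    if not q_tokens:
        return 0.0
    return len(q_tokens & ctx_tokens) / len(q_tokens)

# Translated questions tend to echo the source passage's wording;
# natively written information-seeking questions overlap less.
print(lexical_overlap("when was Helsinki founded",
                      "Helsinki was founded in 1550"))  # 0.75
```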

Implications and Future Work

The introduction of TyDi QA sets the stage for several avenues of exploration in multilingual modeling, including morphology's impact on question-answer matching, the role of machine translation for data augmentation, and the effectiveness of zero-shot learning.

The dataset invites researchers to engage with typological challenges and contributes significantly to the theoretical understanding of language representation in AI. Moreover, TyDi QA encourages the development of methods that incorporate translation while acknowledging the scarcity of parallel data for certain languages.

Conclusion

Through TyDi QA, the authors deliver a critical resource for evaluating and advancing multilingual QA systems. Its focus on typological diversity makes progress on this dataset more likely to generalize to the world's linguistic diversity, potentially informing future systems that better handle the nuances of non-English text. As the community builds on this work, it will be important to keep expanding the set of evaluated languages and to engage deeply with the diverse linguistic phenomena the dataset encapsulates.