M2QA: Multi-domain Multilingual Question Answering (2407.01091v1)

Published 1 Jul 2024 in cs.CL

Abstract: Generalization and robustness to input variation are core desiderata of machine learning research. Language varies along several axes, most importantly, language instance (e.g. French) and domain (e.g. news). While adapting NLP models to new languages within a single domain, or to new domains within a single language, is widely studied, research in joint adaptation is hampered by the lack of evaluation datasets. This prevents the transfer of NLP systems from well-resourced languages and domains to non-dominant language-domain combinations. To address this gap, we introduce M2QA, a multi-domain multilingual question answering benchmark. M2QA includes 13,500 SQuAD 2.0-style question-answer instances in German, Turkish, and Chinese for the domains of product reviews, news, and creative writing. We use M2QA to explore cross-lingual cross-domain performance of fine-tuned models and state-of-the-art LLMs and investigate modular approaches to domain and language adaptation. We witness 1) considerable performance variations across domain-language combinations within model classes and 2) considerable performance drops between source and target language-domain combinations across all model sizes. We demonstrate that M2QA is far from solved, and new methods to effectively transfer both linguistic and domain-specific information are necessary. We make M2QA publicly available at https://github.com/UKPLab/m2qa.

An Overview of M2QA: Multi-domain Multilingual Question Answering

The paper "M2QA: Multi-domain Multilingual Question Answering" addresses a pertinent gap in NLP by introducing M2QA, a comprehensive benchmark designed to evaluate NLP models on their ability to perform multi-domain, multilingual question answering. The dataset includes 13,500 question-answer instances similar to SQuAD 2.0, focusing on three languages (German, Turkish, Chinese) and three domains (product reviews, news, creative writing).

Core Contributions

M2QA's contributions are as follows:

  1. Dataset Creation:
    • The authors curated 1,500 manually annotated question-answer pairs for each language-domain combination. The contexts are naturally occurring texts rather than translations, preserving linguistic and cultural nuances.
  2. Evaluation Protocols:
    • The dataset facilitates rigorous evaluation by covering typologically distinct languages and diverse domains, and the authors highlight performance variability across language-domain combinations.
  3. Baselines and Modular Setups:
    • The authors investigate multiple models including fine-tuned baselines (XLM-R) and advanced LLMs (GPT-3.5, Llama 2/3, Aya 23) using both zero-shot and few-shot learning paradigms.
  4. Modular Transfer Learning:
    • They investigate two modular setups, MAD-X and MAD-X+Domain, which incorporate language and domain adapters to explore transfer learning (a minimal adapter-composition sketch follows this list).
  5. Extensive Analysis:
    • The paper presents an insightful breakdown of how model performance varies significantly across language and domain pairings, calling attention to unsolved challenges in both linguistic and domain-specific generalization.
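
As a rough illustration of how such modular setups compose adapters, the sketch below stacks a pretrained language adapter and a newly trained QA task adapter on a frozen XLM-R backbone using the adapters library (the successor of adapter-transformers). The adapter identifier, head name, and the note on a domain adapter are illustrative assumptions, not the paper's exact training recipe.

```python
# Minimal MAD-X-style composition sketch with the `adapters` library.
# Adapter identifiers and names below are illustrative placeholders.
from adapters import AutoAdapterModel
from adapters.composition import Stack

model = AutoAdapterModel.from_pretrained("xlm-roberta-base")

# Load a pretrained language adapter (e.g., German) from AdapterHub.
lang_adapter = model.load_adapter("de/wiki@ukp")

# Add a task adapter plus a span-extraction head for extractive QA;
# train_adapter freezes the backbone so only the new adapter is updated.
model.add_adapter("qa")
model.add_qa_head("qa")
model.train_adapter("qa")

# MAD-X composition: route activations through the language adapter, then the task adapter.
model.active_adapters = Stack(lang_adapter, "qa")

# For a MAD-X+Domain-style setup, a domain adapter (pretrained on unlabeled
# in-domain text) would additionally be inserted into the stack before the task adapter.
```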

Experiment Results and Insights

Performance Assessment

The authors provide a detailed evaluation using their newly proposed benchmark:

  • LLMs: GPT-3.5-turbo-0613 achieved the best overall performance with an F1 score of 53.11, closely followed by Aya 23 (F1 51.61); even the strongest LLMs therefore leave substantial headroom when transferring to less-resourced languages and varied domains (a generic prompting sketch follows this list).
  • Modular Approaches: The MAD-X and MAD-X+Domain setups showed promise, notably reducing training cost while remaining competitive with the XLM-R baselines. However, they suffered performance drops on typologically distinct languages such as Chinese, exacerbated by the lack of whitespace segmentation in Chinese text.
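
The paper's prompt templates are not reproduced here; the snippet below is only a generic sketch of how zero-shot extractive QA with a chat LLM might be set up, with the instruction wording and model name as assumptions.

```python
# Generic zero-shot prompt for SQuAD 2.0-style extractive QA with a chat LLM.
# The instruction wording and model name are illustrative, not the paper's prompts.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_extractively(context: str, question: str) -> str:
    prompt = (
        "Answer the question with a contiguous span copied verbatim from the context. "
        "If the context does not contain the answer, reply with an empty string.\n\n"
        f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```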

Numerical Results

The reported results underscore critical gaps:

  • XLM-R reaches an average F1 of 37.73, while a variant with additional domain-specific pre-finetuning reaches a marginally lower F1 of 36.36, suggesting that intermediate domain training in isolation offers limited benefit.
  • The MAD-X and MAD-X+Domain setups achieve comparable performance across domains, with MAD-X being particularly attractive because it requires fewer training steps (250,000 vs. 1,000,000) and therefore trains more economically.

Analytical Perspective

The findings elucidate the complexities of handling joint language and domain generalization:

  • Tokenization Issues: The SQuAD evaluation metric's reliance on whitespace tokenization leads to substantial performance drops for languages like Chinese, highlighting the need for adapted metrics (illustrated in the sketch after this list).
  • Model Adaptivity: Modular approaches such as MAD-X, applied jointly for domain and language adaptation, show potential to encapsulate distinct pieces of linguistic and domain knowledge efficiently, albeit with mixed success across languages.
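
To make the tokenization issue concrete, the standard SQuAD F1 metric splits prediction and gold answer on whitespace and scores token overlap; for unsegmented scripts such as Chinese, the whole answer collapses into a single token and the metric degenerates to exact match. The snippet below is a simplified version of that metric (omitting SQuAD's answer normalization).

```python
# Simplified SQuAD token-level F1 (omits lowercasing, punctuation and article stripping).
from collections import Counter

def squad_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.split()  # whitespace tokenization
    gold_tokens = gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# English: overlapping spans receive partial credit.
print(squad_f1("the red Eiffel Tower", "the Eiffel Tower"))  # ~0.86

# Chinese without whitespace: each answer is a single "token", so any deviation
# drops the score to 0, effectively turning F1 into exact match.
print(squad_f1("埃菲尔铁塔在巴黎", "埃菲尔铁塔"))  # 0.0
```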

Implications and Future Directions

Practical and Theoretical Implications:

From a practical perspective, M2QA enables nuanced evaluation of NLP models across diverse linguistic and cultural contexts. From a theoretical perspective, its comprehensive baselines and modular setups provide grounded evidence for inquiries into modular transfer learning.

Future Directions:

  • Dataset Expansion: Expanding M2QA in terms of additional low-resource languages and underrepresented domains remains pivotal.
  • Adapter Efficiency: Enhanced optimization techniques for adapter-based learning could further refine the balance between performance and computational costs.
  • Human Benchmarking: Establishing human performance baselines would contextualize the progress of automated systems toward human-level comprehension and reasoning in multilingual, multi-domain contexts.

In conclusion, M2QA serves as a benchmark that significantly advances our ability to evaluate and improve NLP models’ cross-lingual and cross-domain generalization capabilities. The insights derived from the thorough experiments and analyses in this paper lay foundational work that will drive future research towards more robust and versatile language understanding systems.

Authors (6)
  1. Leon Engländer (2 papers)
  2. Hannah Sterz (5 papers)
  3. Clifton Poth (6 papers)
  4. Jonas Pfeiffer (34 papers)
  5. Ilia Kuznetsov (19 papers)
  6. Iryna Gurevych (264 papers)
Citations (1)