- The paper presents a multilingual QA dataset of 96M Q&A pairs across 75 languages, advancing dense retrieval with innovative filtering and duplicate detection methods.
- It leverages schema.org annotations along with fastText and XLM-RoBERTa for effective language detection and question type classification.
- Experiments demonstrate that fine-tuning retrieval models on WebFAQ leads to substantial improvements over baselines like BM25 in zero-shot settings.
WebFAQ: A Multilingual Collection of Natural Q&A Datasets for Dense Retrieval
Introduction
The paper "WebFAQ: A Multilingual Collection of Natural Q&A Datasets for Dense Retrieval" presents a large-scale, open-domain dataset collection of question-answer pairs. Derived from schema.org annotations, WebFAQ comprises 96 million QA pairs across 75 languages, with a significant portion representing non-English samples. The dataset is designed to advance research in multilingual dense retrieval models, offering refined benchmarks through extensive filtering and duplicate detection.
Dataset Construction and Structure
WebFAQ capitalizes on the structured schema.org annotations prevalent on FAQ-style web pages, facilitating the extraction of natural QA pairs from Common Crawl snapshots. Because these annotations embed rich metadata, WebFAQ not only spans 75 languages but also labels QA pairs by topic and question type, yielding a multidimensional dataset for diverse retrieval tasks.
Figure 1: Exemplary FAQ entries across languages.
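To make the extraction step concrete, the sketch below pulls question-answer pairs from a single page's schema.org FAQPage JSON-LD markup. It is an illustration only: the paper processes Common Crawl snapshots at scale rather than individual HTML documents, and schema.org annotations can also appear as Microdata or RDFa.

```python
import json
from bs4 import BeautifulSoup

def extract_faq_pairs(html: str) -> list[tuple[str, str]]:
    """Extract (question, answer) pairs from schema.org FAQPage JSON-LD."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(tag.string or "")
        except json.JSONDecodeError:
            continue  # skip malformed annotations, common in web data
        # JSON-LD may hold a single object or a list of objects.
        for item in data if isinstance(data, list) else [data]:
            if item.get("@type") != "FAQPage":
                continue
            for entity in item.get("mainEntity", []):
                question = entity.get("name", "")
                answer = entity.get("acceptedAnswer", {}).get("text", "")
                if question and answer:
                    pairs.append((question, answer))
    return pairs
```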
Processing and Challenges
The dataset encompasses a wide array of topics, requiring careful preprocessing to manage near-duplicate questions and generic answers. Language detection employs fastText to classify the extracted pairs into 75 languages, while question type classification relies on a fine-tuned XLM-RoBERTa model to ensure robustness across multilingual domains.
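A minimal sketch of the language detection step, assuming the publicly available lid.176.bin fastText language identification model (the paper's exact model and confidence thresholds are not reproduced here):

```python
import fasttext

# Pretrained fastText language identifier covering 176 languages;
# download from https://fasttext.cc/docs/en/language-identification.html
model = fasttext.load_model("lid.176.bin")

questions = [
    "How long does shipping take?",
    "Wie lange dauert der Versand?",
    "Combien de temps dure la livraison ?",
]
for q in questions:
    labels, probs = model.predict(q, k=1)
    lang = labels[0].replace("__label__", "")  # e.g. "en", "de", "fr"
    print(f"{lang} ({probs[0]:.2f}): {q}")
```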
Figure 2: Distribution of question types with examples.
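Question type classification can be framed as standard sequence classification on top of XLM-RoBERTa. The checkpoint name below is a placeholder: the paper fine-tunes its own classifier, which is not assumed to be published under this name.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder identifier for an XLM-RoBERTa model fine-tuned on
# question-type labels (e.g. "what", "how", "why", ...).
CHECKPOINT = "xlm-r-question-type"  # hypothetical, not a released artifact

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)
model.eval()

inputs = tokenizer("Wie lange dauert der Versand?",
                   return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(dim=-1).item()])
```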
The application of rigorous filtering mechanisms, such as semantic consistency checks and near-duplicate detection, significantly refines the dataset, ensuring high-quality QA pairs suitable for dense retrieval models.
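The paper's exact deduplication pipeline is not reproduced here, but MinHash-based locality-sensitive hashing is one common way to find near-duplicate questions at this scale; a sketch using the datasketch library:

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Hash a question's token set into a MinHash signature."""
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf8"))
    return m

questions = {
    "q1": "How long does shipping take?",
    "q2": "How long does the shipping take?",   # near-duplicate of q1
    "q3": "Which payment methods do you accept?",
}

# Jaccard-similarity threshold above which two questions count as duplicates.
lsh = MinHashLSH(threshold=0.8, num_perm=128)
signatures = {key: minhash(text) for key, text in questions.items()}
for key, sig in signatures.items():
    lsh.insert(key, sig)

print(lsh.query(signatures["q1"]))  # typically ['q1', 'q2']
```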
Evaluation of Dense Retrieval Models
Fine-tuning experiments demonstrate WebFAQ's utility in enhancing retrieval performance across languages. Pretraining XLM-RoBERTa models on MS MARCO followed by WebFAQ fine-tuning yields notable performance improvements on zero-shot retrieval benchmarks.
The experiments reveal substantial gains over established baselines like BM25 in monolingual contexts and underscore the diversity and richness of WebFAQ as a resource for multilingual retrieval tasks.
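As a rough sketch of such a fine-tuning step (the paper's exact hyperparameters, MS MARCO pretraining stage, and data loading are omitted), question-answer pairs can be trained with an in-batch negatives objective using the sentence-transformers library:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses, models

# Build a dense retriever from XLM-RoBERTa with mean pooling.
word_embedding = models.Transformer("xlm-roberta-base", max_seq_length=256)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding, pooling])

# Toy stand-in for WebFAQ (question, answer) pairs.
pairs = [
    ("How long does shipping take?", "Standard shipping takes 3-5 business days."),
    ("Which payment methods do you accept?", "We accept credit cards and PayPal."),
]
train_examples = [InputExample(texts=[q, a]) for q, a in pairs]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives: every other answer in the batch serves as a negative.
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_loader, train_loss)],
          epochs=1, warmup_steps=10)
```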
Bilingual QA Alignment
The paper introduces a novel bilingual corpus through bitext mining that combines LaBSE sentence embeddings for candidate alignment with GEMBA, an LLM-based translation quality metric, for filtering. The approach aligns QA pairs across languages, capitalizing on the inherently multilingual nature of FAQs, which frequently present congruent information across language versions.
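A minimal sketch of the alignment idea, assuming the LaBSE checkpoint from the sentence-transformers hub (the GEMBA-based LLM quality check applied on top of these candidates is not shown):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

english = ["How long does shipping take?", "Which payment methods do you accept?"]
german = ["Welche Zahlungsarten akzeptieren Sie?", "Wie lange dauert der Versand?"]

emb_en = model.encode(english, convert_to_tensor=True, normalize_embeddings=True)
emb_de = model.encode(german, convert_to_tensor=True, normalize_embeddings=True)

# Cosine similarity between every English/German question pair;
# each English question is matched to its highest-scoring German candidate.
scores = util.cos_sim(emb_en, emb_de)
for i, q in enumerate(english):
    j = scores[i].argmax().item()
    print(f"{q}  <->  {german[j]}  ({scores[i][j]:.2f})")
```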
This bilingual dataset not only surpasses existing resources in translation quality but also extends the applicability of WebFAQ by offering a robust basis for cross-lingual retrieval tasks.
Implications and Future Directions
WebFAQ addresses critical gaps in multilingual dense retrieval by offering a comprehensive, open-access dataset that encourages innovation in cross-lingual NLP research. The introduction of bilingual QA corpora enables a suite of applications, from language modeling to machine translation evaluation.
Future research may expand language coverage, explore enhanced bitext mining techniques, and further refine retrieval performance by integrating emerging NLP models. The WebFAQ resources set the stage for significant advances in multilingual information retrieval, transcending traditional language barriers toward more inclusive, diverse AI systems.
Conclusion
WebFAQ is a valuable contribution to open-domain QA retrieval, facilitating improvements in multilingual dense retrieval systems through large-scale, high-quality data. Its open availability on platforms like GitHub allows for broad accessibility and subsequent advancements in the NLP community. The initiative paves the way for more sophisticated and inclusive AI models capable of handling the intricacies of multilingual question answering in real-world scenarios.