CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training (2110.07731v2)

Published 14 Oct 2021 in cs.CL and cs.LG

Abstract: With the rise of large-scale pre-trained language models, open-domain question answering (ODQA) has become an important research topic in NLP. Based on the popular pre-training/fine-tuning approach, we posit that an additional in-domain pre-training stage using a large-scale, natural, and diverse question-answering (QA) dataset can be beneficial for ODQA. Consequently, we propose a novel QA dataset based on the Common Crawl project in this paper. Using the readily available schema.org annotations, we extract around 130 million multilingual question-answer pairs, including about 60 million English data points. With this previously unseen number of natural QA pairs, we pre-train popular language models to show the potential of large-scale in-domain pre-training for the task of question answering. In our experiments, we find that pre-training question-answering models on our Common Crawl Question Answering dataset (CCQA) achieves promising results in zero-shot, low-resource, and fine-tuned settings across multiple tasks, models, and benchmarks.
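To illustrate the kind of schema.org-based extraction the abstract describes, the sketch below pulls Question/Answer pairs from the JSON-LD markup of a single HTML page. This is only an assumption-laden illustration, not the authors' pipeline: the real CCQA extraction runs over Common Crawl archives at scale and handles further markup variants, and the function name `extract_qa_pairs` is hypothetical.

```python
import json
import re

# Minimal sketch: find <script type="application/ld+json"> blocks in one HTML
# document and yield (question, answer) text pairs from schema.org Question
# objects (e.g. on QAPage pages). Not the CCQA pipeline itself.
LD_JSON_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)

def extract_qa_pairs(html: str):
    """Yield (question, answer) strings from JSON-LD schema.org annotations."""
    for block in LD_JSON_RE.findall(html):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed JSON-LD blocks
        for item in (data if isinstance(data, list) else [data]):
            if not isinstance(item, dict):
                continue
            # QAPage documents nest the Question under "mainEntity".
            question = item.get("mainEntity", item)
            if not isinstance(question, dict) or question.get("@type") != "Question":
                continue
            q_text = question.get("name") or question.get("text") or ""
            # Accepted and suggested answers may be single objects or lists.
            candidates = []
            for key in ("acceptedAnswer", "suggestedAnswer"):
                value = question.get(key, [])
                candidates.extend(value if isinstance(value, list) else [value])
            for answer in candidates:
                if isinstance(answer, dict) and answer.get("text"):
                    yield q_text, answer["text"]
```

Applied page by page over a web crawl, an extractor along these lines would accumulate natural QA pairs of the sort CCQA aggregates into its roughly 130 million multilingual examples.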

Authors (7)
  1. Patrick Huber (146 papers)
  2. Armen Aghajanyan (31 papers)
  3. Dmytro Okhonko (11 papers)
  4. Wen-tau Yih (84 papers)
  5. Sonal Gupta (26 papers)
  6. Xilun Chen (31 papers)
  7. Barlas Oğuz (18 papers)
Citations (15)