Training Question Answering Models From Synthetic Data
This paper, authored by Raul Puri and colleagues, explores training Question Answering (QA) models on synthetic data and asks how such models perform when benchmarked against human-generated datasets. The motivation is the cost and scarcity of labeled training data, which hinders the development of high-performing QA models. The authors posit that synthetic question-answer pairs, generated by large language models, can close the gap between training on synthetic and on human-generated data.
The paper reports training a QA model exclusively on synthetic data generated by an 8.3-billion-parameter GPT-2 model, reaching 88.4 EM and 93.9 F1 on the SQuAD1.1 dev set. Compared with a baseline trained solely on the SQuAD1.1 training set, the synthetic-only approach matched and in some cases exceeded that performance. The paper also reports a 2.8-point EM improvement on SQuAD2.0 over previous work using synthetic data.
Key to this achievement is a three-step question generation pipeline (hedged code sketches of each step follow the list):
- Answer Generation: A BERT-based span selection model extracts candidate answers from unlabeled text. Unlike a standard extractive QA model, it takes no question as input, so it can propose a broad set of potential answers from each passage.
- Question Generation: A pretrained GPT-2 model, fine-tuned for question generation, produces a question conditioned on the passage and each extracted answer. Question quality improves markedly as the size of the pretrained generator grows.
- Question Filtration: A roundtrip-consistency check filters the synthetic pairs: a QA model answers each generated question over the passage, and the pair is kept only if that answer matches the answer that prompted the question. Overgeneration, producing multiple candidate questions per answer, increases the yield of pairs that survive filtration.
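As a concrete illustration of the answer-generation step, the sketch below defines a minimal BERT encoder with a span-scoring head that proposes answer spans without any question input. This is a hedged sketch, not the authors' released code; the class name `AnswerCandidateExtractor`, the choice of `bert-base-uncased`, and the single linear span head are illustrative assumptions.

```python
# Minimal sketch of an answer-candidate extractor in the spirit of the paper:
# a BERT encoder with start/end heads that score spans of a passage, with no
# question given as input. Illustrative only, not the authors' implementation.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class AnswerCandidateExtractor(nn.Module):
    def __init__(self, encoder_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # One linear head produces a start logit and an end logit per token.
        self.span_head = nn.Linear(hidden, 2)

    def forward(self, input_ids, attention_mask):
        hidden_states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                              # (batch, seq, hidden)
        start_logits, end_logits = self.span_head(hidden_states).split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AnswerCandidateExtractor()
passage = "Nikola Tesla was born in 1856 in Smiljan."
inputs = tokenizer(passage, return_tensors="pt")
start_logits, end_logits = model(inputs["input_ids"], inputs["attention_mask"])
# In practice the heads would be fine-tuned on (passage, answer-span) pairs,
# and the top-scoring start/end combinations taken as answer candidates.
```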
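For the question-generation step, the sketch below prompts a causal language model with a passage and an answer and samples several candidate questions. Plain `gpt2` stands in for the paper's 8.3-billion-parameter model, and the Context/Answer/Question prompt format is an assumption rather than the paper's exact conditioning scheme; in the paper the generator is fine-tuned on SQuAD-style triples before being used this way.

```python
# Hedged sketch of answer-conditional question generation with a causal LM.
# "gpt2" is a small stand-in for the paper's much larger, fine-tuned model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

passage = "Nikola Tesla was born in 1856 in Smiljan."
answer = "1856"
# Illustrative prompt format; the paper's exact input layout may differ.
prompt = f"Context: {passage}\nAnswer: {answer}\nQuestion:"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(
    input_ids,
    max_new_tokens=32,
    do_sample=True,          # sampling gives variety for overgeneration
    top_k=50,
    num_return_sequences=4,  # several candidate questions per answer
    pad_token_id=tokenizer.eos_token_id,
)
for seq in outputs:
    continuation = tokenizer.decode(seq[input_ids.shape[1]:], skip_special_tokens=True)
    print(continuation.strip().split("\n")[0])
```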
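For the filtration step, the sketch below applies a roundtrip-consistency check: an off-the-shelf extractive QA model (here `distilbert-base-cased-distilled-squad`, an assumption standing in for the paper's filtration model) re-answers each generated question, and a pair is kept only when the prediction matches the answer that prompted the question.

```python
# Minimal sketch of roundtrip-consistency filtering over synthetic QA pairs.
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

def roundtrip_consistent(passage: str, question: str, answer: str) -> bool:
    """Keep the pair only if the QA model recovers the generating answer."""
    prediction = qa(question=question, context=passage)["answer"]
    # Exact match after light normalization; other match criteria are possible.
    return prediction.strip().lower() == answer.strip().lower()

passage = "Nikola Tesla was born in 1856 in Smiljan."
candidates = [
    ("In what year was Nikola Tesla born?", "1856"),
    ("Where did Tesla study?", "1856"),  # inconsistent pair, should be dropped
]
kept = [(q, a) for q, a in candidates if roundtrip_consistent(passage, q, a)]
print(kept)
```

Feeding several sampled questions per answer through a filter like this is how the overgeneration strategy raises the yield of usable synthetic pairs.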
The implications of this work are significant: reliance on human-generated datasets could be reduced or even eliminated, allowing QA model training to scale with synthetic question-answer pairs. This could change how QA systems are developed and trained, particularly in domains where labeled data is scarce or costly to obtain.
Practically, the implications reach beyond the SQuAD datasets: the methodologies discussed, particularly the improved utility of large transformer models, could extend to other types of QA, including open-domain, multi-hop, and conversational QA. Synthetic data generation also holds promise for other data-hungry NLP tasks, such as dialogue systems and information retrieval.
Finally, the paper anticipates future research on more advanced filtering and generation techniques, such as finer-grained control over answer types and further scaling of language models. Such directions could lead to QA models that match or exceed those trained on carefully curated human data, opening new avenues in AI and NLP research.