
Prompting-based Synthetic Data Generation for Few-Shot Question Answering (2405.09335v1)

Published 15 May 2024 in cs.CL

Abstract: Although language models (LMs) have boosted the performance of Question Answering, they still need plenty of data. Data annotation, in contrast, is a time-consuming process. This especially applies to Question Answering, where possibly large documents have to be parsed and annotated with questions and their corresponding answers. Furthermore, Question Answering models often only work well for the domain they were trained on. Since annotation is costly, we argue that domain-agnostic knowledge from LMs, such as linguistic understanding, is sufficient to create a well-curated dataset. With this motivation, we show that using large language models can improve Question Answering performance on various datasets in the few-shot setting compared to state-of-the-art approaches. For this, we perform data generation leveraging the Prompting framework, suggesting that language models contain valuable task-agnostic knowledge that can be used beyond the common pre-training/fine-tuning scheme. As a result, we consistently outperform previous approaches on few-shot Question Answering.
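The prompting-based data generation the abstract describes can be sketched roughly as follows: a handful of annotated (context, question, answer) examples are assembled into a few-shot prompt, and a language model is asked to complete a new question-answer pair for an unlabeled context. This is a minimal illustrative sketch, not the paper's actual prompt template or code; the template wording and example texts are assumptions.

```python
# Hedged sketch of few-shot prompt construction for synthetic QA generation.
# The template and examples below are illustrative assumptions, not the
# authors' exact prompt.

def build_qa_generation_prompt(examples, new_context):
    """Assemble a few-shot prompt: each example shows a context with a
    gold question/answer pair; the final context is left for the LM to
    complete with a new synthetic question and answer."""
    parts = []
    for ex in examples:
        parts.append(
            f"Context: {ex['context']}\n"
            f"Question: {ex['question']}\n"
            f"Answer: {ex['answer']}\n"
        )
    # Ends mid-pattern so the model continues with "Question: ... Answer: ...".
    parts.append(f"Context: {new_context}\nQuestion:")
    return "\n".join(parts)

# Hypothetical few-shot examples (placeholders, not from the paper's datasets).
examples = [
    {"context": "Paris is the capital of France.",
     "question": "What is the capital of France?",
     "answer": "Paris"},
]
prompt = build_qa_generation_prompt(examples, "Berlin is the capital of Germany.")
```

The completion returned by the model would then be parsed back into a synthetic question-answer pair and added to the training set for fine-tuning a QA model, which is the general scheme the abstract outlines.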


