
NewsQs: Multi-Source Question Generation for the Inquiring Mind (2402.18479v2)

Published 28 Feb 2024 in cs.CL

Abstract: We present NewsQs (news-cues), a dataset that provides question-answer pairs for multiple news documents. To create NewsQs, we augment a traditional multi-document summarization dataset with questions automatically generated by a T5-Large model fine-tuned on FAQ-style news articles from the News On the Web corpus. We show that fine-tuning a model with control codes produces questions that are judged acceptable more often than those from the same model without them, as measured through human evaluation. We filter our data with a QNLI model whose predictions correlate strongly with human annotations. We release our final dataset of high-quality questions, answers, and document clusters as a resource for future work in query-based multi-document summarization.
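The QNLI filtering step described above can be sketched as: score each (question, answer) pair with a QNLI model, then keep only pairs whose entailment score clears a threshold. The sketch below stubs out the model call; the function name, the stub scorer, and the 0.9 threshold are illustrative assumptions, not the paper's actual configuration.

```python
def filter_qa_pairs(pairs, qnli_score, threshold=0.9):
    """Keep (question, answer) pairs whose QNLI score clears the threshold.

    `qnli_score` is any callable returning the probability that the answer
    actually answers (entails) the question; in the paper this role is
    played by a QNLI model validated against human annotations.
    """
    return [(q, a) for q, a in pairs if qnli_score(q, a) >= threshold]


# Usage with a stub scorer standing in for a real QNLI model:
pairs = [
    ("What caused the outage?", "A fiber cut disrupted service."),
    ("Who won the match?", "Stocks fell sharply on Monday."),
]
stub = lambda q, a: 0.95 if "outage" in q else 0.2
print(filter_qa_pairs(pairs, stub))  # keeps only the first pair
```

In practice the scorer would wrap a fine-tuned sequence-classification model; decoupling the scoring callable from the filter makes the threshold easy to tune against human judgments.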

Authors (8)
  1. Alyssa Hwang
  2. Kalpit Dixit
  3. Miguel Ballesteros
  4. Yassine Benajiba
  5. Vittorio Castelli
  6. Markus Dreyer
  7. Mohit Bansal
  8. Kathleen McKeown