LMGQS: A Large-scale Dataset for Query-focused Summarization (2305.13086v1)

Published 22 May 2023 in cs.CL

Abstract: Query-focused summarization (QFS) aims to extract or generate a summary of an input document that directly answers or is relevant to a given query. The lack of large-scale datasets in the form of documents, queries, and summaries has hindered model development in this area. In contrast, multiple large-scale high-quality datasets for generic summarization exist. We hypothesize that there is a hidden query for each summary sentence in a generic summarization annotation, and we utilize a large-scale pretrained LLM to recover it. In this way, we convert four generic summarization benchmarks into a new QFS benchmark dataset, LMGQS, which consists of over 1 million document-query-summary samples. We thoroughly investigate the properties of our proposed dataset and establish baselines with state-of-the-art summarization models. By fine-tuning an LLM on LMGQS, we achieve state-of-the-art zero-shot and supervised performance on multiple existing QFS benchmarks, demonstrating the high quality and diversity of LMGQS.
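
The conversion idea in the abstract (one hidden query per summary sentence) can be illustrated with a minimal sketch. This is not the authors' pipeline: the names `QFSSample`, `split_sentences`, `to_qfs_samples`, and `placeholder_query` are hypothetical, the sentence splitter is deliberately naive, and the query generator is a stand-in for the pretrained LLM call, whose prompt and model are not specified in the abstract.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QFSSample:
    """One document-query-summary triple (hypothetical container)."""
    document: str
    query: str
    summary: str


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break on whitespace that follows '.', '!' or '?'.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def to_qfs_samples(
    document: str,
    summary: str,
    generate_query: Callable[[str, str], str],
) -> List[QFSSample]:
    # Treat each summary sentence as the answer to its own hidden query.
    return [
        QFSSample(document, generate_query(document, sentence), sentence)
        for sentence in split_sentences(summary)
    ]


def placeholder_query(document: str, sentence: str) -> str:
    # Assumption: stands in for an LLM that recovers the hidden query.
    return f"What does the document say about: {sentence.rstrip('.!?')}?"


samples = to_qfs_samples(
    "Full article text goes here.",
    "A cat sat. It slept!",
    placeholder_query,
)
```

Passing the query generator as a callable keeps the sketch independent of any particular model or API; a real LLM-backed function could be swapped in for `placeholder_query` without changing the rest.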

Authors (8)
  1. Ruochen Xu (35 papers)
  2. Song Wang (313 papers)
  3. Yang Liu (2253 papers)
  4. Shuohang Wang (69 papers)
  5. Yichong Xu (42 papers)
  6. Dan Iter (16 papers)
  7. Chenguang Zhu (100 papers)
  8. Michael Zeng (76 papers)
Citations (4)