A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal (2005.10070v1)

Published 20 May 2020 in cs.CL

Abstract: Multi-document summarization (MDS) aims to compress the content in large document collections into short summaries and has important applications in story clustering for newsfeeds, presentation of search results, and timeline generation. However, there is a lack of datasets that realistically address such use cases at a scale large enough for training supervised models for this task. This work presents a new dataset for MDS that is large both in the total number of document clusters and in the size of individual clusters. We build this dataset by leveraging the Wikipedia Current Events Portal (WCEP), which provides concise and neutral human-written summaries of news events, with links to external source articles. We also automatically extend these source articles by looking for related articles in the Common Crawl archive. We provide a quantitative analysis of the dataset and empirical results for several state-of-the-art MDS techniques.

Authors (5)
  1. Demian Gholipour Ghalandari (8 papers)
  2. Chris Hokamp (14 papers)
  3. Nghia The Pham (6 papers)
  4. John Glover (8 papers)
  5. Georgiana Ifrim (26 papers)
Citations (101)
