
Multi-News: a Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model (1906.01749v3)

Published 4 Jun 2019 in cs.CL

Abstract: Automatic generation of summaries from multiple news articles is a valuable tool as the number of online publications grows rapidly. Single document summarization (SDS) systems have benefited from advances in neural encoder-decoder models thanks to the availability of large datasets. However, multi-document summarization (MDS) of news articles has been limited to datasets of a couple of hundred examples. In this paper, we introduce Multi-News, the first large-scale MDS news dataset. Additionally, we propose an end-to-end model which incorporates a traditional extractive summarization model with a standard SDS model and achieves competitive results on MDS datasets. We benchmark several methods on Multi-News and release our data and code in the hope that this work will promote advances in summarization in the multi-document setting.

Multi-News: A Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model

The paper introduces Multi-News, the first large-scale dataset for multi-document summarization (MDS) in the news domain, addressing a long-standing gap in NLP. While single-document summarization (SDS) has benefited from large training corpora, MDS has lagged behind: existing datasets were too small to exploit recent advances in neural models. Multi-News consists of 56,216 article-summary pairs, substantially surpassing previous benchmarks such as DUC 2004 and TAC 2011, which contained under 100 document clusters each.
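
As a concrete starting point, the dataset is mirrored on the Hugging Face Hub; the sketch below assumes the `multi_news` dataset card with its `document`/`summary` fields and the `"|||||"` article separator, which is how the mirror is organized (not necessarily the authors' original release format):

```python
from datasets import load_dataset

# Load the Multi-News mirror from the Hugging Face Hub
# (assumes the "multi_news" dataset card with train/validation/test splits).
ds = load_dataset("multi_news", split="train")

example = ds[0]
# Each example pairs a cluster of source articles with one human-written summary;
# the articles are concatenated into a single string, separated by "|||||".
articles = [a.strip() for a in example["document"].split("|||||") if a.strip()]
summary = example["summary"]

print(f"{len(articles)} source articles; summary length: {len(summary.split())} words")
```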

The dataset was curated from newser.com and features professional, human-written summaries covering diverse documents and sources, which is critical for training robust models. Beyond the dataset itself, the paper's key contribution is a new abstractive hierarchical model, Hi-MAP (Hierarchical MMR-Attention Pointer-generator), which combines a traditional extractive signal with a standard neural abstractive model, using Maximal Marginal Relevance (MMR) to manage redundancy and relevance across documents.
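
The paper integrates MMR scores into the pointer-generator's attention mechanism; as a simpler illustration of the underlying idea, here is a minimal sketch of classic greedy MMR sentence selection over TF-IDF vectors. The vectorizer, the λ value, and the selection budget are illustrative choices, not the paper's exact formulation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mmr_select(sentences, query, k=5, lam=0.7):
    """Greedy Maximal Marginal Relevance: balance relevance to the query
    against redundancy with already-selected sentences."""
    vec = TfidfVectorizer().fit(sentences + [query])
    S = vec.transform(sentences)
    q = vec.transform([query])
    relevance = cosine_similarity(S, q).ravel()  # Sim1(s_i, query)
    pairwise = cosine_similarity(S)              # Sim2(s_i, s_j)

    selected, candidates = [], list(range(len(sentences)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            # Penalize similarity to anything already selected.
            redundancy = max(pairwise[i][j] for j in selected) if selected else 0.0
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]
```

Hi-MAP uses this relevance-minus-redundancy trade-off at the word/attention level rather than as a post-hoc sentence filter, but the scoring principle is the same.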

The authors benchmark a range of methods on Multi-News, including extractive models such as LexRank and TextRank and neural architectures such as the pointer-generator network and CopyTransformer. The results show that Hi-MAP achieves competitive ROUGE scores relative to prior state-of-the-art methods.
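
ROUGE comparisons like these are typically computed with a reference implementation. As a hedged example, Google's `rouge-score` package (an assumption here; the authors may have used the original Perl toolkit, which is also the only one that computes the R-SU skip-bigram variant reported in the paper) reproduces the R-1/R-2/R-L style metrics:

```python
from rouge_score import rouge_scorer

# R-1, R-2, and R-L between a reference summary and a system output.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "the cat sat on the mat"
prediction = "the cat lay on the mat"
scores = scorer.score(reference, prediction)
for name, s in scores.items():
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F={s.fmeasure:.3f}")
```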

Key Insights and Numerical Results

  1. Dataset Characteristics: Multi-News comprises over 56,000 pairs with an average of 2,103 words per document and 263 words per summary. It showcases significant diversity, drawing from more than 1,500 distinct news sources.
  2. Novelty in Abstraction: Around 17.76% of unigrams in the reference summaries do not appear in the source documents, indicating that the dataset rewards both extractive and abstractive capabilities (see the sketch after this list for how such a ratio can be computed).
  3. Performance Metrics: Hi-MAP achieved higher R-2 and R-SU scores than the CopyTransformer, an improvement attributed to better handling of cross-document redundancy.
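
The novel-unigram figure in point 2 is straightforward to approximate. A minimal sketch follows; whitespace tokenization and lowercasing are simplifying assumptions here, so exact numbers will differ from the paper's:

```python
def novel_unigram_ratio(source: str, summary: str) -> float:
    """Fraction of summary unigrams that never appear in the source documents."""
    source_vocab = set(source.lower().split())
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0
    novel = sum(1 for tok in summary_tokens if tok not in source_vocab)
    return novel / len(summary_tokens)

# Averaged over the whole dataset, the paper reports ~17.76% novel unigrams.
print(novel_unigram_ratio("the cat sat on the mat", "a cat rested on a mat"))
```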

Implications and Future Directions

The introduction of Multi-News marks a substantial advance in MDS research, paving the way for deep learning models that synthesize comprehensive summaries from varied document clusters. The dataset's scale makes it feasible to train transformer-based models and other data-hungry architectures in future work.

Practically, this work has clear implications for real-world media summarization, helping readers digest large volumes of news efficiently. Theoretically, it enables the study of document interrelations, coherence maintenance, and narrative consistency in machine-generated text.

As NLP continues to evolve, more sophisticated sequence-to-sequence models and better modeling of inter-document context remain promising directions. Future research could also scale input lengths and improve coherence in longer summaries, addressing challenges in summarization fidelity and factual consistency.

In summary, the paper contributes significantly to the MDS domain by providing a comprehensive dataset and a novel model architecture, both of which have the potential to catalyze further advancements in understanding and generating cohesive, contextually rich summaries from multiple document sources.

Authors (5)
  1. Alexander R. Fabbri
  2. Irene Li
  3. Tianwei She
  4. Suyi Li
  5. Dragomir R. Radev
Citations (523)