Neural Summarization by Extracting Sentences and Words (1603.07252v3)

Published 23 Mar 2016 in cs.CL

Abstract: Traditional approaches to extractive summarization rely heavily on human-engineered features. In this work we propose a data-driven approach based on neural networks and continuous sentence features. We develop a general framework for single-document summarization composed of a hierarchical document encoder and an attention-based extractor. This architecture allows us to develop different classes of summarization models which can extract sentences or words. We train our models on large scale corpora containing hundreds of thousands of document-summary pairs. Experimental results on two summarization datasets demonstrate that our models obtain results comparable to the state of the art without any access to linguistic annotation.

Neural Summarization by Extracting Sentences and Words

Automatic summarization hinges on identifying salient information in large bodies of text. While traditional extractive summarization methods rely heavily on human-engineered features, the paper "Neural Summarization by Extracting Sentences and Words" by Jianpeng Cheng and Mirella Lapata introduces a data-driven neural approach that replaces these labor-intensive features with continuous sentence representations. The shift is significant for NLP: the framework pairs a hierarchical document reader with a neural attention-based extractor, learning what to include in a summary directly from data.

The proposed framework comprises two primary models for summarization, operating either at the sentence level or word level. This dual focus allows the models to capture essential information at varying granularities—either by identifying complete, salient sentences or by extracting and arranging individual words to form coherent summaries. Crucially, the models are trained on large-scale corpora consisting of hundreds of thousands of document-summary pairs, an ambitious move that stands in stark contrast to previous approaches limited by significantly smaller training datasets.

Architectural Overview

The paper proposes a hierarchical document reader designed to process and encode documents at multiple levels of granularity. The process starts with a convolutional neural network (CNN) that encodes sentences from sequences of words. The final sentence representations are then composed into document vectors using a recurrent neural network (RNN) with Long Short-Term Memory (LSTM) units, capturing both local and global sentential information. This hierarchical architecture ensures that the document’s structure is inherently reflected in the model’s representations.
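
As a concrete illustration, the following is a minimal PyTorch sketch of such a hierarchical reader. The module name, layer sizes, kernel width, and the use of max-over-time pooling after the convolution are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class HierarchicalDocumentReader(nn.Module):
    """Encodes a document hierarchically: CNN over words -> sentence vectors -> LSTM over sentences."""

    def __init__(self, vocab_size, emb_dim=200, sent_dim=300, doc_dim=400, kernel_width=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # Convolution over word embeddings yields a fixed-size sentence vector.
        self.conv = nn.Conv1d(emb_dim, sent_dim, kernel_size=kernel_width,
                              padding=kernel_width // 2)
        # Recurrent composition of sentence vectors into document-level states.
        self.doc_lstm = nn.LSTM(sent_dim, doc_dim, batch_first=True)

    def encode_sentences(self, word_ids):
        # word_ids: (num_sentences, max_words)
        emb = self.embed(word_ids).transpose(1, 2)   # (S, emb_dim, W)
        feats = torch.relu(self.conv(emb))           # (S, sent_dim, W)
        sent_vecs, _ = feats.max(dim=2)              # max-over-time pooling -> (S, sent_dim)
        return sent_vecs

    def forward(self, word_ids):
        sent_vecs = self.encode_sentences(word_ids)  # (S, sent_dim)
        # Treat the document as a batch of one sequence of sentences.
        doc_states, (h_n, _) = self.doc_lstm(sent_vecs.unsqueeze(0))
        # Per-sentence vectors, per-step document states, and a final document vector.
        return sent_vecs, doc_states.squeeze(0), h_n[-1]


# Toy usage: a 4-sentence "document" with 6 word ids per sentence.
reader = HierarchicalDocumentReader(vocab_size=1000)
doc = torch.randint(1, 1000, (4, 6))
sent_vecs, doc_states, doc_vec = reader(doc)
print(sent_vecs.shape, doc_states.shape, doc_vec.shape)
```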

For extractive summarization at the sentence level, an attention-based sentence extractor is utilized. Unlike standard soft attention, which blends the encoder's hidden states into a context vector for generating the next word, this extractor attends to sentences directly and labels them sequentially, accounting for both their informativeness and their redundancy with respect to sentences already selected. Training follows a curriculum learning strategy that gradually replaces gold extraction labels with the model's own previous predictions, so that decisions on sentence relevance become progressively better aligned with what the model will actually see at test time.
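
A simplified sketch of this sequential labeling with curriculum-style label mixing is shown below; the scoring function and the way the previous decision gates the next input are assumptions made for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class SentenceExtractor(nn.Module):
    """Sequentially scores sentences for extraction, conditioned on previous decisions."""

    def __init__(self, sent_dim=300, hidden_dim=400):
        super().__init__()
        # Input at each step: the sentence vector scaled by the previous extraction decision.
        self.rnn = nn.LSTMCell(sent_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim + sent_dim, 1)

    def forward(self, sent_vecs, gold_labels=None, teacher_prob=1.0):
        # sent_vecs: (num_sentences, sent_dim); gold_labels: (num_sentences,) in {0, 1} or None
        h = sent_vecs.new_zeros(1, self.rnn.hidden_size)
        c = sent_vecs.new_zeros(1, self.rnn.hidden_size)
        prev_p = sent_vecs.new_ones(1)   # probability that the previous sentence was kept
        probs = []
        for t, s in enumerate(sent_vecs):
            h, c = self.rnn((prev_p * s).unsqueeze(0), (h, c))
            p = torch.sigmoid(self.score(torch.cat([h, s.unsqueeze(0)], dim=-1))).squeeze()
            probs.append(p)
            # Curriculum: mix gold labels and the model's own predictions during training.
            if gold_labels is not None and torch.rand(()) < teacher_prob:
                prev_p = gold_labels[t].float().view(1)
            else:
                prev_p = p.detach().view(1)
        return torch.stack(probs)        # extraction probability per sentence


# Toy usage with random sentence vectors: probabilities of keeping each of 4 sentences.
extractor = SentenceExtractor()
sent_vecs = torch.randn(4, 300)
gold = torch.tensor([1, 0, 0, 1])
print(extractor(sent_vecs, gold_labels=gold, teacher_prob=0.75))
```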

In contrast, the word extractor formulates summarization as a conditional language generation task restricted to the document's vocabulary—a strategy that sidesteps the difficulties of generating under an open vocabulary. This hierarchical attention model first attends to sentences and then to words, creating the summary in a way akin to extraction but executed with a language generation model.
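
The vocabulary restriction itself is easy to illustrate: at each decoding step, the next-word distribution can be renormalized over only the words that appear in the source document. The sketch below shows that masking step in isolation; the function name and toy vocabulary are hypothetical.

```python
import torch
import torch.nn.functional as F

def document_restricted_distribution(logits, doc_word_ids):
    """Renormalize next-word logits over only the words occurring in the source document.

    logits: (vocab_size,) unnormalized scores from one decoder step.
    doc_word_ids: 1-D tensor of word ids present in the document.
    """
    mask = torch.full_like(logits, float('-inf'))
    mask[doc_word_ids] = 0.0                   # allow only in-document words
    return F.softmax(logits + mask, dim=-1)    # out-of-document words get zero probability


# Toy usage: a 10-word vocabulary, a document containing word ids 2, 5, and 7.
logits = torch.randn(10)
dist = document_restricted_distribution(logits, torch.tensor([2, 5, 7]))
print(dist)                                    # nonzero mass only at positions 2, 5, 7
```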

Practical Implications and Results

The efficacy of the proposed models was demonstrated through evaluation on two datasets: the DUC-2002 single-document summarization corpus and a larger dataset constructed from DailyMail news articles. The models achieved ROUGE scores that compare favorably with state-of-the-art extractive systems; in particular, the neural sentence extractor (NN-SE) consistently outperformed baseline methods and several established competitive systems.
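
ROUGE measures n-gram overlap between a system summary and a reference. A minimal ROUGE-1 recall computation, shown below, illustrates the flavor of metric used; it is only a toy function and does not reproduce the official ROUGE toolkit used in the paper's evaluation.

```python
from collections import Counter

def rouge_1_recall(system_summary, reference_summary):
    """Unigram recall: fraction of reference unigrams also produced by the system."""
    sys_counts = Counter(system_summary.lower().split())
    ref_counts = Counter(reference_summary.lower().split())
    overlap = sum(min(count, sys_counts[word]) for word, count in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)


print(rouge_1_recall("the model extracts salient sentences",
                     "the model selects the most salient sentences"))  # ~0.57
```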

In human evaluations, the sentence extraction model was also ranked highly in terms of informativeness and fluency, closely following the human-authored summaries. This result underscores the model's ability to generate high-quality summaries without sophisticated linguistic constraints or hand-engineered features.

Future Developments

The approaches detailed in this paper open several avenues for future research. Enhancing the word extraction model by incorporating relational information from tree-based algorithms or leveraging phrase-based extraction techniques could further improve summary coherence and grammaticality. Another promising direction could be to explore purely unsupervised methodologies guided by information-theoretic principles, thus reducing the dependence on large labeled datasets.

The neural methodologies presented provide a robust foundation for the development of more advanced, context-aware summarization systems. Future advancements could focus on incorporating fine-grained syntactic and semantic information, potentially bridging the gap between extractive and abstractive summarization paradigms.

Conclusion

The proposed neural summarization framework effectively captures document content by relying on hierarchical, attention-based models that eschew manual feature engineering. This shift towards data-driven summarization represents a significant step in the development of automatic summarization systems, offering scalable solutions capable of handling large corpora while maintaining or enhancing the performance benchmarks set by traditional methods. The emphasis on sentence and word extraction models points to a flexible, robust approach that lays the groundwork for future innovations in the automatic summarization landscape.

Authors (2)
  1. Jianpeng Cheng (19 papers)
  2. Mirella Lapata (135 papers)
Citations (800)