- The paper reformulates extractive summarization as a contextual bandit problem and trains the resulting model with policy gradient reinforcement learning.
- It achieves state-of-the-art ROUGE scores on the CNN/Daily Mail corpus while requiring fewer update steps than competing approaches.
- The approach avoids exposure bias and removes the dependence on heuristically generated extractive labels, improving summary quality and adaptability.
Overview of BanditSum: Extractive Summarization as a Contextual Bandit
The research paper presents BanditSum, a novel approach to extractive summarization that reframes the task as a contextual bandit problem rather than the traditional sequential binary labeling of sentences. The reformulation is trained with reinforcement learning (RL), specifically policy gradient methods, to optimize summarization quality directly, without heuristically generated extractive labels. BanditSum aims to improve on existing methods, achieving high ROUGE scores with fewer update steps than its predecessors.
Approach and Methodology
BanditSum casts extractive summarization as a contextual bandit problem in which the document is the context and the selection of a subset of its sentences is the action. A policy gradient RL algorithm trains the model to choose sentence subsets that maximize a ROUGE-based reward. Importantly, the method does not suffer from exposure bias and obviates the need for pre-training on heuristically generated labels, addressing two key limitations of previous approaches. Because sentences are sampled without replacement according to learned affinities rather than labeled in reading order, selection does not systematically favor earlier sentences, a significant advantage when the best summary sentences appear late in the document.
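To make the selection-and-update loop concrete, here is a minimal sketch in PyTorch. It is not the authors' implementation: the sentence encoder is replaced by pre-computed (here random) embeddings, a unigram-overlap score stands in for the ROUGE-based reward, and the network sizes are arbitrary. What it does follow is the contextual-bandit recipe described above: score sentences with an affinity network, sample a summary without replacement, observe a reward, and apply a REINFORCE-style update against a sampled-average baseline.

```python
# Minimal sketch of the contextual-bandit training loop (PyTorch).
# Not the BanditSum release: embeddings, reward, and network sizes are placeholders.
import torch
import torch.nn as nn


class AffinityScorer(nn.Module):
    """Maps each sentence embedding to a selection affinity in (0, 1)."""

    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, sent_embs: torch.Tensor) -> torch.Tensor:  # (n_sents, dim)
        return torch.sigmoid(self.net(sent_embs)).squeeze(-1)    # (n_sents,)


def sample_summary(affinities: torch.Tensor, k: int):
    """Sample k sentence indices without replacement, proportional to affinity."""
    probs, log_prob, chosen = affinities, torch.zeros(()), []
    for _ in range(min(k, affinities.numel())):
        dist = probs / probs.sum()
        idx = int(torch.multinomial(dist, 1))
        log_prob = log_prob + torch.log(dist[idx])
        chosen.append(idx)
        mask = torch.ones_like(probs)
        mask[idx] = 0.0              # exclude the chosen sentence from later draws
        probs = probs * mask
    return chosen, log_prob


def unigram_overlap_reward(chosen, doc_sents, reference):
    """Toy stand-in for the ROUGE-based reward used in the paper."""
    summary_words = {w for i in chosen for w in doc_sents[i].lower().split()}
    ref_words = set(reference.lower().split())
    return len(summary_words & ref_words) / max(len(ref_words), 1)


def reinforce_step(scorer, optimizer, sent_embs, doc_sents, reference,
                   k=3, n_samples=4):
    """One bandit update: sample several candidate summaries, use their mean
    reward as a baseline, and push up the log-probability of the better ones."""
    affinities = scorer(sent_embs)
    log_probs, rewards = [], []
    for _ in range(n_samples):
        chosen, log_prob = sample_summary(affinities, k)
        log_probs.append(log_prob)
        rewards.append(unigram_overlap_reward(chosen, doc_sents, reference))
    baseline = sum(rewards) / len(rewards)
    loss = -sum((r - baseline) * lp for r, lp in zip(rewards, log_probs)) / n_samples
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return baseline


# Usage with toy data: random embeddings stand in for a real sentence encoder.
doc_sents = ["Sentence one about the topic.", "An aside.", "Key fact appears here.",
             "Another key fact late in the document.", "Closing remark."]
reference = "Key fact appears here with another key fact."
sent_embs = torch.randn(len(doc_sents), 32)
scorer = AffinityScorer(dim=32)
optimizer = torch.optim.Adam(scorer.parameters(), lr=1e-3)
for _ in range(10):
    reinforce_step(scorer, optimizer, sent_embs, doc_sents, reference)
```

In the paper's setting, the reward would be computed against the abstractive reference with ROUGE, and the affinities would come from a learned document encoder rather than random vectors; the sketch only illustrates the mechanics of sampling without replacement and the baseline-corrected update.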
Research Contributions and Results
The paper makes several notable contributions:
- Theoretical Grounding: It recasts extractive summarization in the contextual bandit framework and shows how policy gradient RL methods apply in this setting.
- Experimental Validation: Performance comparisons across multiple evaluation settings show that BanditSum achieves state-of-the-art results with fewer update steps than competing models such as the RL-based Refresh and the supervised SummaRuNNer.
- Quality and Non-redundancy: Human evaluations suggest that BanditSum summaries are perceived as higher quality and less redundant than those of competing systems, highlighting the benefit of using an exact policy gradient update.
Quantitatively, BanditSum achieves competitive ROUGE scores on the CNN/Daily Mail corpus, and its advantage over competing models is most pronounced when summary-worthy sentences appear late in the document.
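For readers who want the policy gradient update spelled out, one standard way to write the objective and its sampled gradient estimate is shown below; the notation is ours, a hedged reconstruction of a REINFORCE-style estimator with a sampled-average baseline rather than a quotation from the paper.

```latex
J(\theta) = \mathbb{E}_{i \sim p_\theta(\cdot \mid d)}\left[ R(i, a) \right],
\qquad
\nabla_\theta J(\theta) \approx \frac{1}{B} \sum_{b=1}^{B}
\left( R(i_b, a) - \bar{r} \right)\, \nabla_\theta \log p_\theta(i_b \mid d)
```

Here d is the document (the bandit context), i_b is the b-th sampled set of extracted sentence indices (the action), a is the reference summary, R is a ROUGE-based reward, and \bar{r} is the mean reward of the B samples, serving as a variance-reducing baseline.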
Implications and Future Directions
Treating extractive summarization as a contextual bandit problem has notable implications. Removing the dependency on heuristic extractive labels not only simplifies training but also makes the model more flexible and adaptable to the structure of the content. The approach's success may encourage further use of RL in other natural language processing tasks, potentially leading to more efficient and effective models.
Future research might consider incorporating additional rewards related to coherence or document structure to further enhance summary quality. Additionally, exploring different neural architectures for sentence affinity prediction may provide further improvements and insights into the interaction between document structure and summarization quality. The findings from BanditSum invite continued investigation into how context-based action selection can transform summarization tasks beyond the current extractive frameworks.
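If a coherence term were added, the most natural integration point would be the reward function itself. The snippet below is purely illustrative: `rouge_fn` and `coherence_fn` are hypothetical callables (the latter might be, say, an average adjacent-sentence similarity), and the weighting is an arbitrary choice, not something proposed in the paper.

```python
def combined_reward(chosen, doc_sents, reference, rouge_fn, coherence_fn, alpha=0.8):
    """Hypothetical composite reward mixing content overlap with coherence.

    `rouge_fn` and `coherence_fn` are placeholders supplied by the caller;
    neither name comes from the BanditSum paper."""
    content = rouge_fn(chosen, doc_sents, reference)            # e.g. ROUGE-based score
    coherence = coherence_fn([doc_sents[i] for i in chosen])    # hypothetical coherence term
    return alpha * content + (1.0 - alpha) * coherence
```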
In conclusion, BanditSum presents a significant step forward in the quest for efficient, label-independent summarization models, providing a rigorous platform for enhancing extractive summarization through reinforcement learning techniques.