- The paper introduces LexRank, a novel graph-based method that computes sentence salience using eigenvector centrality.
- It evaluates three variants, Degree Centrality, LexRank with a similarity threshold, and Continuous LexRank, all of which achieve higher ROUGE scores than baselines on DUC datasets.
- The approach enhances multi-document summarization by effectively capturing inter-sentence relationships to rank sentence importance.
LexRank: Graph-based Lexical Centrality as Salience in Text Summarization
In the paper "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization," Erkan and Radev introduce an innovative approach for assessing sentence importance within the context of NLP, specifically geared toward the task of Text Summarization (TS). Their methodology diverges from traditional summarization techniques by leveraging a graph-based model rooted in the concept of eigenvector centrality.
Introduction
The paper situates itself in the evolving landscape of statistical techniques in NLP, foregrounding the utility of graph-based approaches that have proven effective in areas like word clustering and prepositional phrase attachment. Erkan and Radev extend this paradigm to multi-document extractive text summarization, which seeks to distill the core content of multiple documents into a concise summary.
Problem and Approach
Traditional extractive text summarization methods prioritize identifying key sentences based on heuristic features such as sentence position, word frequency, and the presence of specific phrases. However, these methods often overlook the broader relational context among sentences. The authors propose LexRank, a method that computes sentence importance using eigenvector centrality within a graph representation of sentences.
Centrality and Graph Representation
Key to their approach is the construction of a graph in which nodes represent sentences and edges are weighted by inter-sentence similarity, measured with an idf-modified cosine metric. This graph-based model allows for a holistic assessment of sentence importance based on the overall structure and connectivity within the document cluster.
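As a concrete illustration, the edge weights can be computed roughly as follows. This is a minimal sketch, not the authors' code; tokenization and the idf table are assumed to be supplied by the caller:

```python
import math
from collections import Counter

def idf_modified_cosine(x, y, idf):
    """idf-modified cosine between two tokenized sentences.

    x, y: lists of word tokens; idf: dict mapping word -> idf weight.
    Each term frequency is scaled by the word's idf before the dot product.
    """
    tx, ty = Counter(x), Counter(y)
    num = sum(tx[w] * ty[w] * idf.get(w, 0.0) ** 2 for w in tx.keys() & ty.keys())
    def norm(tf):
        return math.sqrt(sum((c * idf.get(w, 0.0)) ** 2 for w, c in tf.items()))
    denom = norm(tx) * norm(ty)
    return num / denom if denom else 0.0
```

Identical sentences score 1.0 and sentences with no shared words score 0.0, so the values serve directly as edge weights.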
Degree Centrality
The authors first explore "Degree Centrality," a measure based on counting the number of significant similarity connections each sentence has. While effective, this measure treats every connection equally, ignoring whether a sentence is connected to central or peripheral sentences.
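Given a pairwise similarity matrix, degree centrality reduces to counting above-threshold neighbors. A minimal sketch, where the threshold value 0.1 is purely illustrative:

```python
def degree_centrality(sim, threshold=0.1):
    """Number of other sentences whose similarity exceeds the threshold.

    sim: square matrix of pairwise sentence similarities.
    """
    n = len(sim)
    return [sum(1 for j in range(n) if j != i and sim[i][j] > threshold)
            for i in range(n)]
```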
LexRank with Threshold
To address this egalitarian limitation of Degree Centrality, the authors introduce "LexRank," which computes eigenvector centrality by propagating centrality scores through the graph iteratively, dividing each neighbor's contribution by that neighbor's degree so that well-connected nodes do not dominate. LexRank incorporates a damping factor, akin to the PageRank algorithm, which makes the underlying Markov chain irreducible and aperiodic, guaranteeing convergence to a unique centrality vector.
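The iteration can be sketched as a PageRank-style power method over the thresholded similarity graph. This is a simplified illustration, not the authors' implementation, and it uses the PageRank damping convention with d = 0.85 weighting the link-following term:

```python
def lexrank(sim, threshold=0.1, damping=0.85, tol=1e-6):
    """Power iteration for LexRank on a thresholded similarity graph.

    sim: square matrix of pairwise sentence similarities (sim[i][i] = 1,
    so every node keeps a self-link and has degree >= 1).
    Returns centrality scores that sum to 1.
    """
    n = len(sim)
    adj = [[1.0 if sim[i][j] > threshold else 0.0 for j in range(n)]
           for i in range(n)]
    deg = [sum(row) for row in adj]
    p = [1.0 / n] * n
    while True:
        new = [(1 - damping) / n +
               damping * sum(adj[j][i] * p[j] / deg[j] for j in range(n))
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(new, p)) < tol:
            return new
        p = new
```

Because each node distributes its score evenly among its neighbors, a sentence scores highly only when it is endorsed by other well-endorsed sentences.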
Continuous LexRank
Further refining the approach, Erkan and Radev propose "Continuous LexRank," which eliminates the need for discretization thresholds and utilizes the actual cosine similarity values directly, thus preserving the granular information in the graph.
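In code terms, Continuous LexRank keeps the same power iteration but row-normalizes the raw cosine weights instead of binarizing them against a threshold. A hedged sketch of that change, again using a PageRank-style damping convention:

```python
def continuous_lexrank(sim, damping=0.85, tol=1e-6):
    """LexRank variant that uses raw similarity weights directly.

    Each row of sim is normalized to a probability distribution, so the
    implied random walk follows stronger similarities more often.
    """
    n = len(sim)
    row_sums = [sum(row) for row in sim]
    p = [1.0 / n] * n
    while True:
        new = [(1 - damping) / n +
               damping * sum(sim[j][i] * p[j] / row_sums[j] for j in range(n))
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(new, p)) < tol:
            return new
        p = new
```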
Experimental Evaluation
The paper thoroughly evaluates these methods using the DUC 2003 and 2004 datasets, supplemented by additional cross-lingual datasets. Summaries generated by various implementations of LexRank were compared against traditional centroid-based summarization and baseline methods. Performance was assessed using the ROUGE metric, specifically ROUGE-1, which has been shown to correlate strongly with human judgment.
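ROUGE-1 measures unigram overlap between a candidate summary and a reference summary; recall is the variant typically reported in DUC-era evaluations. A minimal sketch, assuming tokenization is done upstream:

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams (with multiplicity) found in the candidate."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / sum(ref.values())
```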
Results
The authors report that all three centrality-based methods—Degree Centrality, LexRank with threshold, and Continuous LexRank—consistently outperform baseline and centroid-based summarization methods. Notably, LexRank with a low threshold and Continuous LexRank yield particularly strong results, often scoring higher than competing systems in DUC evaluations. Robustness experiments, in which extraneous noisy documents were added to the clusters, further underscore the efficacy of the proposed methods.
Implications and Future Work
The findings from this paper have significant implications for both theoretical and practical advancements in text summarization. The introduction of graph-based centrality offers an alternative approach that inherently accounts for inter-sentence relationships, providing a more nuanced mechanism for identifying salient information.
Looking forward, the authors suggest potential extensions of graph-based centrality methods into other NLP tasks. Their ongoing work includes applying random walk methods to bipartite graphs for semi-supervised learning, which may further enhance the adaptability and precision of NLP applications ranging from document classification to spam detection.
Conclusion
Erkan and Radev's LexRank represents a meaningful shift towards leveraging graph-theoretic measures in text summarization. Their comprehensive analysis and empirical results validate the superiority of centrality-based sentence salience metrics over traditional methods. The proposed methodology not only advances the state-of-the-art in extractive summarization but also opens new avenues for applying graph-based techniques across various NLP domains.