Continuous LexRank
- The paper introduces continuous LexRank, replacing binary similarity thresholds with fully weighted cosine similarity graphs to capture nuanced contextual relations.
- Continuous LexRank employs a stochastic random walk with a damping factor to compute sentence centrality, analogous to the PageRank algorithm.
- Empirical evaluations on DUC datasets show that continuous LexRank achieves ROUGE-1 scores competitive with thresholded LexRank and robust performance in noisy conditions.
Continuous LexRank is a graph-based method for computing sentence salience in extractive multi-document text summarization. It generalizes the original LexRank algorithm by using fully weighted, real-valued sentence similarity graphs rather than thresholded binary graphs. This approach represents sentences as nodes in a continuous cosine similarity network, enabling fine-grained measurement of semantic proximity for centrality-based ranking. The algorithm employs a stochastic random-walk with damping, analogous to PageRank, to compute the stationary centrality distribution over sentences. Empirical analysis demonstrates its competitive performance in large-scale summarization evaluations and its robustness to noisy or imperfect topical clusters.
1. Continuous LexRank: Definition and Core Principles
Continuous LexRank formulates summarization as the identification of the most "central" sentences in a multi-document cluster, based on inter-sentence lexical similarity. Each sentence $s_i$ is represented as a high-dimensional tf·idf vector $\mathbf{v}_i$, where the component for word $w$ is the product of $w$'s term frequency in $s_i$ and its inverse document frequency, $\mathrm{tf}_{w,s_i} \cdot \mathrm{idf}_w$, quantifying both local and global importance. Sentence pairs are then scored by cosine similarity:
$$\mathrm{sim}(s_i, s_j) = \frac{\mathbf{v}_i \cdot \mathbf{v}_j}{\lVert \mathbf{v}_i \rVert \, \lVert \mathbf{v}_j \rVert}$$
A weighted adjacency matrix $W$ with $W_{ij} = \mathrm{sim}(s_i, s_j)$ is created, leveraging the full real-valued similarity range rather than binarizing with a threshold as in traditional (thresholded) LexRank. This yields a denser, information-preserving sentence graph.
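The construction described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: the function names, the whitespace tokenization, the smoothed logarithmic idf formula, and the choice to estimate idf from the sentences themselves are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(sentences, documents):
    """Build sparse tf-idf vectors (dicts) for tokenized sentences.

    idf is estimated from `documents` (each a list of tokens). The smoothed
    log form used here is one common choice and an assumption of this sketch.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word once per document
    idf = {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}
    default_idf = math.log(1 + n_docs) + 1.0  # for words unseen in `documents`
    vectors = []
    for sent in sentences:
        tf = Counter(sent)
        vectors.append({w: count * idf.get(w, default_idf) for w, count in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[w] for w, weight in u.items() if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Full real-valued similarity matrix; no threshold is applied."""
    n = len(vectors)
    return [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

# Toy input: each sentence doubles as a "document" for idf purposes.
sentences = [s.split() for s in [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stocks fell sharply today",
]]
W = similarity_matrix(tfidf_vectors(sentences, sentences))
```

In this toy input the first two sentences share several words and receive a positive similarity, while the third shares none and receives exactly zero; the diagonal holds each sentence's similarity with itself.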
2. Random Walk Centrality Computation
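The algorithm constructs a row-stochastic transition matrix $P$ by normalizing the rows of $W$ so each sums to one, $P_{ij} = W_{ij} / \sum_{k} W_{ik}$. Centrality is then defined as the stationary distribution of a random walk with damping factor $d$. A uniform "teleportation" matrix $U$ with all entries $1/N$ (where $N$ is the number of sentences) is incorporated to ensure ergodicity:
$$M = d\,U + (1 - d)\,P$$
The final centrality vector $\mathbf{p}$ satisfies:
$$\mathbf{p} = M^{\top} \mathbf{p}, \qquad \sum_{i=1}^{N} p_i = 1$$
or in expanded form,
$$p_i = \frac{d}{N} + (1 - d) \sum_{j=1}^{N} P_{ji}\, p_j$$
Here $d$ weights the uniform jump and $1 - d$ the similarity-driven step. This formulation is mathematically equivalent to PageRank except the graph is undirected due to symmetric cosine similarity, and self-links are typically present. Numerically, $\mathbf{p}$ is computed iteratively by the power method until convergence ($\lVert \mathbf{p}^{(k+1)} - \mathbf{p}^{(k)} \rVert < \epsilon$), initializing with the uniform distribution.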
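The power iteration can be sketched as follows. This sketch assumes the convention in which the damping factor `d` is the weight of the uniform jump, so that each update is `p_i = d/N + (1 - d) * sum_j P[j][i] * p[j]`. The function name, the default `d=0.15`, the L1 convergence test, and the uniform fallback for all-zero rows are illustrative assumptions rather than values taken from the source.

```python
def continuous_lexrank(W, d=0.15, tol=1e-8, max_iter=1000):
    """Damped random-walk centrality on a weighted similarity matrix W.

    W is a square list of lists of non-negative similarities.
    d is the weight of the uniform jump (an assumed default, not a
    value reported in the source).
    """
    n = len(W)
    # Row-normalize W into a row-stochastic matrix P.
    P = []
    for row in W:
        total = sum(row)
        # An all-zero row falls back to uniform (assumption of this sketch).
        P.append([w / total for w in row] if total > 0 else [1.0 / n] * n)

    p = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        new_p = [
            d / n + (1.0 - d) * sum(P[j][i] * p[j] for j in range(n))
            for i in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(new_p, p))  # L1 change
        p = new_p
        if delta < tol:
            break
    return p

# Toy graph: sentences 0 and 1 are similar, sentence 2 is weakly related.
W_example = [
    [1.0, 0.5, 0.1],
    [0.5, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
scores = continuous_lexrank(W_example)
```

The scores sum to one, and in this toy graph the two mutually similar sentences tie for the highest centrality while the weakly connected sentence ranks last.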
3. Thresholded versus Continuous LexRank
Standard LexRank ("thresholded LexRank") constructs a binary adjacency matrix $A$ where $A_{ij} = 1$ if $\mathrm{sim}(s_i, s_j) > t$, and $0$ otherwise; the threshold $t$ is typically tuned (e.g., $t = 0.1$ yields the strongest results). This approach keeps only the "strongest" edges, discarding similarity information below the threshold and treating every surviving edge as equally strong. By contrast, continuous LexRank retains all computed cosine similarities, resulting in a graph that better reflects the full spectrum of contextual relatedness within the cluster. The motivation is to avoid the information loss of binary discretization; in the reported results the two variants score within a few thousandths of ROUGE-1 of each other on most tasks (Erkan et al., 2011).
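The difference between the two graph constructions is small in code. The sketch below uses hypothetical helper names and a made-up 3×3 similarity matrix; only the threshold rule (edge present when similarity exceeds `t`) and the example value `t = 0.1` come from the text.

```python
def thresholded_adjacency(S, t=0.1):
    """Binary graph: an edge exists only where similarity exceeds t."""
    return [[1.0 if s > t else 0.0 for s in row] for row in S]

def continuous_adjacency(S):
    """Weighted graph: every similarity is kept as the edge weight."""
    return [row[:] for row in S]

# Made-up similarities for three sentences (illustrative values only).
S = [
    [1.00, 0.45, 0.08],
    [0.45, 1.00, 0.12],
    [0.08, 0.12, 1.00],
]

A_binary = thresholded_adjacency(S, t=0.1)
A_weighted = continuous_adjacency(S)

# A_binary is [[1, 1, 0], [1, 1, 1], [0, 1, 1]]: the 0.45 and 0.12 edges
# become indistinguishable, and the 0.08 edge disappears entirely.
```

Either matrix can then be row-normalized and passed to the same random-walk computation; only the edge weights differ.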
4. Empirical Performance and DUC Evaluations
Extensive evaluations were conducted on the DUC 2003/2004 and cross-lingual DUC 2004 datasets, using 665-byte generic extractive summaries and ROUGE-1 recall as the main metric. Comparative assessments included random sentence selection, lead-based summaries, centroid methods (tf·idf centroid pseudo-documents), degree centrality, thresholded LexRank, and continuous LexRank. The damping parameter $d$ was held fixed, and for thresholded methods the optimal threshold was found at $t = 0.1$.
A summary of representative ROUGE-1 results:
| Dataset / Method | Centroid | Degree (t=0.1) | LexRank (t=0.1) | Continuous LexRank |
|---|---|---|---|---|
| DUC2003 Task2 | 0.362 | 0.360 | 0.367 | 0.365 |
| DUC2004 Task2 | 0.367 | 0.371 | 0.374 | 0.376 |
| DUC2004 Task4a (MT) | 0.383 | 0.393 | 0.397 | 0.396 |
| DUC2004 Task4b (Human) | 0.403 | 0.403 | 0.405 | 0.397 |
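In these representative results, thresholded LexRank has the highest ROUGE-1 score on three of the four tasks and continuous LexRank on DUC2004 Task2. The gap between the two variants is at most 0.002 except on Task4b, where continuous LexRank trails by 0.008 and falls below the centroid and degree baselines (0.397 versus 0.403). On the other three tasks both variants exceed the centroid baseline, and both are reported to outperform the lead baseline. According to the DUC official rankings, these LexRank variants consistently place among the top systems, frequently within the 95% confidence interval of the leading peer (Erkan et al., 2011).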
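For readers unfamiliar with the metric, ROUGE-1 recall is the fraction of reference unigrams that also appear in the candidate summary, with counts clipped so a repeated word is not credited more often than it occurs. The sketch below is a simplified single-reference version with whitespace tokenization; the official ROUGE toolkit offers options such as stemming and multiple references that are omitted here, and the function name is hypothetical.

```python
from collections import Counter

def rouge1_recall(candidate_tokens, reference_tokens):
    """Clipped unigram recall of a candidate against one reference."""
    cand = Counter(candidate_tokens)
    ref = Counter(reference_tokens)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference word is matched at most as often as the candidate has it.
    overlap = sum(min(count, cand[w]) for w, count in ref.items())
    return overlap / total

reference = "the cat sat on the mat".split()
candidate = "the cat lay on a mat".split()
recall = rouge1_recall(candidate, reference)
# Matched unigrams: the (1 of 2), cat, on, mat -> 4 of 6 reference tokens.
```

Because the scores in the table differ by only a few thousandths, small changes in tokenization or truncation can matter when reproducing such comparisons.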
5. Robustness to Topical Noise
Continuous LexRank demonstrates strong robustness to noisy clusters. Experiments where 2 unrelated “off-topic” documents were injected into each 12-document DUC cluster (≈17% noise) revealed that continuous LexRank’s ROUGE-1 dropped by less than 0.01 absolute (e.g., from 0.376 to 0.369 on DUC2004 Task2). In contrast, baseline systems such as lead or random selection showed much larger degradations. This insensitivity to clustering imperfections is attributed to the random-walk nature of prestige flow, which dilutes the influence of isolated or off-topic nodes via global centrality computation (Erkan et al., 2011).
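One way to see this dilution is an illustrative special case, offered here as a sketch rather than as the paper's own analysis. Write $W_{ij}$ for the cosine similarity between sentences $i$ and $j$ and $p_i$ for the centrality of sentence $i$. With no teleportation ($d = 0$) and a connected, symmetric similarity graph, the walk satisfies detailed balance, so the stationary distribution is proportional to each sentence's total similarity to the cluster:

$$p_i = \frac{\sum_{j} W_{ij}}{\sum_{k}\sum_{l} W_{kl}}, \qquad \text{since } p_i \, \frac{W_{ij}}{\sum_{k} W_{ik}} = \frac{W_{ij}}{\sum_{k}\sum_{l} W_{kl}} = p_j \, \frac{W_{ji}}{\sum_{k} W_{jk}}.$$

As a toy example with assumed numbers, take four on-topic sentences with mutual similarity $0.5$, one off-topic sentence with similarity $0.05$ to each of them, and self-similarity $1$ throughout:

$$\sum_j W_{\text{on},j} = 1 + 3(0.5) + 0.05 = 2.55, \qquad \sum_j W_{\text{off},j} = 1 + 4(0.05) = 1.2,$$

$$p_{\text{on}} = \frac{2.55}{4(2.55) + 1.2} \approx 0.224, \qquad p_{\text{off}} = \frac{1.2}{4(2.55) + 1.2} \approx 0.105.$$

The off-topic sentence receives roughly half the centrality of an on-topic one. With $0 < d < 1$ the scores are pulled toward the uniform value $1/N$, but the ordering in this example is unchanged.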
6. Advances: Continuous-Similarity Graphs and Joint Ranking Extensions
Recent graph-based summarization methods such as RepRank (Li et al., 2020) generalize the continuous LexRank approach by leveraging continuous sentence and word embeddings (e.g., GloVe and self-attention representations), and constructing joint sentence/word/keyword similarity graphs. RepRank maintains all edge weights as real-valued cosine similarities and performs a joint random walk over both sentences and keywords in a unified eigenvector problem. Experimental results on DUC-2002 and DUC-2007 reported higher ROUGE-1/2 than standard LexRank, indicating that continuous, embedding-based similarity matrices capture semantic relatedness beyond surface lexical overlap. The absorbing random walk variant further improves redundancy handling with only minor performance tradeoffs. This suggests that the conceptual framework of continuous LexRank is a foundation for further advances in joint and semantic-centrality summarization algorithms (Li et al., 2020).
7. Summary and Significance
Continuous LexRank formalizes sentence importance in summarization as eigenvector centrality within a fully weighted, undirected similarity graph. It replaces binary edge thresholding with fine-grained continuous affinities, computes stationary centrality via a damped power method, and scores within a few thousandths of ROUGE-1 of thresholded LexRank on most DUC tasks while generally matching or exceeding the centroid baseline. Its resilience to noisy cluster composition and its extensions in embedding-driven frameworks affirm its ongoing relevance as a principled, robust approach for extractive summarization grounded in global sentence similarity structure (Erkan et al., 2011; Li et al., 2020).