LexRank: Graph-Based Text Summarization

Updated 6 March 2026

LexRank is an unsupervised graph-based algorithm for extractive text summarization that uses eigenvector centrality to rank sentence importance.
It constructs a sentence similarity graph using tf–idf weighted cosine similarity, connecting sentences above a similarity threshold.
Variants such as Guided LexRank, FastLexRank, and L1-robust LexRank extend its utility across diverse domains and improve computational efficiency.

LexRank is an unsupervised, graph-based algorithm for extractive text summarization that computes sentence salience via eigenvector centrality in a sentence-similarity graph. Originally developed for multi-document summarization, LexRank is domain- and language-agnostic, and its extensions and robust variants have been applied across a wide range of domains, including legal analysis, patent summarization, music information retrieval, infodemic management, and large-scale social media content structuring (Erkan et al., 2011, Raposo et al., 2014, Zhang et al., 2024, Jayatilleke et al., 13 Mar 2025, Li et al., 2024, Gregório et al., 2024, Timonina-Farkas, 4 Sep 2025, Pal et al., 2020).

1. Fundamental Principles and Mathematical Formulation

LexRank represents each sentence as a vector under a bag-of-words model, typically using tf–idf weighting. Pairwise sentence similarity, most commonly computed via (idf-modified) cosine similarity, determines the weighted edges in the undirected sentence graph. Denote document (or cluster) sentences as $s_1,\dots,s_n$ with tf–idf vectors $\mathbf{v}_i \in \mathbb{R}^V$ (where $V$ is the vocabulary size):

$(\mathbf{v}_i)_t = \textrm{tf}_{i,t} \cdot \log\frac{N}{\textrm{df}_t}$

with $N$ the total number of reference documents and $\textrm{df}_t$ the document frequency of term $t$.

Pairwise similarities are computed as: $\textrm{sim}(s_i,s_j) = \frac{\mathbf{v}_i \cdot \mathbf{v}_j}{\|\mathbf{v}_i\| \, \|\mathbf{v}_j\|}$

An edge is established between sentences $s_i$ and $s_j$ if $\textrm{sim}(s_i,s_j)\geq \theta$ (with $\theta$ typically $\approx 0.1$), forming a symmetric similarity/adjacency matrix, denoted $S$ or $\widetilde{W}$.
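As a minimal sketch of the tf–idf weighting and cosine similarity defined above, the following pure-Python functions build sparse sentence vectors and compare them. The function names, the toy document-frequency table, and the choice to treat unseen terms as having document frequency 1 are illustrative assumptions, not part of the original LexRank system.

```python
import math
from collections import Counter


def tfidf_vectors(sentences, doc_freq, n_docs):
    """Map each tokenized sentence to a sparse tf-idf vector (term -> weight)."""
    vectors = []
    for tokens in sentences:
        term_counts = Counter(tokens)
        vec = {}
        for term, count in term_counts.items():
            # Assumption: terms missing from the reference table get df = 1.
            df = doc_freq.get(term, 1)
            vec[term] = count * math.log(n_docs / df)
        vectors.append(vec)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy example: three tokenized sentences and a hypothetical reference corpus
# of 10 documents with the document frequencies below.
sentences = [["cat", "sat"], ["cat", "ran"], ["dog", "barked"]]
doc_freq = {"cat": 2, "sat": 1, "ran": 1, "dog": 1, "barked": 1}
vecs = tfidf_vectors(sentences, doc_freq, n_docs=10)
sim = [[cosine(a, b) for b in vecs] for a in vecs]
```

Sentences 0 and 1 share only the relatively common term "cat", so their similarity is positive but well below 1, while sentence 2 shares no terms with the others and gets similarity 0.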

LexRank applies a random-walk (PageRank-style) centrality model to the resulting graph. Row-normalizing the adjacency matrix yields a stochastic transition matrix $W$, and with damping factor $d$ (typically $0.85$), the centrality scores solve: $\mathbf{p} = d\, W^\top \mathbf{p} + (1-d)\, \frac{\mathbf{1}}{n}$. The power iteration continues until convergence in the $L_1$ norm.

LexRank supports thresholded (binarized) and continuous (weighted) connectivity (see Table 1):

Variant	Edge Construction	Edge Weights
Thresholded	$\textrm{sim}\geq\theta$	$1$
Continuous	All pairs	$\textrm{sim}$
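The difference between the two variants can be sketched as a single row-normalization routine. This is an illustrative helper under stated assumptions: the name `transition_matrix`, the convention that `theta=None` selects the continuous variant, and the uniform fallback for a row with no edges are all choices made here, not details taken from the source.

```python
def transition_matrix(sim, theta=None):
    """Row-normalize a similarity matrix into a stochastic matrix W.

    theta given -> thresholded variant (edge weight 1 where sim >= theta).
    theta None  -> continuous variant (edge weight equals the similarity).
    """
    n = len(sim)
    rows = []
    for i in range(n):
        if theta is None:
            row = [float(x) for x in sim[i]]
        else:
            row = [1.0 if sim[i][j] >= theta else 0.0 for j in range(n)]
        total = sum(row)
        if total == 0.0:
            # Assumption: an isolated sentence jumps uniformly to any sentence.
            rows.append([1.0 / n] * n)
        else:
            rows.append([x / total for x in row])
    return rows


sim = [
    [1.00, 0.30, 0.05],
    [0.30, 1.00, 0.20],
    [0.05, 0.20, 1.00],
]
W_thresholded = transition_matrix(sim, theta=0.1)
W_continuous = transition_matrix(sim)
```

With $\theta = 0.1$ the weak 0.05 link between the first and third sentences is dropped, whereas the continuous variant keeps it with a small weight.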

2. Algorithmic Workflow and Implementation

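The canonical LexRank workflow, as implemented in the original system and reproducible with general-purpose tooling (e.g., NetworkX for PageRank-style ranking; Summa and older Gensim releases implement the closely related TextRank), comprises: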

Preprocess all source documents: tokenize, normalize, stop-word removal, stemming/lemmatization.
Compute sentence-level tf–idf vectors, with idf statistics from a large reference corpus.
Form the similarity matrix $S$ using (idf-modified) cosine similarity.
For thresholded LexRank, apply threshold $\theta$ to $S$ to form the adjacency matrix $A$ (and enforce $A_{ii}=1$).
Normalize rows to build the stochastic transition matrix $W$.
Set initial centrality scores $p^{(0)}=1/n$ and apply the power method:

$p^{(t+1)} = d W^\top p^{(t)} + (1-d)\frac{\mathbf{1}}{n}$

until $\|p^{(t+1)}-p^{(t)}\|_1<\varepsilon$ (e.g., $\varepsilon=10^{-6}$).

Extract the top-ranked sentences, enforcing length or token count limits, optionally reordered to preserve input sequence.
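The power-method and extraction steps above can be sketched as follows. This is a minimal illustration, assuming a row-stochastic matrix `W` is already available; the function names and the `max_iter` safeguard are additions made here for the example.

```python
def lexrank_scores(W, d=0.85, eps=1e-6, max_iter=1000):
    """Power iteration for p = d * W^T p + (1 - d) * 1/n."""
    n = len(W)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        # (W^T p)_j = sum_i W[i][j] * p[i]
        new_p = [
            (1.0 - d) / n + d * sum(W[i][j] * p[i] for i in range(n))
            for j in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(new_p, p))  # L1 change
        p = new_p
        if delta < eps:
            break
    return p


def top_sentences(scores, k):
    """Indices of the k highest-scoring sentences, restored to input order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Row-stochastic matrix for a 3-sentence chain graph with self-links.
W = [
    [0.5, 0.5, 0.0],
    [1.0 / 3, 1.0 / 3, 1.0 / 3],
    [0.0, 0.5, 0.5],
]
p = lexrank_scores(W)
summary_ids = top_sentences(p, k=1)
```

In this toy graph the middle sentence is connected to both others, so it receives the highest centrality and is the one selected.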

LexRank notably enforces self-links (or uses teleportation) to avoid dangling nodes. Redundancy is managed post hoc via filters such as Maximal Marginal Relevance (MMR).
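A greedy MMR-style filter of the kind mentioned above might look like the sketch below. The trade-off parameter `lam`, its default value, and the function name are assumptions for illustration; in practice the centrality scores and similarities would need to be on comparable scales.

```python
def mmr_select(scores, sim, k, lam=0.7):
    """Greedily pick k sentences, trading centrality against redundancy.

    Each step picks the candidate maximizing
        lam * score - (1 - lam) * (max similarity to already selected).
    """
    selected = []
    candidates = set(range(len(scores)))
    while candidates and len(selected) < k:
        def mmr_value(i):
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * scores[i] - (1.0 - lam) * redundancy

        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(candidates), key=mmr_value)
        selected.append(best)
        candidates.remove(best)
    return selected


scores = [0.50, 0.45, 0.30]
sim = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
chosen = mmr_select(scores, sim, k=2, lam=0.5)
```

Here the second sentence is nearly a duplicate of the first, so the filter skips it in favor of the less central but novel third sentence.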

3. Extensions, Robust Formulations, and Efficiency Improvements

Robust $L_1$-LexRank

An $L_1$-robust variant extends LexRank to uncertain and dynamically growing graphs by introducing column-wise $L_1$-bounded perturbations on the transition probabilities, formulated as a min-max linear program over the stationary distribution. The convex upper-bound problem extracted from this framework is: $\min_{x\in\Delta^N}\|P x - x\|_1 + \sum_j \varepsilon_j|x_j|$, where $P$ is the fixed transition matrix, $\varepsilon_j$ are per-column adversarial budgets, and $\Delta^N$ is the probability simplex (here $N$ is the number of graph nodes, not the reference-corpus size used in the idf formula) (Timonina-Farkas, 4 Sep 2025).
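To make the objective concrete, the sketch below only evaluates it for a candidate distribution; it does not solve the optimization problem. It assumes $P$ is column-stochastic, so that a stationary distribution satisfies $Px = x$, and the function name and toy values are hypothetical.

```python
def robust_objective(P, x, eps):
    """Evaluate ||P x - x||_1 + sum_j eps_j * |x_j| for a candidate x."""
    n = len(x)
    Px = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
    residual = sum(abs(Px[i] - x[i]) for i in range(n))   # stationarity gap
    penalty = sum(eps[j] * abs(x[j]) for j in range(n))   # perturbation cost
    return residual + penalty


# Toy 2-node chain; each column of P sums to 1.
P = [
    [0.9, 0.2],
    [0.1, 0.8],
]
eps = [0.05, 0.05]
stationary = [2.0 / 3, 1.0 / 3]   # satisfies P x = x for this P
uniform = [0.5, 0.5]

value_stationary = robust_objective(P, stationary, eps)
value_uniform = robust_objective(P, uniform, eps)
```

With equal budgets the penalty term is the same for any point on the simplex, so the exact stationary distribution scores lower than the uniform guess; unequal budgets would shift mass away from columns with large $\varepsilon_j$.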

Guided LexRank (GLARE)

GLARE incorporates external, task-specific guidance (e.g., legal themes) by combining intrinsic LexRank centrality $\gamma_s$ with extrinsic similarity $\sigma_s$ (e.g., BM25 score to a theme corpus), with linear weighting: $\textrm{Score}(s) = \alpha \gamma_s + \beta \sigma_s$ where $\alpha$ and $\beta$ are tunable (Gregório et al., 2024). This increases retrieval robustness, especially in low-supervision or concept-drift settings.
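The linear combination can be sketched in a few lines. The function name, the default weights, and the toy values are illustrative assumptions; the source does not specify how $\gamma_s$ and $\sigma_s$ are scaled before being combined.

```python
def guided_scores(gamma, sigma, alpha=0.5, beta=0.5):
    """Score(s) = alpha * gamma_s + beta * sigma_s for each sentence s."""
    if len(gamma) != len(sigma):
        raise ValueError("gamma and sigma must have one entry per sentence")
    return [alpha * g + beta * s for g, s in zip(gamma, sigma)]


# Hypothetical values: gamma = LexRank centrality, sigma = similarity of each
# sentence to an external theme description (e.g., a rescaled BM25 score).
gamma = [0.2, 0.5, 0.3]
sigma = [0.9, 0.1, 0.4]
scores = guided_scores(gamma, sigma)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

In this example the first sentence is not the most central, but its strong match to the theme moves it to the top of the guided ranking.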

FastLexRank

FastLexRank reduces the computational complexity of LexRank from $\mathcal{O}(n^2)$ to $\mathcal{O}(n)$ under a fully connected graph assumption. For embedding matrix $E\in\mathbb{R}^{n\times d}$ (unit-normalized TF–IDF or sentence embeddings; here $d$ denotes the embedding dimension, not the damping factor):

Compute $z=\sum_{i=1}^n E[i]$, normalize $z=z/\|z\|$.
Centrality: $c=E z$.

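This non-iterative, closed-form version yields the same ranking as the stationary distribution of the undamped random walk on the fully connected graph, because for a symmetric, non-negative similarity matrix that distribution is proportional to each sentence's summed similarity $E(E^\top \mathbf{1})$. It enables real-time processing of very large corpora, at the cost of not thresholding weak similarities (Li et al., 2024).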
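The two steps can be sketched directly in pure Python. This is an illustrative version under the assumption that the rows of `E` are already unit-normalized; the function name and the zero-norm guard are additions made here. Because $Ez = E(E^\top\mathbf{1})/\|z\|$, each score equals the corresponding row sum of $EE^\top$ divided by a common constant, and the full $n \times n$ similarity matrix is never formed.

```python
import math


def fast_lexrank(E):
    """Closed-form centrality c = E z, with z the normalized sum of rows of E."""
    n, dim = len(E), len(E[0])
    z = [sum(E[i][k] for i in range(n)) for k in range(dim)]
    norm = math.sqrt(sum(v * v for v in z))
    if norm == 0.0:
        return [0.0] * n  # degenerate case: embeddings cancel out
    z = [v / norm for v in z]
    return [sum(e * w for e, w in zip(row, z)) for row in E]


# Three unit-length toy embeddings in two dimensions.
E = [
    [1.0, 0.0],
    [0.6, 0.8],
    [0.0, 1.0],
]
c = fast_lexrank(E)
ranking = sorted(range(len(c)), key=lambda i: c[i], reverse=True)
```

The middle embedding lies between the other two, so it has the largest total similarity and ranks first.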

4. Empirical Evaluation and Comparative Performance

LexRank was evaluated extensively on DUC 2003/2004 datasets using ROUGE metrics. ROUGE-1 recall scores for LexRank variants typically surpassed centroid-based and degree-centric methods:

Task	Centroid	Degree	LexRank ($\theta=0.1$)	Cont. LexRank
DUC 2003 Task 2	0.3624	0.3595	0.3666	0.3646
DUC 2004 Task 2	0.3670	0.3707	0.3736	0.3758
DUC 2004 Task 4a (MT)	0.3826	0.3928	0.3974	0.3963
DUC 2004 Task 4b (human)	0.4034	0.4026	0.4052	0.3966

LexRank demonstrated remarkable robustness to noisy clustering, with negligible performance drop when unrelated documents were injected (average decrease $<0.01$ in ROUGE-1) (Erkan et al., 2011).

Domain Adaptations

Music summarization: LexRank over clustered MFCC embeddings outperformed contiguous clipping approaches for Fado genre recognition tasks, yielding +1.2–1.8% absolute gains in SVM classification accuracy (e.g., from 96.2% to 97.4%) (Raposo et al., 2014).
Legal retrieval: Guided LexRank achieved recall@6 = 0.7575 and MAP@6 = 0.5345, outperforming both Elasticsearch baseline and topic-model-based extractive summaries (Gregório et al., 2024).
Patent hybrid summarization: LexRank-preprocessed extracts, when pipelined into fine-tuned BART with LoRA adapters, yielded ROUGE-1 ≈ 0.46, outperforming mT5 and LongT5 and remaining competitive with PEGASUS and BigBirdPegasus under resource-constrained adaptation (Jayatilleke et al., 13 Mar 2025).
Infodemic management: LexRank extractive summaries, embedded with Word2Vec and compared via Word Mover’s Distance, outperformed 35 alternative models by balancing true/false positive rates for guideline-to-news matching (Pal et al., 2020).

5. Generalizations and Practical Considerations

LexRank is fully unsupervised, parameter-light (primarily requiring a similarity threshold $\theta$, damping factor $d$, and convergence tolerance $\varepsilon$), and requires only TF–IDF statistics from a reference corpus. Its global sentence centrality is not topic-specific, but can be steered by hybridization with external guidance.

Practical notes:

Sparse graphs (higher $\theta$) retain efficiency and robustness, but excessive sparsity can reduce coverage.
Precomputing idf on a reference corpus yields more stable importance assignments.
Redundancy mitigation via MMR or similar is essential for non-repetitive summaries.
For very large (multi-thousand sentence) corpora, computational scalability can become critical, motivating FastLexRank or graph sparsification.

Limitations include reliance on lexical overlap, sensitivity to thresholding, lack of semantic generalization, and inability to perform abstraction or paraphrase resolution within summaries (Zhang et al., 2024).

6. Research Developments and Variants

LexRank has served as the foundation for a spectrum of extensions:

$L_1$-robust LexRank: Models stochastic uncertainty in sentence connections and graph growth, deriving a convex upper-bound LP for robust, stable centrality estimation (Timonina-Farkas, 4 Sep 2025).
GLARE/Guided LexRank: Fuses intrinsic centrality and extrinsic theme-guidance (BM25 or semantics), particularly impactful in retrieval and classification regimes with infrequent labels or concept drift (Gregório et al., 2024).
FastLexRank: Enables real-time, scalable graph centrality in fully connected (dense) graphs for social media streams and large document arrays (Li et al., 2024).
Hybrid Extractive–Abstractive: LexRank-extracted sentences serve as input bottlenecks to neural abstractive summarizers, reducing input length and distilling core content (Jayatilleke et al., 13 Mar 2025).
Cross-domain adaptation: LexRank has been adapted for audio/musical segment summarization (Raposo et al., 2014), cross-lingual summarization, and input preparation for downstream LLM or SVM classifiers (Pal et al., 2020).

7. Impact and Contemporary Significance

LexRank established a high-water mark among unsupervised, graph-based summarizers prior to the deep-learning era, consistently outperforming centroid-based and simple positional strategies in multi-document benchmarks (Erkan et al., 2011, Zhang et al., 2024). Its random-walk-based global centrality and shallow-linguistic simplicity enabled broad applicability, including extension to diverse media beyond text.

Despite the ascendancy of sequence-to-sequence neural models and LLMs for abstractive summarization, LexRank remains state-of-the-art for:

Resource-efficient extractive preprocessing in lengthy document summarization.
Unsupervised domain-adaptable summarization amidst label scarcity or concept drift.
Modular hybrid pipelines that require fast screening, deduplication, or salience estimation under tight computational budgets.

Recent research continues to integrate LexRank as either a robust extractive front end (including robust $L_1$ variants), a fast filter for large-scale graph scenarios, or as a component for guided or supervised abstraction pipelines in specialized domains (Timonina-Farkas, 4 Sep 2025, Li et al., 2024, Jayatilleke et al., 13 Mar 2025, Gregório et al., 2024). Its design principles and transition matrices underpin much of contemporary research in graph-based NLP, with ongoing relevance in scalable summarization, content curation, and centrality detection.