Longformer-based Content Extractor

Updated 10 February 2026
  • The paper introduces a Longformer-based extractor that leverages sparse local and global attention to efficiently process long documents for extraction and summarization.
  • It details an encoder-decoder framework, combining Longformer with BART-style decoders, which achieves superior ROUGE metrics and high expert ratings on medical corpora.
  • The study also examines challenges in token-level rationale extraction and proposes hybrid architectures to improve precision and reduce redundancy.

A Longformer-based content extractor is a neural architecture designed to perform information extraction, rationale identification, and summarization on long-form documents using Longformer’s sparse attention mechanism. These extractors are especially effective with lengthy inputs typical in domains such as healthcare, academic publishing, and document-level natural language processing, where standard transformers are limited by quadratic scaling in self-attention.

1. Longformer Sparse Attention: Foundations and Variants

The Longformer architecture introduces a linear-scaling self-attention mechanism with two core components: local (sliding-window) and global attention. Unlike standard transformers, which compute dense attention between all token pairs at $O(n^2)$ cost, Longformer restricts most tokens to attend to their $w$-neighborhood, while a select global set communicates with the entire sequence. Typical settings use window size $w = 256$–$512$, with global tokens comprising $1$–$2\%$ of the sequence (often special markers such as [CLS] or section headers). The encoder layer computes, for input $X \in \mathbb{R}^{n \times d}$:

  • Local attention:

$$
A_{ij}^{\text{local}} =
\begin{cases}
\text{softmax}_{j \in N_i}\!\left( \dfrac{Q_i K_j^{T}}{\sqrt{d_k}} \right), & j \in N_i \\
0, & \text{otherwise}
\end{cases}
$$

where $N_i$ is the window of tokens around position $i$.

  • Global attention:

$$
A_{ij}^{\text{global}} =
\begin{cases}
\text{softmax}_{j=1}^{n}\!\left( \dfrac{Q_i K_j^{T}}{\sqrt{d_k}} \right), & i \in G \\
0, & \text{otherwise}
\end{cases}
$$

Implementation involves building a sparse $n \times n$ attention mask and combining outputs accordingly. This mechanism allows Longformer-based models to efficiently encode documents of up to $16$K tokens for extractive or generative purposes (Beltagy et al., 2020).
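
As an illustration, the minimal PyTorch sketch below builds the combined local/global pattern as a dense boolean mask and applies it in scaled dot-product attention. The function names are illustrative; production implementations (including the original Longformer code) use banded or custom CUDA kernels rather than a dense $n \times n$ mask.

```python
import torch

def longformer_attention_mask(seq_len, window, global_idx):
    """Boolean (seq_len x seq_len) mask; True marks allowed attention.

    Local: each token attends to tokens within +/- window // 2.
    Global: tokens in `global_idx` attend to, and are attended by, every position.
    Note: a dense mask is O(n^2) memory; real kernels avoid materializing it.
    """
    i = torch.arange(seq_len)
    mask = (i[:, None] - i[None, :]).abs() <= window // 2   # sliding-window pattern
    g = torch.tensor(global_idx)
    mask[g, :] = True   # global tokens attend everywhere
    mask[:, g] = True   # every token attends to global tokens
    return mask

def masked_attention(Q, K, V, mask):
    """Scaled dot-product attention restricted by the boolean mask."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V

# Toy usage: 16 tokens, window of 4, [CLS] at position 0 marked global.
n, d = 16, 8
Q = K = V = torch.randn(n, d)
mask = longformer_attention_mask(n, window=4, global_idx=[0])
out = masked_attention(Q, K, V, mask)
print(out.shape)  # torch.Size([16, 8])
```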

2. Longformer-Based Summarization Frameworks

For summarization, the typical workflow employs an encoder–decoder architecture with Longformer-based encoders and BART-style (auto-regressive) decoders. The encoder processes the document $X = \{x_1, \dots, x_n\}$ with input embeddings $h_i^{(0)} = E(x_i) + P_i$ (learned positional embeddings), stacking $L$ Longformer layers to produce $H = \{h_1, \dots, h_n\}$. The decoder attends to $H$ to generate summaries:

$$
L = -\sum_{t=1}^{m} \log p(y_t \mid y_{<t}, X)
$$

where $Y = \{y_1, \dots, y_m\}$ is the target summary. Fine-tuning typically involves standard settings: AdamW optimizer (learning rate $3 \times 10^{-5}$), batch size 4 per GPU, up to 5 epochs with early stopping on validation ROUGE, maximum input length $n = 512$, and gradient clipping at $C = 1.0$ (Sun et al., 10 Mar 2025).
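
A hedged sketch of this fine-tuning recipe is shown below, assuming the Hugging Face `allenai/led-base-16384` checkpoint and a pre-built `train_loader`; neither is specified by the cited study, and any Longformer-encoder sequence-to-sequence model exposing a `.loss` would serve equally well.

```python
import torch
from transformers import AutoTokenizer, LEDForConditionalGeneration

# Assumed checkpoint; the cited study's exact model and data pipeline are not given.
model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")
tokenizer = AutoTokenizer.from_pretrained("allenai/led-base-16384")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)   # lr = 3e-5 as above

def train_epoch(model, train_loader, device="cuda"):
    model.train().to(device)
    for batch in train_loader:   # batches of 4 tokenized (document, summary) pairs
        outputs = model(
            input_ids=batch["input_ids"].to(device),
            attention_mask=batch["attention_mask"].to(device),
            labels=batch["labels"].to(device),   # shifted internally for teacher forcing
        )
        loss = outputs.loss                      # token-level negative log-likelihood
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # C = 1.0
        optimizer.step()
        optimizer.zero_grad()

# After each epoch: compute validation ROUGE and stop early if it fails to improve.
```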

Longformer-based summarizers outperform RNN, standard Transformer, BERT, and T5 baselines in both ROUGE and expert evaluations on medical corpora, achieving ROUGE-1/2/L of 0.72/0.61/0.70 for Longformer, compared to 0.69/0.57/0.66 for T5 and 0.66/0.54/0.64 for BERT. Experts rate the summaries highly on information retention (4.8/5) and grammar (4.9/5), while conciseness (4.0/5) remains an area for improvement (Sun et al., 10 Mar 2025).

3. Extractive Content Selection and Rationale Extraction

In extractive paradigms, the Longformer's global attention slots (e.g., the [CLS] token) are leveraged to aggregate document information. The self-attention matrix $A \in \mathbb{R}^{(N+1) \times (N+1)}$ contains token-level attention weights from [CLS] to input positions ($\alpha_i = A[0, i]$). The top $k\%$ of tokens by $\alpha_i$ can be selected as rationales. However, for documents exceeding $1000$ tokens, these attention weights become diffuse (since $\sum_i \alpha_i = 1$), yielding poor token-level F1 (e.g., 14.75 on BEA-2019, often no better than random) (Bujel et al., 2023).
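
The selection step itself is simple. The sketch below (plain PyTorch, with a random softmax row standing in for $A[0,:]$, since the exact extraction pipeline is not reproduced here) makes concrete both the top-$k\%$ rule and why a near-uniform attention row over long inputs carries little signal.

```python
import torch

def topk_rationale_from_cls(attn_row, k_percent):
    """Select token indices whose [CLS] attention weight alpha_i = A[0, i]
    lies in the top k% of positions (excluding [CLS] itself at index 0)."""
    alphas = attn_row[1:]                       # weights from [CLS] to content tokens
    k = max(1, int(k_percent / 100 * alphas.numel()))
    top = torch.topk(alphas, k).indices + 1     # +1 maps back to full-sequence indices
    return top.sort().values

# Since sum_i alpha_i = 1, a near-uniform row over 1000+ tokens gives every token
# weight ~1/n, so the top-k% cut separates little from the rest.
attn_row = torch.softmax(torch.randn(1024) * 0.1, dim=0)   # stand-in for A[0, :]
print(topk_rationale_from_cls(attn_row, k_percent=5).shape)
```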

Weighted soft attention modules, when applied to Longformer embeddings, face similar issues: most tokens are assigned high scores, leading to high recall but poor precision. Only the token pairs with maximal/minimal scores receive direct supervision per epoch, resulting in lackluster token-level extraction (token F1 $\approx 21.22$ on BEA-2019, with F$_{0.5}$ and MAP even lower) (Bujel et al., 2023).
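
For concreteness, a generic per-token soft-attention scorer of this kind might look as follows. This is a minimal sketch rather than the exact module of Bujel et al. (2023); the hidden size and sigmoid scoring are assumptions.

```python
import torch
import torch.nn as nn

class SoftAttentionScorer(nn.Module):
    """Generic soft-attention head: one scalar relevance score per token."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, token_embeddings):
        # token_embeddings: (batch, seq_len, hidden) -> per-token scores in [0, 1]
        return torch.sigmoid(self.scorer(token_embeddings)).squeeze(-1)

# With sequence-level supervision, only the extreme-scoring tokens receive a direct
# gradient signal each step, one reason token-level precision stays low.
scores = SoftAttentionScorer()(torch.randn(2, 1024, 768))
print(scores.shape)  # torch.Size([2, 1024])
```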

4. Architectures and Implementation Strategies

The base Longformer model is typically a 12-layer encoder (hidden size 768, 12 heads, head dimension 64), with local window $w = 256$–$512$ and extensible position embeddings (copied from RoBERTa for longer context). In encoder-decoder settings, such as LED for summarization, architectures mirror BART (e.g., 6-layer base and 12-layer large variants). The decoder employs standard (dense) self-attention and encoder-decoder cross-attention (Beltagy et al., 2020).
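
A configuration sketch matching this description, assuming the Hugging Face `transformers` classes `LongformerConfig` and `LongformerModel`, is shown below; in practice one usually starts from the pretrained `allenai/longformer-base-4096` checkpoint rather than random initialization.

```python
from transformers import LongformerConfig, LongformerModel

# Base configuration as described above: 12 layers, hidden size 768, 12 heads,
# sliding window of 512. Values are illustrative, not a prescribed recipe.
config = LongformerConfig(
    num_hidden_layers=12,
    hidden_size=768,
    num_attention_heads=12,          # head dim = 768 / 12 = 64
    attention_window=512,            # may also be a per-layer list of window sizes
    max_position_embeddings=4098,    # extended positions for long-context input
)
model = LongformerModel(config)
```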

For global attention assignment, it is recommended to mark tokens critical for document-wide communication (e.g., questions, candidate answers, [CLS], section headers). In deployment, maintaining separation between global and local projections and extending positional embeddings are crucial for scalability. The LED variant enables summarization for inputs up to 16K tokens, as validated on the arXiv summarization dataset (Beltagy et al., 2020).
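
In the Hugging Face implementation, global tokens are marked via a `global_attention_mask`. The sketch below marks only [CLS] as global, a common minimal choice for classification-style tasks; question or section-header positions would be added analogously for QA or extraction.

```python
import torch
from transformers import AutoTokenizer, LongformerModel

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer("A very long document ...", return_tensors="pt")
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1   # 1 = global attention ([CLS]); 0 = local-only
outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)
```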

5. Empirical Performance and Considerations

Longformer-based extractors and summarizers have demonstrated strong performance across a diverse set of long-document tasks. Summarization results on medical corpora (Sun et al., 10 Mar 2025):

| Model       | ROUGE-1 | ROUGE-2 | ROUGE-L |
|-------------|---------|---------|---------|
| RNN         | 0.47    | 0.35    | 0.44    |
| Transformer | 0.54    | 0.41    | 0.52    |
| BERT        | 0.66    | 0.54    | 0.64    |
| T5          | 0.69    | 0.57    | 0.66    |
| Longformer  | 0.72    | 0.61    | 0.70    |

Clinical expert evaluation (on a 1–5 scale): Conciseness (4.0), Information retention (4.8), Readability (4.3), Grammar (4.9). Extractive variants, however, may fail to provide plausible rationales for document-level tasks due to the diffusion of importance scores, a major empirical finding in unsupervised rationale extraction (Bujel et al., 2023).

On extractive QA and classification, Longformer-base outperforms RoBERTa-base in tasks such as WikiHop, TriviaQA, Hyperpartisan, and IMDB, demonstrating benefits for document-level prediction (Beltagy et al., 2020). However, for fine-grained token-level rationales, compositional soft attention (using sentence-wise RoBERTa) can surpass Longformer in both precision and computation time.

6. Limitations, Open Challenges, and Refinements

Current Longformer-based extractors face two principal challenges:

  • Redundant content in summaries and lack of conciseness, with observed repetition of background or extraneous clauses.
  • Diffuse attention weights in extractive settings, resulting in token-level rationales with poor alignment to human explanations, especially on lengthy documents.

Proposed improvements include introducing architectural redundancy penalties (coverage losses), gating mechanisms to suppress duplicate content, and reinforcement learning with rewards penalizing redundancy. Data-level interventions, such as including diverse summarization examples and more nuanced global token selection, may also enhance extractor fidelity (Sun et al., 10 Mar 2025).
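
As one concrete instantiation of the coverage-loss idea (the standard formulation of See et al., 2017, not a method taken from the cited papers), a redundancy penalty over decoder cross-attention could be sketched as:

```python
import torch

def coverage_penalty(cross_attentions):
    """Penalize the decoder for re-attending to source positions already covered.

    cross_attentions: (batch, tgt_len, src_len) decoder-to-encoder attention weights.
    """
    batch, tgt_len, src_len = cross_attentions.shape
    coverage = torch.zeros(batch, src_len, device=cross_attentions.device)
    penalty = 0.0
    for t in range(tgt_len):
        attn_t = cross_attentions[:, t, :]
        penalty = penalty + torch.minimum(attn_t, coverage).sum(dim=-1)
        coverage = coverage + attn_t
    return penalty.mean()

# Added to the NLL objective with a small weight, e.g.
# loss = nll + lambda_cov * coverage_penalty(attn)
```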

For rationale extraction, compositional soft attention architectures—with sentence-wise RoBERTa encoding followed by global soft-attention—outperform direct Longformer-based rationalizing modules, achieving higher token-level F1 and mean average precision, and shortening training epochs by 5–15% (Bujel et al., 2023). This suggests that hybrid architectures combining local and global attention merit further exploration.
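
The compositional design can be approximated as follows. This is a heavily hedged sketch of the general idea (independent sentence encodings pooled by a document-level soft attention); shapes and pooling choices are assumptions rather than details taken from Bujel et al. (2023).

```python
import torch
import torch.nn as nn

class CompositionalSoftAttention(nn.Module):
    """Document-level soft attention over independently encoded sentences."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.query = nn.Parameter(torch.randn(hidden_size))
        self.key = nn.Linear(hidden_size, hidden_size)

    def forward(self, sentence_embeddings):
        # sentence_embeddings: (num_sentences, hidden), e.g. [CLS] vectors from
        # RoBERTa run separately over each sentence of the document.
        scores = self.key(sentence_embeddings) @ self.query
        scores = scores / sentence_embeddings.size(-1) ** 0.5
        return torch.softmax(scores, dim=0)   # attention distribution over sentences

weights = CompositionalSoftAttention()(torch.randn(40, 768))
print(weights.shape)  # torch.Size([40])
```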

7. Applications and Deployment Considerations

Longformer-based content extractors are widely used in automated medical summarization, extractive question answering, long-document classification, and unsupervised rationale extraction. In clinical and scientific document summarization, Longformer-based models provide high information retention and grammatical accuracy, though conciseness and readability remain targets for future improvement. For general content extraction and QA, the approach's linear scaling makes it uniquely suited to long-context scenarios that exceed the limits of standard transformers.

Practical deployment requires attention to memory/cost trade-offs: sliding-window operations scale as $O(nw)$, with custom CUDA implementations advised for efficiency. Correctly extending position embeddings to longer sequences, care in global token assignment, mixed-precision handling, and gradient accumulation where needed are central to robust, large-scale applications (Beltagy et al., 2020).
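
A quick back-of-the-envelope comparison of the attention-score memory for dense versus sliding-window attention (fp16, one layer, one head; actual footprints depend on implementation details) illustrates the trade-off:

```python
def attention_scores_bytes(n, w=None, bytes_per_el=2):
    """Dense attention stores n*n scores; a sliding window of size w stores ~n*w."""
    cols = n if w is None else w
    return n * cols * bytes_per_el

n, w = 16_384, 512
print(f"dense:    {attention_scores_bytes(n) / 2**20:8.1f} MiB")     # ~512.0 MiB
print(f"windowed: {attention_scores_bytes(n, w) / 2**20:8.1f} MiB")  # ~ 16.0 MiB
```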

Together, these models and strategies constitute an accurate, scalable framework for extracting and summarizing content from long documents, with ongoing research directed toward enhanced conciseness, token-level rationale fidelity, and domain-specific deployment (Sun et al., 10 Mar 2025, Beltagy et al., 2020, Bujel et al., 2023).
