Longformer-Encoder-Decoder (LED) for Long Documents
- LED is a Transformer-based model that replaces standard encoder self-attention with sparse local and global mechanisms, enabling the efficient processing of thousands of tokens.
- It integrates a Longformer-style encoder with local windowed and global attention while using a standard Transformer decoder for generating abstractive summaries.
- Empirical results on arXiv data demonstrate that extending input length significantly boosts ROUGE scores, validating LED's effectiveness for long-document summarization.
The Longformer-Encoder-Decoder (LED) is a Transformer-based architecture designed to support long-document generative sequence-to-sequence tasks, notably summarization, by incorporating the Longformer’s sparse attention mechanism into the encoder. LED significantly extends the scalable input capacity of encoder–decoder architectures by replacing the prohibitive quadratic complexity of standard self-attention with a combination of local and global attention patterns, enabling effective processing of documents containing thousands to tens of thousands of tokens. The decoder remains architecturally analogous to the standard Transformer, using full masked self-attention and standard cross-attention mechanisms. This hybridization allows LED to retain the flexibility and performance of pre-trained models like BART while substantially scaling document length capacity for sequence-to-sequence applications such as scientific paper summarization (Beltagy et al., 2020).
1. Architecture and Attention Mechanisms
LED replaces the standard Transformer's encoder self-attention with the Longformer’s sparse attention, while preserving the original decoder design. The encoder is composed of repeated layers, each integrating:
- Local windowed self-attention: Each token attends to a sliding window of size $w$ centered on its own position.
- Global attention: Selected tokens (task-specific, such as special tokens or paragraph markers) attend to, and are attended by, all positions.
The decoder follows the original Transformer specification, utilizing:
- Causal full self-attention over previously generated tokens;
- Cross-attention from each decoder position to all encoder outputs.
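The encoder's sparse attention pattern above can be sketched as a boolean mask over query–key pairs (a toy illustration with assumed parameter names; real implementations never materialize the dense $n \times n$ mask):

```python
import numpy as np

def led_encoder_mask(n, w, global_idx):
    """Boolean mask: entry (i, j) is True where query i may attend to key j.

    Combines a sliding window of size w (each token sees +/- w//2 neighbors)
    with global tokens that attend to, and are attended by, every position.
    Toy sketch only: building the dense n x n mask defeats the purpose at scale.
    """
    pos = np.arange(n)
    mask = np.abs(pos[:, None] - pos[None, :]) <= w // 2  # local windows
    mask[global_idx, :] = True   # global tokens attend everywhere
    mask[:, global_idx] = True   # every token attends to global tokens
    return mask

mask = led_encoder_mask(n=16, w=4, global_idx=[0])
```

Rows and columns for global tokens are fully dense, while all other rows are nonzero only inside the window, which is what makes the pattern's cost linear in $n$.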
Diagrammatically, the architecture is:
```
Input Document (length n)
        ↓
Encoder (Longformer-style)
  • local windowed self-attention
  • global attention
  • FFN, LayerNorm, residuals
  (repeated L_enc times)
        ↓
Decoder (standard Transformer)
  • masked full self-attention (length m)
  • cross-attention to encoder outputs
  • FFN, LayerNorm, residuals
  (repeated L_dec times)
        ↓
Linear & Softmax → Summary Tokens
```
In LED, only the encoder’s self-attention is replaced; decoder computation remains unchanged from Vaswani et al.
2. Mathematical Formulation
Let $x_1, \dots, x_n$ be the encoder inputs, $y_1, \dots, y_m$ the decoder inputs, $w$ the attention window size, and $G \subseteq \{1, \dots, n\}$ the set of global token indices (with $g = |G|$).
- Encoder self-attention:
  - Local: Each position $i$ attends to positions $j$ in the window $\{j : |i - j| \le w/2\}$.
  - Global: Each position $i \in G$ attends to all positions $1, \dots, n$, and every position attends to each $i \in G$.

  The output at position $i$ is

  $$z_i = \sum_{j \in N(i)} \alpha_{ij}\, x_j W_V,$$

  where $N(i)$ is the set of positions $i$ attends to, with softmax-normalized weights $\alpha_{ij} \propto \exp\!\big((x_i W_Q)(x_j W_K)^\top / \sqrt{d_k}\big)$, using distinct projection matrices $\{W_Q, W_K, W_V\}$ for local and $\{W_Q^g, W_K^g, W_V^g\}$ for global attention.
- Decoder self-attention: Standard causal full attention:

  $$z_i = \sum_{j \le i} \alpha_{ij}\, y_j W_V,$$

  with softmax-normalized $\alpha_{ij} \propto \exp\!\big((y_i W_Q)(y_j W_K)^\top / \sqrt{d_k}\big)$.
- Cross-attention: Standard full attention from each decoder position to all $n$ encoder outputs $h_1, \dots, h_n$:

  $$z_i = \sum_{j=1}^{n} \alpha_{ij}\, h_j W_V,$$

  with standard softmax weights $\alpha_{ij} \propto \exp\!\big((y_i W_Q)(h_j W_K)^\top / \sqrt{d_k}\big)$.
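The formulation above can be checked concretely with a small NumPy sketch that computes the softmax-normalized weighted sum for an arbitrary attention pattern (toy dimensions; the projection matrices are random stand-ins, not trained weights):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 4
x = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(inputs, allowed):
    """z_i = sum_j alpha_ij (x_j W_V), with alpha softmax-normalized
    over the positions each query is allowed to attend to."""
    q, k, v = inputs @ Wq, inputs @ Wk, inputs @ Wv
    logits = q @ k.T / np.sqrt(d)
    logits[~allowed] = -np.inf            # disallowed pairs get zero weight
    alpha = np.exp(logits - logits.max(-1, keepdims=True))
    alpha /= alpha.sum(-1, keepdims=True)
    return alpha @ v, alpha

# causal (decoder-style) pattern: position i may attend to j <= i
causal = np.tril(np.ones((n, n), dtype=bool))
z, alpha = attend(x, causal)
```

The same `attend` helper reproduces the local or global encoder patterns by swapping in the corresponding boolean mask for `allowed`.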
3. Local vs. Global Attention in LED
- Local windowed attention confines each token to attending within a window of $w$ neighboring positions, avoiding full-sequence computation and yielding a per-layer receptive field of size $w$ that compounds to roughly $\ell \times w$ across $\ell$ encoder layers.
- Global attention allows selected tokens (e.g., the initial `<s>` token for summarization, question tokens for QA tasks) to interact densely with all positions, preserving task-relevant long-range dependencies.
- The overall encoder complexity becomes $O(n(w + g))$, as opposed to the standard Transformer's $O(n^2)$. The number and selection of global tokens are task-dependent.
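Plugging in the summarization configuration used later in this section ($n = 16{,}384$, $w = 1{,}024$, a single global token) shows the scale of the saving as simple arithmetic:

```python
n, w, g = 16_384, 1_024, 1    # LED-large arXiv configuration (one global <s> token)
dense  = n * n                # attention pairs under standard O(n^2) self-attention
sparse = n * (w + g)          # pairs under the local + global O(n(w + g)) pattern
ratio  = dense / sparse       # dense attention computes ~16x more pairs
```

At this input length the sparse pattern touches roughly one sixteenth of the query–key pairs a dense encoder would, and the gap widens linearly as $n$ grows with $w$ fixed.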
4. Computational Complexity and Implementation
Let $n$ be the input length, $m$ the output length, $w$ the window size ($w \ll n$), and $g = |G|$ the number of global tokens:
| Architecture | Encoder Self-Attn | Decoder Self-Attn | Cross-Attn | Total |
|---|---|---|---|---|
| Standard Transformer | $O(n^2)$ | $O(m^2)$ | $O(nm)$ | $O(n^2 + m^2 + nm)$ |
| Longformer (encoder only) | $O(n(w+g))$ | N/A | N/A | $O(n(w+g))$ |
| LED | $O(n(w+g))$ | $O(m^2)$ | $O(nm)$ | $O(n(w+g) + m^2 + nm)$ |
Typically $m \ll n$ (e.g., summaries of ~512 tokens, documents up to ~16,000 tokens), so LED significantly reduces computation on the encoder side. Implementation efficiency is achieved via chunk-based parallelization of the local+global pattern, which leverages batched local windows and fused GPU operations.
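One way the chunk-based computation can be sketched (an illustrative NumPy reconstruction, not the library's actual kernel): split the sequence into chunks of size $w$, let each chunk attend to itself and its two neighbors, and mask out-of-window pairs, reproducing windowed attention without ever forming an $n \times n$ score matrix.

```python
import numpy as np

def chunked_local_attention(q, k, v, w):
    """Windowed self-attention via chunking: each query attends only to keys
    within +/- w//2, computed chunk-by-chunk (toy single-head sketch)."""
    n, d = q.shape
    assert n % w == 0, "pad the sequence to a multiple of w"
    half, out = w // 2, np.zeros_like(q)
    for c0 in range(0, n, w):
        lo, hi = max(0, c0 - w), min(n, c0 + 2 * w)   # this chunk +/- one chunk
        logits = q[c0:c0 + w] @ k[lo:hi].T / np.sqrt(d)
        qpos = np.arange(c0, c0 + w)[:, None]
        kpos = np.arange(lo, hi)[None, :]
        logits[np.abs(qpos - kpos) > half] = -1e9      # outside the window
        a = np.exp(logits - logits.max(-1, keepdims=True))
        out[c0:c0 + w] = (a / a.sum(-1, keepdims=True)) @ v[lo:hi]
    return out

def dense_windowed_attention(q, k, v, w):
    """Reference: identical window pattern via a full n x n score matrix."""
    n, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    pos = np.arange(n)
    logits[np.abs(pos[:, None] - pos[None, :]) > w // 2] = -1e9
    a = np.exp(logits - logits.max(-1, keepdims=True))
    return (a / a.sum(-1, keepdims=True)) @ v

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 64, 8))
ok = np.allclose(chunked_local_attention(q, k, v, 16),
                 dense_windowed_attention(q, k, v, 16))
```

The chunked variant materializes only $w \times 3w$ score blocks, so memory stays linear in $n$; production implementations additionally batch the chunks and fuse the masking and softmax on the GPU.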
5. Training Protocol and Summarization Task Details
- Initialization: LED is initialized from a pre-trained BART model by copying weights.
- Position embedding extension: BART’s learned 1,024 absolute position embeddings are extended to 16,384 by block-wise repetition (16×).
- Encoder attention: Employs a window size of 1,024; global attention is assigned only to the initial `<s>` token.
- Decoder: Retains BART’s full masked self-attention and encoder–decoder cross-attention.
- Fine-tuning: Conducted on gold-standard arXiv abstracts using teacher-forcing cross-entropy. Optimizers include Adafactor/AdamW with warm-up and linear decay schedules; decoding utilizes beam search (beam width = 4) with a maximum output of approximately 512 tokens.
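The position-embedding extension above amounts to tiling the learned table; a minimal sketch (toy embedding dimension, with a random stand-in for BART's learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)
bart_pos = rng.normal(size=(1024, 32))  # stand-in for BART's 1,024 learned position embeddings
led_pos = np.tile(bart_pos, (16, 1))    # block-wise repetition, 16x -> 16,384 positions
```

Repeating the table preserves the local relative-position structure BART learned within each 1,024-token block, which gives fine-tuning a much better starting point than randomly initialized embeddings for the new positions.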
6. Empirical Results: Summarization Performance on arXiv
The arXiv summarization dataset contains very long inputs (90th percentile ≈14,000 tokens), challenging existing Transformer-based methods limited to ≤4,096 input tokens. LED’s ability to scale to 16,384 input tokens enables comprehensive summarization without aggressive truncation. Reported ROUGE scores for various models are:
| Model | Input Length | R-1 | R-2 | R-L |
|---|---|---|---|---|
| Discourse-Aware | ≪4,000 | 35.80 | 11.05 | 31.80 |
| Extr-Abst-TLM | ≪4,000 | 41.62 | 14.69 | 38.03 |
| Dancer | ≪4,000 | 42.70 | 16.54 | 38.44 |
| Pegasus | ≪4,000 | 44.21 | 16.95 | 38.83 |
| LED-large | 4,096 | 44.40 | 17.94 | 39.76 |
| BigBird | 4,096 | 46.63 | 19.02 | 41.77 |
| LED-large | 16,384 | 46.63 | 19.62 | 41.83 |
Extending LED’s encoder from 4,096 to 16,384 tokens increases R-2 from 17.94 to 19.62 and R-L from 39.76 to 41.83, matching or exceeding the performance of BigBird (which was further pre-trained for summarization). This demonstrates the benefit of full document context for summarization and the efficacy of LED in scaling encoder–decoder architectures to very long inputs.
7. Significance and Implications
LED enables efficient, scalable encoder–decoder Transformer applications for long-document tasks by replacing the encoder self-attention with the Longformer's sparse local+global pattern. The architecture offers a practical solution for sequence-to-sequence models where the input length is the limiting factor—most notably, abstractive summarization of research papers and other scientific documents. LED can be initialized from large, pre-trained sequence-to-sequence models (e.g., BART) and fine-tuned on long-document corpora, achieving state-of-the-art or competitive performance at lower computational cost relative to standard architectures. This capability positions LED as a foundational tool for tasks requiring full-document context and sets a precedent for future architectural adaptations targeting extreme-length sequence modeling (Beltagy et al., 2020).