Longformer-Encoder-Decoder (LED) for Long Documents

Updated 23 February 2026
  • LED is a Transformer-based model that replaces standard encoder self-attention with sparse local and global mechanisms, enabling the efficient processing of thousands of tokens.
  • It integrates a Longformer-style encoder with local windowed and global attention while using a standard Transformer decoder for generating abstractive summaries.
  • Empirical results on arXiv data demonstrate that extending input length significantly boosts ROUGE scores, validating LED's effectiveness for long-document summarization.

The Longformer-Encoder-Decoder (LED) is a Transformer-based architecture designed to support long-document generative sequence-to-sequence tasks, notably summarization, by incorporating the Longformer’s sparse attention mechanism into the encoder. LED significantly extends the input capacity of encoder–decoder architectures by replacing the prohibitive O(n²) complexity of standard self-attention with an O(nw) combination of local and global attention patterns, enabling effective processing of documents containing thousands to tens of thousands of tokens. The decoder remains architecturally identical to the standard Transformer, using full masked self-attention and standard cross-attention. This hybridization allows LED to retain the flexibility and performance of pre-trained models like BART while substantially scaling document length capacity for sequence-to-sequence applications such as scientific paper summarization (Beltagy et al., 2020).

1. Architecture and Attention Mechanisms

LED replaces the standard Transformer's encoder self-attention with the Longformer’s sparse attention, while preserving the original decoder design. The encoder is composed of repeated layers, each integrating:

  • Local windowed self-attention: Each token attends to a sliding window of size w centered on its position.
  • Global attention: Select tokens (task-specific, such as special tokens or paragraph markers) attend to, and are attended by, all positions.

The decoder follows the original Transformer specification, utilizing:

  • Causal full self-attention over previously generated tokens;
  • Cross-attention from each decoder position to all encoder outputs.

Diagrammatically, the architecture is:

Input Document (length n)
        ↓
Encoder (Longformer-style)
    • local windowed self-attention
    • global attention
    • FFN, LayerNorm, residuals
(repeated L_enc times)
        ↓
Decoder (standard Transformer)
    • masked full self-attention (length m)
    • cross-attention to encoder outputs
    • FFN, LayerNorm, residuals
(repeated L_dec times)
        ↓
Linear & Softmax → Summary Tokens

In LED, only the encoder’s self-attention is replaced; decoder computation remains unchanged from the original Transformer (Vaswani et al., 2017).
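
The local+global pattern described above can be made concrete with a small mask-building sketch. This is illustrative NumPy, not the actual LED implementation; the sequence length (12), window size (4), and single global index (0, standing in for the <s> token) are toy values chosen for readability:

```python
import numpy as np

def led_encoder_attention_mask(n, window, global_idx):
    """Boolean mask: mask[i, j] is True iff query position i may attend
    to key position j under LED's local-windowed + global pattern."""
    mask = np.zeros((n, n), dtype=bool)
    half = window // 2
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        mask[i, lo:hi] = True          # sliding-window (local) attention
    for g in global_idx:
        mask[g, :] = True              # global token attends to all positions
        mask[:, g] = True              # ... and is attended to by all positions
    return mask

mask = led_encoder_attention_mask(n=12, window=4, global_idx=[0])
print(mask.sum(), "allowed pairs out of", mask.size)   # 72 allowed pairs out of 144
```

Even at this toy scale, only half of the n² query–key pairs are computed; the savings grow linearly in n at real document lengths.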

2. Mathematical Formulation

Let X ∈ ℝ^{n×d} be the encoder inputs, H ∈ ℝ^{m×d} the decoder inputs, w the window size, and G ⊆ {1, …, n} the set of global token indices, with |G| ≪ n.

  • Encoder self-attention:

    • Local: each position i attends to the window W(i) = {j : |j − i| ≤ w/2}.
    • Global: every position additionally attends to (and is attended by) the global positions j ∈ G.

    The output at position ii is

    $\mathrm{Attn}_{\mathrm{enc}}(X)_i = \sum_{j \in W(i)} \alpha_{ij}^{s}\, V_s X_j + \sum_{j \in G} \alpha_{ij}^{g}\, V_g X_j$

    with softmax-normalized weights α_ij^s, α_ij^g computed using distinct projection matrices (Q_s, K_s, V_s) for local and (Q_g, K_g, V_g) for global attention.

  • Decoder self-attention: Standard causal full attention:

    $\mathrm{Attn}_{\mathrm{dec}}(H)_i = \sum_{j=1}^{i} \beta_{ij}\, V_d H_j$

    with softmax-normalized weights β_ij.

  • Cross-attention: Standard full attention from decoder to all encoder outputs:

    $\mathrm{Attn}_{\mathrm{cross}}(H,X)_i = \sum_{j=1}^{n} \gamma_{ij}\, V_e X_j$

    with standard softmax weights.
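
The encoder formula can be transcribed directly into code as a check on the notation. The sketch below is an unvectorized, single-head toy implementation with random projections; for simplicity it applies one softmax over the union W(i) ∪ G rather than normalizing the local and global patterns separately, which is a simplification of the α^s / α^g treatment above:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attn_enc(X, w, G, Qs, Ks, Vs, Qg, Kg, Vg):
    """Single-head LED encoder self-attention, transcribed from the formula.
    Local pairs use (Qs, Ks, Vs); global pairs use (Qg, Kg, Vg)."""
    n, d = X.shape
    half = w // 2
    out = np.zeros_like(X)
    for i in range(n):
        loc = list(range(max(0, i - half), min(n, i + half + 1)))
        glo = [j for j in G if j not in loc]        # avoid double-counting
        scores = np.array(
            [(X[i] @ Qs) @ (X[j] @ Ks) for j in loc] +
            [(X[i] @ Qg) @ (X[j] @ Kg) for j in glo]) / np.sqrt(d)
        alpha = softmax(scores)                     # attention weights
        vals = np.stack([X[j] @ Vs for j in loc] +
                        [X[j] @ Vg for j in glo])
        out[i] = alpha @ vals                       # weighted sum of values
    return out

rng = np.random.default_rng(0)
n, d = 16, 8
X = rng.normal(size=(n, d))
Qs, Ks, Vs, Qg, Kg, Vg = (rng.normal(size=(d, d)) * 0.3 for _ in range(6))
Y = attn_enc(X, 4, [0], Qs, Ks, Vs, Qg, Kg, Vg)
print(Y.shape)   # (16, 8)
```

The inner loop over positions makes the O(nw + n|G|) score count explicit; production implementations vectorize it with chunked batched matrix multiplies.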

3. Local vs. Global Attention in LED

  • Local windowed attention confines each token to attending within a window W(i), minimizing redundant computation and yielding a per-layer receptive field of size w that compounds across encoder depth.
  • Global attention allows select tokens (e.g., the initial <s> token for summarization, question tokens for QA tasks) to interact densely with all positions, preserving task-relevant long-range dependencies.
  • The overall encoder complexity becomes

    O(nw + n|G|) ≈ O(n)

    as opposed to the standard Transformer's O(n²). The size and selection of the global token set are task-dependent.
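
Plugging in the configuration used later in this article (n = 16,384, w = 1,024, |G| = 1, i.e., global attention only on <s>) makes the savings concrete:

```python
# Attention score counts for LED's long-document configuration.
n, w, g = 16_384, 1_024, 1    # input length, window size, |G|
dense  = n * n                # standard full self-attention: O(n^2) pairs
sparse = n * w + n * g        # local windows + global tokens: O(nw + n|G|)
print(round(dense / sparse, 1))   # → 16.0: full attention scores ~16x as many pairs
```

The ratio scales as roughly n/w, so the advantage widens further for longer documents or smaller windows.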

4. Computational Complexity and Implementation

Let n be the input length, m the output length, w the window size (w ≪ n), and |G| ≪ n the number of global tokens:

Architecture                 Encoder Self-Attn   Decoder Self-Attn   Cross-Attn   Total
Standard Transformer         O(n²)               O(m²)               O(mn)        O(n² + m² + mn)
Longformer (encoder only)    O(nw + n|G|)        N/A                 N/A          O(n)
LED                          O(nw + n|G|)        O(m²)               O(mn)        O(nw + n|G| + m² + mn)

Typically m ≪ n (e.g., summaries of ~512 tokens versus documents of up to ~16,000 tokens), so LED drastically reduces encoder-side computation. Implementation efficiency comes from chunk-based parallelization of the local+global pattern, which batches local windows into fixed-size chunks and exploits fused GPU operations.
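
The chunking idea can be sketched as follows: the sequence is viewed as overlapping chunks of length 2w with stride w, so every size-w local window lies entirely inside one chunk and attention can be computed chunk-by-chunk in parallel. This is an illustrative NumPy sketch with toy shapes; real implementations additionally mask out-of-window pairs inside each chunk and pad sequences to a multiple of w:

```python
import numpy as np

def overlapping_chunks(X, w):
    """View an (n, d) sequence as overlapping chunks of length 2w with
    stride w, without copying the data. Assumes n is a multiple of w
    and n >= 2w (real implementations pad instead of asserting)."""
    n, d = X.shape
    assert n % w == 0 and n >= 2 * w
    num_chunks = n // w - 1
    s0, s1 = X.strides
    # Stride trick: chunk c aliases rows c*w .. c*w + 2w - 1 of X.
    return np.lib.stride_tricks.as_strided(
        X, shape=(num_chunks, 2 * w, d), strides=(w * s0, s0, s1))

X = np.arange(32, dtype=float).reshape(16, 2)   # toy sequence: n=16, d=2
chunks = overlapping_chunks(X, w=4)
print(chunks.shape)   # (3, 8, 2): three chunks of 8 tokens each
```

Because the chunks are strided views rather than copies, the memory cost stays O(n·d) while the per-chunk attention becomes a single batched matrix multiply.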

5. Training Protocol and Summarization Task Details

  • Initialization: LED is initialized from a pre-trained BART model by copying weights.
  • Position embedding extension: BART’s learned 1,024 absolute position embeddings are extended to 16,384 by block-wise repetition (16×).
  • Encoder attention: Employs window size 1,024; global attention is assigned only to the initial <s> token.
  • Decoder: Retains BART’s full masked self-attention and encoder–decoder cross-attention.
  • Fine-tuning: Conducted on gold-standard arXiv abstracts using teacher-forcing cross-entropy. Optimizers include Adafactor/AdamW with warm-up and linear decay schedules; decoding utilizes beam search (beam width = 4) with a maximum output of approximately 512 tokens.
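
The position-embedding extension in the second bullet amounts to tiling the learned table. The sketch below uses a random stand-in table and a toy embedding width (8, not BART's real model dimension) purely to show the shape manipulation:

```python
import numpy as np

# Toy stand-in for BART's learned position table: 1,024 positions.
# (Embedding dimension 8 is illustrative, not BART's real width.)
bart_pos = np.random.default_rng(0).normal(size=(1024, 8))

# LED's 16,384-position table: the trained 1,024-slot block repeated 16x,
# so fine-tuning starts from already-trained values at every position.
led_pos = np.tile(bart_pos, (16, 1))

print(led_pos.shape)   # (16384, 8)
```

Repetition (rather than random initialization) preserves the local relative-position structure BART learned within each 1,024-token block, which fine-tuning then adapts to longer ranges.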

6. Empirical Results: Summarization Performance on arXiv

The arXiv summarization dataset contains very long inputs (90th percentile ≈14,000 tokens), challenging existing Transformer-based methods limited to ≤4,096 input tokens. LED’s ability to scale to 16,384 input tokens enables comprehensive summarization without aggressive truncation. Reported ROUGE scores for various models are:

Model             Input Length   R-1     R-2     R-L
Discourse-Aware   ≪4,000         35.80   11.05   31.80
Extr-Abst-TLM     ≪4,000         41.62   14.69   38.03
Dancer            ≪4,000         42.70   16.54   38.44
Pegasus           ≪4,000         44.21   16.95   38.83
LED-large         4,096          44.40   17.94   39.76
BigBird           4,096          46.63   19.02   41.77
LED-large         16,384         46.63   19.62   41.83

Extending LED’s encoder from 4,096 to 16,384 tokens increases R-2 from 17.94 to 19.62 and R-L from 39.76 to 41.83, matching or exceeding the performance of BigBird (which was further pre-trained for summarization). This demonstrates the benefit of full document context for summarization and the efficacy of LED in scaling encoder–decoder architectures to very long inputs.

7. Significance and Implications

LED enables efficient, scalable encoder–decoder Transformer applications for long-document tasks by replacing the encoder self-attention with the Longformer's sparse local+global pattern. The architecture offers a practical solution for sequence-to-sequence models where the input length is the limiting factor—most notably, abstractive summarization of research papers and other scientific documents. LED can be initialized from large, pre-trained sequence-to-sequence models (e.g., BART) and fine-tuned on long-document corpora, achieving state-of-the-art or competitive performance at lower computational cost relative to standard architectures. This capability positions LED as a foundational tool for tasks requiring full-document context and sets a precedent for future architectural adaptations targeting extreme-length sequence modeling (Beltagy et al., 2020).

References

  • Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The Long-Document Transformer. arXiv:2004.05150.
