Longformer Encoder-Decoder (LED)
- Longformer Encoder-Decoder (LED) is a sequence-to-sequence Transformer that uses sparse local and global attention to efficiently process very long inputs.
- It employs a sliding window approach in the encoder to achieve linear scaling in computation, while the decoder retains standard full self-attention for coherent sequence generation.
- Empirical results show LED’s effectiveness in tasks such as document summarization, speech translation, and multilingual machine translation, highlighting its scalability and practical benefits.
The Longformer Encoder-Decoder (LED) model is a sequence-to-sequence Transformer designed for efficient processing of very long inputs, extending the linear-scaling attention innovations of Longformer to encoder-decoder architectures. LED is primarily motivated by the need to address computational bottlenecks in standard Transformers, which become infeasible for long-document inputs due to the quadratic complexity of self-attention, and to generalize efficient attention to generative tasks such as document summarization, machine translation, and direct speech-to-text translation.
1. Architectural Principles and Sparse Attention
The core design of LED preserves the architectural stack of conventional encoder-decoder Transformers (e.g., BART, T5), but with a fundamental modification: the encoder replaces full self-attention with Longformer's sparse local+global attention patterns. In LED, the encoder applies sliding-window (local) attention with a fixed window size $w$ per token, restricting each token's context to $w/2$ neighbors on either side, with selected positions (such as task-specific or special tokens) assigned global (full-sequence) attention. The decoder operates as a standard left-to-right autoregressive Transformer, retaining full self-attention over the output and cross-attention over the encoder's representations.
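To make the pattern concrete, the following is a minimal sketch of the local+global mask structure (an illustrative construction, not LED's actual banded/CUDA implementation; the window size and global positions are assumptions chosen for illustration):

```python
import torch

def sparse_attention_mask(seq_len: int, window: int, global_idx: list[int]) -> torch.Tensor:
    """Illustrative LED-style encoder mask: True where attention is allowed.

    Each token attends to `window // 2` neighbors on either side; tokens in
    `global_idx` attend to (and are attended by) every position.
    """
    pos = torch.arange(seq_len)
    # Sliding-window (local) attention: positions within window // 2 of each other.
    local = (pos[:, None] - pos[None, :]).abs() <= window // 2
    # Global attention is symmetric: selected rows and columns are fully visible.
    mask = local.clone()
    mask[global_idx, :] = True
    mask[:, global_idx] = True
    return mask

mask = sparse_attention_mask(seq_len=4096, window=512, global_idx=[0])
print(mask.float().mean())  # fraction of allowed attention pairs, well below 1.0
```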
Sparse attention modifies the standard attention computation
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
by decomposing it into local and global projections ($Q_s$, $K_s$, $V_s$ and $Q_g$, $K_g$, $V_g$), configured such that sparse windowed self-attention scales as $O(n \times (w + g))$, where $n$ is the input length, $w$ the window size, and $g$ the number of global tokens. This keeps encoder self-attention cost linear in sequence length; the decoder remains $O(m^2)$ in its output length $m$, but $m \ll n$ in long-document applications.
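A rough back-of-the-envelope comparison illustrates the saving (the window and global-token counts below are illustrative assumptions, not prescribed LED settings):

```python
# Attention-score counts for a 16K-token input (illustrative values only).
n, w, g = 16_384, 512, 1               # input length, window size, global tokens

dense_scores = n * n                     # full self-attention: O(n^2)
sparse_scores = n * (w + g) + g * n      # windowed attention plus symmetric global attention

print(f"dense : {dense_scores:,}")       # 268,435,456
print(f"sparse: {sparse_scores:,}")      # ~8.4 million
print(f"ratio : {dense_scores / sparse_scores:.1f}x")  # ~32x fewer score entries
```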
2. Sequence-to-Sequence Learning with LED
LED enables long-sequence generative modeling in a transfer-learning setting, initializing from pretrained seq2seq weights (e.g., BART) and extending positional embeddings to 16K tokens by repeatedly copying the base model's learned position embeddings. Supervised finetuning then adapts the model to downstream generative tasks without further architectural changes. Unlike BigBird, which relies on additional specialized pretraining, LED achieves strong results with this lightweight initialization and standard supervised learning.
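A minimal sketch of this initialization step, assuming a BART-style table of 1,024 learned positions extended to 16,384 (the helper and tensor shapes are illustrative; LED's released checkpoints already ship with the extension applied):

```python
import torch

def extend_position_embeddings(base_pos_emb: torch.Tensor, target_len: int) -> torch.Tensor:
    """Tile a pretrained position-embedding table to cover longer inputs.

    base_pos_emb: (src_len, hidden) learned position embeddings, e.g. 1024 x 1024.
    Returns a (target_len, hidden) table initialized by repeated copying.
    """
    src_len, hidden = base_pos_emb.shape
    repeats = -(-target_len // src_len)            # ceiling division
    extended = base_pos_emb.repeat(repeats, 1)[:target_len]
    return extended.clone()

pretrained = torch.randn(1024, 1024)               # stand-in for BART's table
led_pos_emb = extend_position_embeddings(pretrained, target_len=16_384)
print(led_pos_emb.shape)                           # torch.Size([16384, 1024])
```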
LED’s efficient attention architecture also extends beyond text: in direct speech translation (Alastruey et al., 2021), the Longformer encoder processes high-resolution mel-spectrograms directly, eliminating the need for lossy pre-encoder convolutional downsampling. Experiments demonstrated that, with windowed self-attention, LED can process long audio sequences at feasible computational cost, with accuracy closely matching conventional approaches that rely on convolutional subsampling.
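To see why this matters, consider typical mel-spectrogram frame rates (the 10 ms hop size and 4x convolutional subsampling below are common conventions in speech Transformers, assumed here purely for illustration):

```python
# Sequence lengths for a 60-second utterance (illustrative settings).
duration_s = 60
hop_ms = 10                                    # common mel-spectrogram hop size
frames = duration_s * 1000 // hop_ms           # 6,000 encoder input positions

# Conventional ST encoders shrink this with two stride-2 convolutions (4x), at a cost in detail.
subsampled = frames // 4                       # 1,500 positions

print(f"raw frames        : {frames}")
print(f"after 4x conv     : {subsampled}")
print(f"dense attn scores : {frames**2:,}")    # 36,000,000
print(f"windowed (w=256)  : {frames * 256:,}") # 1,536,000
```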
3. Empirical Results in Long-Document Modeling
On arXiv summarization (Beltagy et al., 2020), LED demonstrates competitive or state-of-the-art performance. LED-large with a 16K token input achieves ROUGE-1/2/L scores of 46.63/19.62/41.83, outperforming or matching baseline models such as Pegasus and BigBird—despite the absence of further domain-specific pretraining. Increasing input length correlates with significant improvements, emphasizing the benefit of modeling maximal context.
In direct speech translation (Alastruey et al., 2021), LED-based models approached the results of reference systems relying on convolutional dimensionality reduction, with an ST BLEU lag of 1.8–2.1 points and ASR WER lag of 1.4–1.8 points. Post-encoder convolutional reduction (stride-2 convolution after encoding, rather than before) provided minimal improvement, indicating the architecture can effectively align raw, long input sequences and shorter outputs.
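A sketch of such a post-encoder reduction under these assumptions (a single stride-2 1-D convolution applied to encoder outputs before cross-attention; kernel size and padding are illustrative choices, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class PostEncoderReduction(nn.Module):
    """Halve the encoder output length before cross-attention (illustrative)."""

    def __init__(self, d_model: int = 512):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1)

    def forward(self, enc_out: torch.Tensor) -> torch.Tensor:
        # enc_out: (batch, seq_len, d_model) -> (batch, ~seq_len / 2, d_model)
        x = enc_out.transpose(1, 2)        # Conv1d expects (batch, channels, time)
        x = self.conv(x)
        return x.transpose(1, 2)

reducer = PostEncoderReduction(d_model=512)
enc_out = torch.randn(2, 6000, 512)        # e.g. 60 s of 10 ms mel frames
print(reducer(enc_out).shape)              # torch.Size([2, 3000, 512])
```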
For multilingual machine translation, a comparative study indicates that encoder-decoder architectures of the kind LED exemplifies (including mT5 and IndicBART) surpass decoder-only models (e.g., XLNet, LLaMA-2) on BLEU and chrF metrics for Indian languages in both few-shot and finetuned settings (M. et al., 12 Sep 2024). Encoder-decoder models demonstrate clear advantages in variable-length mapping, bidirectional context integration, and multilingual generalization.
4. Advances in Attention Patterns, Scaling, and Efficiency
LED fundamentally improves computational efficiency by reducing encoder self-attention complexity from $O(n^2)$ to $O(n \times (w + g))$, where the number of global positions $g$ is typically fixed and negligible in size. This linear scaling enables processing of sequences up to 16,384 tokens on standard GPUs.
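As a usage sketch, the Hugging Face transformers implementation of LED exposes this pattern via a `global_attention_mask`; the checkpoint name and generation settings below are illustrative (assuming the publicly hosted `allenai/led-large-16384-arxiv` model), and memory and runtime depend on hardware:

```python
import torch
from transformers import LEDForConditionalGeneration, LEDTokenizer

tokenizer = LEDTokenizer.from_pretrained("allenai/led-large-16384-arxiv")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-large-16384-arxiv")

long_document = "..."  # up to ~16K tokens of input text
inputs = tokenizer(long_document, return_tensors="pt",
                   truncation=True, max_length=16_384)

# Global attention on the first token only; all other tokens use windowed attention.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(inputs["input_ids"],
                             attention_mask=inputs["attention_mask"],
                             global_attention_mask=global_attention_mask,
                             max_length=256, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```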
Further efficiency can be gained by addressing encoder-decoder (cross-)attention. Sparse sentence selection, as proposed by Manakul et al. (2021), dynamically identifies a reduced set of salient input sentences to restrict encoder-decoder attention at each decoding step. This modification reduces inference complexity from $O(m \cdot N \cdot L)$ (with $m$ the output length, $N$ the number of input sentences, and $L$ the sentence length) to $O(m \cdot r \cdot L)$ for $r$ selected sentences, with negligible degradation in summarization quality for sufficiently large $r$, as evidenced empirically on CNN/DailyMail, XSum, Podcast, and arXiv.
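A simplified sketch of the idea (not Manakul et al.'s implementation; the saliency scores, top-$r$ selection, and mask format are illustrative assumptions) restricts which input positions the decoder may cross-attend to:

```python
import torch

def cross_attention_mask(sentence_ids: torch.Tensor,
                         saliency: torch.Tensor,
                         r: int) -> torch.Tensor:
    """Keep only tokens belonging to the r most salient sentences.

    sentence_ids: (seq_len,) sentence index of each encoder token.
    saliency:     (num_sentences,) a per-sentence relevance score.
    Returns a (seq_len,) boolean mask usable to restrict cross-attention.
    """
    keep_sentences = torch.topk(saliency, k=min(r, saliency.numel())).indices
    return torch.isin(sentence_ids, keep_sentences)

# Toy example: 12 tokens spread over 4 sentences; keep the top-2 sentences.
sentence_ids = torch.tensor([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
saliency = torch.tensor([0.1, 0.7, 0.2, 0.9])
print(cross_attention_mask(sentence_ids, saliency, r=2))
# tokens of sentences 1 and 3 are kept; all others are masked out
```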
5. Interpretability and Information Flow
DecoderLens (Langedijk et al., 2023) introduces a methodology to interpret encoder-decoder models such as LED by running the decoder on intermediate encoder layer outputs (rather than only final layer outputs), enabling investigation of information localization and processing stages. For summarization, translation, and speech processing, empirical evidence reveals that intermediate encoder layers often suffice for capturing local facts, easier translation pairs, and basic transcription, whereas deeper layers accumulate and integrate complex global information and abstraction required for challenging tasks.
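A hedged sketch of the DecoderLens idea applied to LED via the Hugging Face API (the layer index and the wrapping of intermediate states into `encoder_outputs` are assumptions about one reasonable way to wire this up, not the authors' exact code):

```python
import torch
from transformers import LEDForConditionalGeneration, LEDTokenizer
from transformers.modeling_outputs import BaseModelOutput

tokenizer = LEDTokenizer.from_pretrained("allenai/led-base-16384")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")

inputs = tokenizer("Some long input document ...", return_tensors="pt")
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

# Run the encoder once, keeping every intermediate layer's hidden states.
enc = model.led.encoder(input_ids=inputs["input_ids"],
                        attention_mask=inputs["attention_mask"],
                        global_attention_mask=global_attention_mask,
                        output_hidden_states=True)

# Decode from an intermediate encoder layer (e.g. layer 3) instead of the final one.
layer = 3
intermediate = BaseModelOutput(last_hidden_state=enc.hidden_states[layer])
out_ids = model.generate(encoder_outputs=intermediate,
                         attention_mask=inputs["attention_mask"],
                         max_length=64)
print(tokenizer.decode(out_ids[0], skip_special_tokens=True))
```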
Such interpretability studies indicate that LED and related architectures perform progressive information refinement: early encoder layers encode local or surface properties, intermediate layers synthesize phrase or entity-level information, and top layers perform complex integration—for example, abstract summary generation or many-to-many language mapping.
6. Scaling Laws, Downstream Performance, and Architectural Tradeoffs
Recent comparative studies (Zhang et al., 30 Oct 2025) find that encoder-decoder models (e.g., RedLLM, LED-style architectures) scale competitively with decoder-only models, especially after instruction tuning or task-specific adaptation. While decoder-only models are more parameter-efficient for pretraining (lower perplexity at a fixed parameter count), encoder-decoder models provide comparable scaling exponents, smoother context-length generalization due to cross-attention, and dominate the inference-efficiency Pareto frontier after downstream tuning. Inference efficiency is particularly relevant for applications involving extremely long inputs (e.g., document summarization, code generation, long-form translation).
LED’s modular architecture, which can be swapped in for any encoder-decoder task, and its compatibility with mixed sparse attention patterns render it a suitable baseline for new efficient long-sequence seq2seq research, including streaming self-attention for online or real-time generation scenarios (M. et al., 12 Sep 2024).
7. Limitations and Future Directions
LED’s sparse attention patterns perform well on tasks with strong locality or hierarchical input structure. However, certain scenarios, such as tasks with global dependencies not captured by the selected global tokens, or datasets where salient positions are not easily identified a priori, may result in information loss or reduced performance. Empirical results show that performance does not strictly improve with larger window sizes; large windows can decrease training stability for speech tasks (Alastruey et al., 2021).
Ongoing research directions include the exploration of alternative scalable attention mechanisms (e.g., block-sparse attention, low-rank linear transforms), asymmetric encoder/decoder depth or capacity, automatic global token selection, improved initialization regimes, and task-specific adaptation of sparse encoder-decoder attention (Zhang et al., 30 Oct 2025, Manakul et al., 2021). Practical improvements include dynamic sentence selection, further pretraining of LED for new domains, and systematic efficiency benchmarking.
| Dimension | LED Approach | Key Effect |
|---|---|---|
| Encoder self-attn | Sliding window + global tokens | Linear scaling, feasible for long sequences |
| Encoder-decoder attn | Full (baseline) / sparse (opt.) | Lowered inference cost, maintained quality |
| Interpretability | DecoderLens, attention maps | Layerwise task evolution, information flow |
| Initialization | Pretrained BART reused | Fast adaptation to long input tasks |
| Empirical performance | State-of-the-art arXiv summarization | Maximizes benefit of long input context |
LED introduces a scalable, modular, and empirically validated framework for long sequence-to-sequence modeling, enabling research and deployment on tasks with previously prohibitive input lengths. Its efficient attention architecture, strong empirical benchmarks, and compatibility with transfer learning position it as a reference architecture for further innovations in efficient, long-context generative modeling and sequence transduction (Beltagy et al., 2020, Alastruey et al., 2021, M. et al., 12 Sep 2024, Manakul et al., 2021, Zhang et al., 30 Oct 2025, Langedijk et al., 2023).