Truncated Self-Attention in Transformers

Updated 19 March 2026
  • Truncated self-attention is a modification in Transformer models that limits the set of attended keys to reduce quadratic computation and memory costs.
  • It utilizes techniques like fixed windowing, n‑gram masks, low-rank estimation, and dilated contexts to efficiently capture long-range dependencies.
  • This approach is applied in streaming ASR, machine translation, and energy-efficient hardware, achieving significant speedup and minimal accuracy loss.

Truncated self-attention encompasses a family of modifications to the canonical self-attention mechanism in Transformer models wherein the set of attended keys for each query is explicitly restricted, masked, or approximated. The principal motivation is to mitigate the quadratic computational and memory complexity of full self-attention ($O(T^2 d)$ for sequence length $T$ and feature dimension $d$) without significantly degrading model accuracy. Techniques for truncation include fixed windowing, $N$-gram masks, low-rank or partial approximations, learned runtime pruning, and dilated context. These methods have been validated across automatic speech recognition (ASR), machine translation, and language modeling, yielding streamable, memory- and energy-efficient architectures and hardware accelerators with minimal accuracy loss.

1. Rationales and Taxonomy of Truncation Methods

Truncated self-attention targets two major resource constraints: inference/decoding latency for long sequences, especially in streaming/online settings, and hardware limitations for high-throughput or energy-efficient deployment. Truncation is deployed using several key algorithmic paradigms:

  • Local/Windowed Attention: Only a fixed span $[t-L, t+R]$ around each query $t$ is attended (e.g., $L=32$, $R=4$ for streaming ASR) (Yeh et al., 2019).
  • $N$-gram Masked Self-Attention: At each decoding step $i$, attention is masked to only the $N-1$ prior tokens ($j \in [i-N+1, i-1]$) (Chelba et al., 2020).
  • Low-rank/Partial Score Estimation: Only a small subset of the attention score matrix is computed exactly, with the remainder reconstructed via learned statistical estimators exploiting observed low-rank structure (Bhojanapalli et al., 2021).
  • Dilated/Hierarchical Contexts: Local windows are augmented with summaries (e.g., via mean-pooling, attention-based pooling) of disjoint, non-overlapping, or subsampled groups, allowing coverage of distant context at low resolution (Moritz et al., 2021).
  • Learned Runtime Pruning: Adaptive, per-layer, learned score thresholds prune away inessential attention connections selectively at inference, implemented in both software and specialized hardware (Li et al., 2022).
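As a concrete illustration of the hard-masking paradigms above, the window and $N$-gram patterns can be built as boolean masks. The following is a minimal NumPy sketch; the function names are illustrative, not from the cited papers:

```python
import numpy as np

def local_window_mask(T, L, R):
    """Boolean mask: query t may attend to keys in [t-L, t+R]."""
    idx = np.arange(T)
    rel = idx[None, :] - idx[:, None]   # rel[t, s] = s - t
    return (rel >= -L) & (rel <= R)

def ngram_mask(T, N):
    """Causal N-gram mask: step i attends to the N-1 prior tokens only."""
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    d = i - j                            # d > 0: strictly past positions
    return (d > 0) & (d < N)

# e.g. local_window_mask(6, 2, 1)[3] allows keys 1..4;
# ngram_mask(6, 3)[3] allows only keys 1 and 2
```

Both masks are static and can be precomputed once per sequence length, in contrast to the learned runtime-pruning variant, which decides per input which connections to drop.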

The principal distinction lies in whether truncation is hard (masking), soft (statistical approximation), or learned on the fly, and whether the restriction is uniform, data-driven, or dynamic.

2. Mathematical Formulations and Algorithms

2.1 Local and $N$-gram Masked Attention

For a single attention head at position $t$, local attention over the span $[t-L, t+R]$ is realized as

$$\alpha_{t,s} = \frac{\exp\!\left(Q_t K_s^\top / \sqrt{d_k}\right)}{\sum_{s'=t-L}^{t+R} \exp\!\left(Q_t K_{s'}^\top / \sqrt{d_k}\right)}, \qquad h_t = \sum_{s=t-L}^{t+R} \alpha_{t,s} V_s$$

Truncated $N$-gram masking uses a binary mask $M \in \{0,1\}^{T \times T}$:

$$M_{i,j} = \begin{cases} 1 & \text{if } 0 < i - j < N \\ 0 & \text{otherwise} \end{cases}$$

applied additively (as $-\infty$ for masked positions) within the softmax (Chelba et al., 2020).
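A hedged sketch of how such a mask enters scaled dot-product attention (single head, NumPy; `masked_attention` is an illustrative name, and the mask is assumed to allow at least one key per query so the softmax is well defined):

```python
import numpy as np

def masked_attention(Q, K, V, mask):
    """Scaled dot-product attention; disallowed keys receive -inf pre-softmax."""
    dk = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)                  # (T, T) raw scores
    scores = np.where(mask, scores, -np.inf)        # additive -inf masking
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    w = np.exp(scores)                              # exp(-inf) = 0 for masked keys
    w /= w.sum(axis=-1, keepdims=True)              # softmax over allowed keys only
    return w @ V, w

T, d = 5, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
# local window L=2, R=1; every row allows at least the query itself
rel = np.arange(T)[None, :] - np.arange(T)[:, None]
mask = (rel >= -2) & (rel <= 1)
out, w = masked_attention(Q, K, V, mask)
```

Because the mask is applied before the softmax, the attention weights of truncated positions are exactly zero, so the weighted sum over values never touches keys outside the window.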

2.2 Statistical Reconstruction via Low-rank Estimation

Given an attention score matrix $S \in \mathbb{R}^{n \times n}$ and a selected subset $P$ of entries that are computed exactly, the remaining entries of each score row $a$ are reconstructed as

$$\hat{a}_{\bar{P}} = \Sigma_{\bar{P}P} \Sigma_{PP}^{-1} a_{P}$$

where $\Sigma$ is the population covariance over score matrices, empirically observed to be low-rank; this enables highly accurate recovery from $k \ll n$ exact entries (Bhojanapalli et al., 2021).
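The conditional-mean estimator can be exercised on synthetic low-rank score rows. In this hedged sketch, $\Sigma$ is estimated empirically from sampled rows and a small ridge term stabilizes the inversion; sizes and names are illustrative, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, samples = 32, 3, 500

# synthetic low-rank score rows: a = B z with fixed rank-r basis B
B = rng.normal(size=(n, r))
A = (B @ rng.normal(size=(r, samples))).T      # (samples, n) observed rows
Sigma = np.cov(A, rowvar=False)                # empirical covariance, rank ~ r

P = np.arange(8)                               # entries computed exactly
Pbar = np.arange(8, n)                         # entries to reconstruct

a = B @ rng.normal(size=r)                     # a fresh score row
# a_hat = Sigma[Pbar,P] @ Sigma[P,P]^{-1} @ a[P], with a tiny ridge for stability
a_hat = Sigma[np.ix_(Pbar, P)] @ np.linalg.solve(
    Sigma[np.ix_(P, P)] + 1e-8 * np.eye(len(P)), a[P])
err = np.linalg.norm(a_hat - a[Pbar]) / np.linalg.norm(a[Pbar])
```

Because the synthetic rows are exactly rank $r$ and $|P| \geq r$, the reconstruction error here is near machine precision; on real attention matrices the rows are only approximately low-rank, so the recovery is accurate rather than exact.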

2.3 Dilated (Multi-Resolution) Self-Attention

For input $X = (x_1, \ldots, x_N)$, queries attend to local neighborhoods of width $W$ and to chunk-wise summaries of distant regions, with summarization via subsampling, mean, or attention pooling. The final attended keys and values comprise both sets, and the softmax is computed over their concatenation (Moritz et al., 2021).
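A minimal sketch of assembling the key/value set for one query, using mean pooling as the summarizer (the simplest of the three options above); `dilated_kv` is an illustrative name and the chunking scheme is one plausible reading of the construction:

```python
import numpy as np

def dilated_kv(keys, values, t, W, M):
    """Keys/values visible to query t: a local window of half-width W plus
    mean-pooled summaries of length-M chunks outside that window."""
    T = len(keys)
    lo, hi = max(0, t - W), min(T, t + W + 1)
    local_k, local_v = keys[lo:hi], values[lo:hi]
    summ_k, summ_v = [], []
    for c in range(0, T, M):
        # summarize only chunks fully outside the local window
        if c + M <= lo or c >= hi:
            summ_k.append(keys[c:c + M].mean(axis=0))
            summ_v.append(values[c:c + M].mean(axis=0))
    if summ_k:
        return (np.vstack([local_k, np.stack(summ_k)]),
                np.vstack([local_v, np.stack(summ_v)]))
    return local_k, local_v

rng = np.random.default_rng(0)
keys = rng.normal(size=(20, 4))
values = rng.normal(size=(20, 4))
k_set, v_set = dilated_kv(keys, values, t=10, W=2, M=5)
```

For $T=20$, $W=2$, $M=5$ the query sees 5 full-resolution frames plus 2 chunk summaries, i.e. 7 keys instead of 20; the softmax is then taken over this concatenated set.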

3. Algorithmic, Implementation, and Architectural Aspects

3.1 Streaming and Buffering

Local and $N$-gram truncation methods require only fixed-size FIFO buffers per Transformer layer (e.g., $L+R$ frames for local, $N$ for $N$-gram), decoupling state from total sequence length. This is crucial for streaming ASR or real-time inference (Yeh et al., 2019, Chelba et al., 2020).
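A causal sketch of the fixed-size buffering (left context only, i.e. $R=0$; a nonzero right context simply adds $R$ frames of lookahead latency). `StreamingAttentionLayer` is an illustrative name, not an API from the cited papers:

```python
from collections import deque
import numpy as np

class StreamingAttentionLayer:
    """Per-layer FIFO key/value cache of fixed length L+1 (left context plus
    the current frame); memory is independent of total stream length."""
    def __init__(self, L):
        self.k_buf = deque(maxlen=L + 1)
        self.v_buf = deque(maxlen=L + 1)

    def step(self, q, k, v):
        """Consume one frame's (q, k, v) and emit its attention output."""
        self.k_buf.append(k)
        self.v_buf.append(v)
        K = np.stack(self.k_buf)
        V = np.stack(self.v_buf)
        s = K @ q / np.sqrt(len(q))        # scores over buffered keys only
        w = np.exp(s - s.max())
        w /= w.sum()
        return w @ V

layer = StreamingAttentionLayer(L=3)
rng = np.random.default_rng(0)
for _ in range(10):                        # stream 10 frames; buffer stays at 4
    out = layer.step(*(rng.normal(size=4) for _ in range(3)))
```

The `deque(maxlen=...)` evicts the oldest frame automatically, which is exactly the FIFO behavior that keeps per-layer state constant as the stream grows.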

3.2 Computational Complexity

Key operational trade-offs:

| Method | Complexity (per layer) | Notes |
| --- | --- | --- |
| Full self-attention | $O(T^2 d)$ | All pairs |
| Local window | $O(T d W)$ | $W = L+R$ fixed |
| $N$-gram mask | $O(N T d)$ | $N \ll T$ |
| Low-rank/statistical | $O(n k d) + O(n^2 k)$ | $k \ll n$ |
| Dilated attention | $O(N d (W + N/M))$ | $M$: chunk size |
| Runtime pruning | $O(T^2 d)$, but many terms skipped | (Li et al., 2022) |

The $O(T)$ or $O(NT)$ scaling of window-based and $N$-gram approaches, and the linear-in-$k$ scaling of statistical methods, enable accelerators and mobile-class deployment.
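These scalings can be made concrete with a toy multiply-count model (a rough sketch: constants, softmax, and value-projection costs are deliberately omitted):

```python
def score_mults(T, d, W=None, N=None):
    """Approximate multiply count for forming attention scores in one layer."""
    if W is not None:        # local window: each query scores W keys
        return T * W * d
    if N is not None:        # N-gram mask: each query scores at most N keys
        return T * N * d
    return T * T * d         # full attention: all T x T pairs

T, d = 4096, 64
full = score_mults(T, d)
local = score_mults(T, d, W=36)   # e.g. L=32, R=4
ngram = score_mults(T, d, N=8)
```

At $T = 4096$ the window variant cuts score multiplies by a factor of $T/W \approx 114$ and the 8-gram mask by $T/N = 512$, which is the practical content of the linear-in-$T$ rows in the table above.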

3.3 Hardware and Energy-Efficient Designs

LeOPArd, a specialized accelerator, dynamically prunes attention using learned thresholds, bit-serial dot-product computation, and early termination. This setup achieves $2.6\times$ to $3.5\times$ speedup and $5.2\times$–$6.0\times$ energy savings at $<0.2\%$ accuracy loss, validated across 43 tasks and models (MemN2N, BERT, GPT-2, ViT) (Li et al., 2022).
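LeOPArd's learned per-layer thresholds and bit-serial early termination are hardware mechanisms; a loose software analogue (not the paper's implementation) drops scores far below each row's maximum before the softmax:

```python
import numpy as np

def pruned_softmax(scores, theta):
    """Keep only scores within theta of each row's max; softmax the survivors.
    theta is a fixed constant here; in LeOPArd the thresholds are learned."""
    keep = scores >= scores.max(axis=-1, keepdims=True) - theta
    s = np.where(keep, scores, -np.inf)
    s = s - s.max(axis=-1, keepdims=True)
    w = np.exp(s)
    return w / w.sum(axis=-1, keepdims=True), keep

# a score 4+ units below the row max is pruned; its weight becomes exactly 0
w, keep = pruned_softmax(np.array([[5.0, 4.9, -3.0]]), theta=1.0)
```

The pruned entries would have contributed almost nothing to the softmax anyway, which is why aggressive thresholds cost so little accuracy while skipping most of the value-weighted accumulation.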

4. Hyperparameter Choices, Trade-Offs, and Empirical Findings

4.1 Window and Mask Sizes

  • Local window: $L=32$, $R=4$ achieves $4.7\%$ relative WER loss on LibriSpeech test-clean but enables streamable, constant-memory inference (Yeh et al., 2019).
  • N-gram: $N \in [6, 10]$ delivers $2$–$3\times$ reduction in FLOPs and memory with $<0.5$ BLEU loss; smaller $N$ incurs sharper degradation (Chelba et al., 2020).
  • Dilated: $W \approx 15$–$25$, $M \approx 15$–$30$ suffice to match or outperform full attention in ASR with only $15$–$20\%$ of the compute (Moritz et al., 2021).
  • Low-rank selection: $k = 24$–$32$ gives $>40\%$ compute savings with $<2\%$ accuracy loss for BERT-Base on MNLI and MLM (Bhojanapalli et al., 2021).

4.2 Empirical Performance

Representative results and trade-offs:

| Method | Task | Config | Accuracy Loss | Cost Reduction |
| --- | --- | --- | --- | --- |
| Local windowed | LibriSpeech ASR | $L=32$, $R=4$ | $6.37\%$ vs $6.08\%$ WER | $O(T)$ |
| N-gram masking | WMT'14 En-Fr | $N=8$ | $0.3$–$0.4$ BLEU drop | $2$–$3\times$ FLOPs |
| Dilated | LibriSpeech ASR | AP-2+PP, $W=25$ | $2.4$ vs $2.6$ WER | $7.6$M vs $52$M mults |
| Runtime pruning | BERT/GLUE | LeOPArd | $<0.2\%$ | $2.6$–$3.5\times$ speedup |

5. Modeling and Theoretical Insights

Truncated self-attention leverages the empirical observation that attention score matrices are low-rank, with most variance captured by a small number of principal components (Bhojanapalli et al., 2021). Windowed and masked attention composes long-range dependencies via layer stacking: even with an $N$-width restriction per layer, stacking $L$ layers produces an effective receptive field of $1 + L(N-2)$. Dilated and hierarchical variants preserve global context at low resolution, trading away detail at long range for efficiency. Runtime pruning schemes leverage the sparsity of large-magnitude scores and optimize pruning thresholds as learnable parameters.

6. Practical Deployment and Application Scenarios

Truncated self-attention is deployed in:

  • Streaming Speech Recognition: Windowed ($L$/$R$) or dilated attention is mandatory for online, causal decoding (Yeh et al., 2019, Moritz et al., 2021).
  • Low-latency Machine Translation/Decoding: $N$-gram masks enable batched, high-throughput decoding with minimal context and FLOP cost (Chelba et al., 2020).
  • Pretrained LLMs: Statistical reconstruction yields roughly $40\%$ FLOP savings for BERT on MNLI and MLM with $<2\%$ loss (Bhojanapalli et al., 2021).
  • Energy- and Area-constrained Hardware: Learned pruning enables throughput/energy scaling of $2$–$6\times$ with negligible quality loss (Li et al., 2022).

A plausible implication is that as sequence lengths scale to thousands or more, such truncation schemes become not only desirable but necessary for practical deployment in both edge and datacenter regimes.

7. Limitations and Future Directions

While truncated self-attention methods preserve most model accuracy, residual degradation is observed for very small windows or aggressive pruning (BLEU and WER drop-offs for $N < 4$ or tiny windows). For tasks requiring fine-grained long-range dependencies, multi-resolution and hybrid approaches (window+dilation, or statistical+pruning) offer superior trade-offs (Moritz et al., 2021). Theoretical characterization of trade-offs and their effect on interpretability and downstream generalization remains open. Extensions to vision, graph, and multimodal transformers are active areas of research, along with further optimization of runtime scheduling and adaptive windowing.

References:

  • Yeh et al., 2019
  • Chelba et al., 2020
  • Bhojanapalli et al., 2021
  • Moritz et al., 2021
  • Li et al., 2022
