Truncated Self-Attention in Transformers
- Truncated self-attention is a modification in Transformer models that limits the set of attended keys to reduce quadratic computation and memory costs.
- It uses techniques such as fixed windowing, $N$-gram masks, low-rank approximation, and dilated contexts to retain long-range context at reduced cost.
- This approach is applied in streaming ASR, machine translation, and energy-efficient hardware, achieving significant speedup and minimal accuracy loss.
Truncated self-attention encompasses a family of modifications to the canonical self-attention mechanism in Transformer models wherein the set of attended keys for each query is explicitly restricted, masked, or approximated. The principal motivation is to mitigate the quadratic computational and memory complexity of full self-attention ($O(n^2 d)$ for sequence length $n$ and feature dimension $d$) without significantly degrading model accuracy. Techniques for truncation include fixed windowing, $N$-gram masks, low-rank or partial approximations, learned runtime pruning, and dilated context. These methods have been validated across automatic speech recognition (ASR), machine translation, and language modeling, yielding streamable, memory- and energy-efficient architectures and hardware accelerators with minimal accuracy loss.
1. Rationales and Taxonomy of Truncation Methods
Truncated self-attention targets two major resource constraints: inference/decoding latency for long sequences, especially in streaming/online settings, and hardware limitations for high-throughput or energy-efficient deployment. Truncation is deployed using several key algorithmic paradigms:
- Local/Windowed Attention: Only a fixed span around each query is attended (e.g., a bounded left context plus a small look-ahead for streaming ASR) (Yeh et al., 2019).
- $N$-gram Masked Self-Attention: At each decoding step $t$, attention is masked to only the $N-1$ prior tokens (an $N$-gram context) (Chelba et al., 2020).
- Low-rank/Partial Score Estimation: Only a small subset of the attention score matrix is computed exactly, with the remainder reconstructed via learned statistical estimators exploiting observed low-rank structure (Bhojanapalli et al., 2021).
- Dilated/Hierarchical Contexts: Local windows are augmented with summaries (e.g., via mean pooling or attention-based pooling) of non-overlapping or subsampled chunks, allowing coverage of distant context at low resolution (Moritz et al., 2021).
- Learned Runtime Pruning: Adaptive, per-layer, learned score thresholds prune away inessential attention connections selectively at inference, implemented in both software and specialized hardware (Li et al., 2022).
The principal distinctions lie in whether truncation is hard (masking), soft (statistical approximation), or learned on the fly, and whether the restriction is uniform, data-driven, or dynamic.
2. Mathematical Formulations and Algorithms
2.1 Local and $N$-gram Masked Attention
For a single attention head at position $t$, local attention with left context $w_l$ and right context $w_r$ is realized as

$$\mathrm{Attn}(q_t) = \mathrm{softmax}\!\left(\frac{q_t K_{t-w_l:t+w_r}^{\top}}{\sqrt{d}}\right) V_{t-w_l:t+w_r}.$$

Truncated $N$-gram masking uses a binary mask $M \in \{0,1\}^{n \times n}$, applied additively (as $-\infty$ for masked positions) within the softmax (Chelba et al., 2020).
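Both truncation patterns reduce to a boolean mask applied inside the softmax. A minimal NumPy sketch (illustrative only; function names and the specific mask shapes are mine, not from the cited implementations):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerically stable
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def truncated_attention(Q, K, V, mask):
    # Scaled dot-product scores; disallowed positions get -inf before softmax.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores, axis=-1) @ V

def local_mask(n, w_left, w_right):
    # Position t may attend to keys in [t - w_left, t + w_right].
    rel = np.arange(n)[None, :] - np.arange(n)[:, None]  # key idx minus query idx
    return (rel >= -w_left) & (rel <= w_right)

def ngram_mask(n, N):
    # Causal N-gram mask: position t attends to itself and the N-1 prior tokens.
    rel = np.arange(n)[None, :] - np.arange(n)[:, None]
    return (rel <= 0) & (rel > -N)
```

Each output row is a convex combination of at most $w_l + w_r + 1$ (local) or $N$ ($N$-gram) value vectors, which is what makes the per-query cost independent of sequence length.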
2.2 Statistical Reconstruction via Low-rank Estimation
Given the attention score matrix $A$ and a selected subset of entries computed exactly, the remaining entries are reconstructed via a linear estimator,

$$\hat{A}_{\mathcal{U}} = \Sigma_{\mathcal{U}\mathcal{O}}\, \Sigma_{\mathcal{O}\mathcal{O}}^{-1}\, A_{\mathcal{O}},$$

where $\mathcal{O}$ indexes the exactly computed entries, $\mathcal{U}$ the remainder, and $\Sigma$ is the population covariance over score matrices. Because $\Sigma$ is empirically observed to be low-rank, highly accurate recovery is possible from a small $\mathcal{O}$ (Bhojanapalli et al., 2021).
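A toy NumPy sketch of the covariance-based linear estimator (a simplification under assumed names and a synthetic low-rank model, not the exact procedure of Bhojanapalli et al.): score vectors live in an $r$-dimensional subspace, so a handful of exactly computed entries determines the rest almost perfectly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: flattened score vectors s = B z lie in an r-dimensional subspace,
# so their covariance Sigma = B B^T has rank r << m (the "low-rank" observation).
m, r = 12, 3
B = rng.normal(size=(m, r))
Sigma = B @ B.T

obs = np.arange(4)      # entries of the score vector computed exactly
un = np.arange(4, m)    # entries to be reconstructed statistically

def reconstruct(s_obs, Sigma, obs, un, eps=1e-8):
    # Linear MMSE-style estimate: s_hat[un] = Sigma[un,obs] Sigma[obs,obs]^{-1} s[obs].
    # eps regularizes the (rank-deficient) observed-block covariance.
    Soo = Sigma[np.ix_(obs, obs)] + eps * np.eye(len(obs))
    return Sigma[np.ix_(un, obs)] @ np.linalg.solve(Soo, s_obs)
```

With only 4 of 12 entries observed, the remaining 8 are recovered essentially exactly here because the covariance has rank 3; in practice the recovery is approximate to the degree that real score matrices deviate from low rank.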
2.3 Dilated (Multi-Resolution) Self-Attention
For input $X \in \mathbb{R}^{n \times d}$, queries attend to local neighborhoods of width $w$ and to chunk-wise summaries of distant regions, with summarization via subsampling, mean pooling, or attention pooling. The final attended keys and values comprise both sets, and the softmax is computed over their concatenation (Moritz et al., 2021).
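A minimal sketch of the multi-resolution idea using mean-pooled chunk summaries (illustrative; the pooling choice and per-query loop are mine, and Moritz et al. also study subsampling and attention pooling):

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def dilated_attention(Q, K, V, w, c):
    """Each query attends to its local window of width w (each side) plus
    mean-pooled summaries of every length-c chunk (coarse global context)."""
    n, d = Q.shape
    n_chunks = n // c
    K_sum = K[: n_chunks * c].reshape(n_chunks, c, d).mean(axis=1)
    V_sum = V[: n_chunks * c].reshape(n_chunks, c, d).mean(axis=1)
    out = np.empty_like(Q)
    for t in range(n):
        lo, hi = max(0, t - w), min(n, t + w + 1)
        k = np.concatenate([K[lo:hi], K_sum])   # fine local + coarse global keys
        v = np.concatenate([V[lo:hi], V_sum])
        out[t] = softmax(Q[t] @ k.T / np.sqrt(d)) @ v
    return out
```

Each query sees at most $2w + 1 + n/c$ keys instead of $n$, which is the source of the cost reduction quoted below.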
3. Algorithmic, Implementation, and Architectural Aspects
3.1 Streaming and Buffering
Local and $N$-gram truncation methods require only fixed-size FIFO buffers per Transformer layer (of size $w$ for local attention, $N-1$ for $N$-gram), decoupling state from total sequence length. This is crucial for streaming ASR and real-time inference (Yeh et al., 2019, Chelba et al., 2020).
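The buffering pattern is just a bounded FIFO of past key/value pairs per layer; a minimal sketch (the class name and interface are hypothetical):

```python
from collections import deque

class StreamingAttentionCache:
    """Fixed-size FIFO of past (key, value) pairs for one layer.
    Memory stays O(window) no matter how many frames have streamed through."""
    def __init__(self, window: int):
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def push(self, k, v):
        self.keys.append(k)      # oldest entry is evicted automatically
        self.values.append(v)

    def context(self):
        # Keys/values the next query is allowed to attend over.
        return list(self.keys), list(self.values)
```

Because `deque(maxlen=...)` evicts on append, the cache never grows with sequence length, which is exactly the constant-memory property streaming decoding needs.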
3.2 Computational Complexity
Key operational trade-offs:
| Method | Complexity (per layer) | Notes |
|---|---|---|
| Full self-attention | $O(n^2 d)$ | All pairs |
| Local window | $O(n w d)$ | $w$ fixed |
| $N$-gram mask | $O(n N d)$ | |
| Low-rank/statistical | $O(n k d)$ | $k \ll n$ entries computed exactly |
| Dilated attention | $O(n (w + n/c) d)$ | $c$: chunk size |
| Runtime pruning | $O(n^2 d)$ worst case (but many terms skipped) | (Li et al., 2022) |
The $O(nw)$ or $O(nN)$ scaling of window-based and $N$-gram approaches, and the linear-in-$n$ scaling of statistical methods, enable accelerators and mobile-class deployment.
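These asymptotics translate into large constant-factor gaps at realistic lengths. A back-of-the-envelope comparison (illustrative parameter values chosen by me, counting only score-computation multiply-accumulates per layer per head):

```python
# n = sequence length, d = head dim, w = one-sided window,
# N = n-gram order, c = chunk size for dilated summaries.
n, d, w, N, c = 4096, 64, 128, 8, 64

full = n * n * d                          # every query against every key
local = n * (2 * w + 1) * d               # fixed span per query
ngram = n * N * d                         # causal N-token history per query
dilated = n * (2 * w + 1 + n // c) * d    # local window + one summary per chunk

print(full / local)                       # speedup of local over full attention
```

At $n = 4096$ the local window is roughly $16\times$ cheaper and the $N$-gram mask hundreds of times cheaper, and the gap widens linearly with $n$.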
3.3 Hardware and Energy-Efficient Designs
LeOPArd, a specialized accelerator, dynamically prunes attention using learned thresholds, bit-serial dot-product computation, and early termination. This design achieves multi-fold speedup and energy savings at negligible accuracy loss, validated across 43 tasks and models (MemN2N, BERT, GPT-2, ViT) (Li et al., 2022).
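The core thresholding idea (not LeOPArd's bit-serial hardware or its learned, per-layer thresholds) can be sketched in a few lines; here `theta` is simply a fixed margin below the maximum score, a stand-in for the learned threshold:

```python
import numpy as np

def pruned_attention_row(q, K, V, theta):
    """Softmax attention for one query, dropping keys whose raw score falls
    more than theta below the maximum; such keys would contribute a
    near-zero softmax weight anyway."""
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)
    keep = scores >= scores.max() - theta        # prune low-contribution keys
    w = np.exp(scores[keep] - scores[keep].max())
    return (w / w.sum()) @ V[keep], int(keep.sum())
```

Because softmax weights decay exponentially in the score gap, pruning keys far below the maximum changes the output very little while skipping their value reads and multiplies entirely.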
4. Hyperparameter Choices, Trade-Offs, and Empirical Findings
4.1 Window and Mask Sizes
- Local window: a modest attention span achieves only a small relative WER increase on LibriSpeech test-clean while enabling streamable, constant-memory inference (Yeh et al., 2019).
- $N$-gram: a moderate $N$ delivers a $2\times$ or greater reduction in FLOPs and memory with a small BLEU loss; smaller $N$ incurs sharper degradation (Chelba et al., 2020).
- Dilated: moderate window and chunk sizes (on the order of $25$ and $30$, respectively) suffice to match or outperform full attention in ASR at only a fraction (on the order of $15\%$) of the compute (Moritz et al., 2021).
- Low-rank selection: a small budget of exactly computed scores (up to around $32$) gives substantial compute savings with negligible accuracy loss for BERT-Base on MNLI and MLM (Bhojanapalli et al., 2021).
4.2 Empirical Performance
Representative results and trade-offs:
| Method | Task | Config | Accuracy Loss | Cost Reduction |
|---|---|---|---|---|
| Local windowed | LibriSpeech ASR | fixed span | small relative WER increase | constant-memory, streamable |
| $N$-gram masking | WMT'14 En-Fr | moderate $N$ | $0.3$–$0.4$ BLEU drop | $\geq 2\times$ FLOPs |
| Dilated | LibriSpeech ASR | AP-2+PP | $2.4$ vs. $2.6$ WER | $7.6$M vs. $52$M mults |
| Runtime pruning | BERT/GLUE | LeOPArd | negligible | $\geq 2.6\times$ speedup |
5. Modeling and Theoretical Insights
Truncated self-attention leverages the empirical observation that attention score matrices are low-rank, with most variance captured by a small number of principal components (Bhojanapalli et al., 2021). Windowed and masked attention composes long-range dependencies via layer stacking: even with a $w$-width restriction per layer, stacking $L$ layers produces an effective receptive field of $O(Lw)$. Dilated and hierarchical variants preserve global context at low resolution, trading away detail at long range for efficiency. Runtime pruning schemes leverage the sparsity of large-magnitude scores and optimize prune thresholds as learnable parameters.
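The receptive-field claim can be checked directly: with a symmetric window of width $w$ per layer, the reachability mask after $L$ layers is the $L$-th boolean power of the one-layer mask, and its radius grows to $Lw$ (a small self-contained check with parameter values chosen for illustration):

```python
import numpy as np

def window_reach(n, w):
    # One layer: position t can draw information from [t - w, t + w].
    rel = np.abs(np.arange(n)[None, :] - np.arange(n)[:, None])
    return rel <= w

n, w, L = 64, 2, 5
reach = window_reach(n, w)
acc = reach.copy()
for _ in range(L - 1):
    # Boolean matrix product = "reachable in one more layer of hops".
    acc = (acc.astype(int) @ reach.astype(int)) > 0
```

After the loop, `acc[t]` is true exactly on $[t - Lw,\ t + Lw]$ (clipped at the boundaries), confirming the $O(Lw)$ effective receptive field of stacked windowed layers.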
6. Practical Deployment and Application Scenarios
Truncated self-attention is deployed in:
- Streaming Speech Recognition: Windowed (bounded left/right context) or dilated attention is essential for online, causal decoding (Yeh et al., 2019, Moritz et al., 2021).
- Low-latency Machine Translation/Decoding: $N$-gram masking enables batched, high-throughput decoding with minimal context and FLOP cost (Chelba et al., 2020).
- Pretrained LLMs: Statistical reconstruction yields roughly 40% FLOP savings on BERT (MNLI/MLM) with negligible accuracy loss (Bhojanapalli et al., 2021).
- Energy- and Area-constrained Hardware: Learned pruning enables throughput/energy improvements of $2\times$ or more with negligible quality loss (Li et al., 2022).
A plausible implication is that as sequence lengths scale to thousands or more, such truncation schemes become not only desirable but necessary for practical deployment in both edge and datacenter regimes.
7. Limitations and Future Directions
While truncated self-attention methods preserve most model accuracy, residual degradation is observed for very small windows or aggressive pruning (BLEU and WER drop-offs for small $N$ or tiny windows). For tasks requiring fine-grained long-range dependencies, multi-resolution and hybrid approaches (window + dilation, or statistical + pruning) offer superior trade-offs (Moritz et al., 2021). Theoretical characterization of these trade-offs and their effect on interpretability and downstream generalization remains open. Extensions to vision, graph, and multimodal Transformers are active areas of research, along with further optimization of runtime scheduling and adaptive windowing.
References:
(Yeh et al., 2019, Chelba et al., 2020, Bhojanapalli et al., 2021, Moritz et al., 2021, Li et al., 2022)