Restricted Self-Attention

Updated 7 May 2026

Restricted self-attention is a technique that limits a token's context to local or sparse regions using windowed, dilated, or exclusion masks.
It reduces the quadratic complexity of standard self-attention to near-linear scales, achieving up to 7× computational speedups with minimal accuracy loss in applications like speech recognition.
Hybrid and adaptive extensions recover global context while preserving efficiency, making the method practical for language, speech, and vision tasks.

Restricted self-attention is a class of modifications to the self-attention mechanism in Transformer and related architectures that limits or constrains the set of context positions each token attends to. The goal is typically to control computational complexity, latency, or inductive bias by enforcing locality, sparsity, or hard masking in the attention patterns. Restricted self-attention arises in various forms—including fixed-size windows, masked local neighborhoods, dilated connection patterns, or exclusion of a token’s own value vector—and is foundational in large-scale sequence modeling for speech, language, and vision tasks. These restrictions yield substantial benefits in efficiency and hardware tractability, but also introduce characteristic trade-offs in modeling power and representation capacity compared to full, global attention. The following sections provide a comprehensive technical overview.

1. Formal Taxonomy and Core Mechanisms

Restricted self-attention subsumes a family of masking and sparsification schemes that alter the standard full-range attention operation. In canonical self-attention, each query $i$ attends to all keys $j$ in a sequence of length $L$, yielding $O(L^2)$ time and memory complexity: $A = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right), \quad \mathrm{Out} = A V$, where $Q, K, V \in \mathbb{R}^{L \times d}$.

Restricted self-attention imposes a mask $M \in \mathbb{R}^{L \times L}$ such that $M_{ij} = \begin{cases} 0 & \text{if } |i-j| \leq k \\ -\infty & \text{otherwise} \end{cases}$ for a local window of size $2k+1$. The output for query $i$ becomes $\mathrm{Out}_i = \mathrm{softmax}\left( (Q_i K^{\top} + M_i)/\sqrt{d_k} \right) V$, where $M_i$ is the $i$-th row of $M$. Variants include:

Sliding window: Each query attends to a fixed-width window around itself (Moritz et al., 2021, Sharma et al., 2021).
Dilated window: Attends to keys at a regular stride within a window, with the stride parameterizing local density (Sharma et al., 2021).
Causal/local masks: Attend only to tokens within a bounded range of past positions and a bounded look-ahead (Moritz et al., 2021, Luo et al., 2021).
Exclusion masks: Remove a token’s own value vector from the aggregate, as in exclusive self-attention (Zhai, 10 Mar 2026).
Hardmax or sparse selection: Replace softmax by a combinatorial or hard selection, e.g., using a hardmax (Thuy et al., 2024), or use screening heuristics as in relevant attention for RelRNNs (Kerg et al., 2020).

This taxonomy covers both strictly local attention and the broader class of sparsified or masked-attention mechanisms, extending to restricted feature interactions and computation-adaptive schemes.
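The window, dilated, and causal masks above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than any cited paper's implementation: the function names `window_mask` and `masked_attention`, the `dilation` and `causal` parameters, and the choice to add the mask after scaling (equivalent here, because mask entries are only $0$ or $-\infty$) are all assumptions made for the example.

```python
import math

NEG_INF = float("-inf")

def window_mask(L, k, dilation=1, causal=False):
    """Additive mask: 0.0 where query i may attend to key j, -inf elsewhere.

    Hypothetical helper: k is the window half-width, dilation keeps every
    dilation-th offset, causal=True forbids look-ahead (j > i).
    """
    mask = [[NEG_INF] * L for _ in range(L)]
    for i in range(L):
        for j in range(L):
            offset = j - i
            if abs(offset) > k:
                continue  # outside the local window
            if offset % dilation != 0:
                continue  # skipped by the dilation stride
            if causal and offset > 0:
                continue  # future position
            mask[i][j] = 0.0
    return mask

def masked_attention(Q, K, V, mask):
    """Row-wise softmax(Q_i K^T / sqrt(d_k) + M_i) V on lists of lists."""
    d_k = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        scores = [
            sum(a * b for a, b in zip(q, key)) / math.sqrt(d_k) + mask[i][j]
            for j, key in enumerate(K)
        ]
        top = max(scores)  # finite: the diagonal is always allowed
        weights = [math.exp(s - top) for s in scores]  # exp(-inf) is 0.0
        total = sum(weights)
        out.append([
            sum(w * v[c] for w, v in zip(weights, V)) / total
            for c in range(len(V[0]))
        ])
    return out

# Toy usage: 4 tokens, 2 features, each token sees itself and one neighbour per side.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
local_out = masked_attention(X, X, X, window_mask(len(X), k=1))
```

The sketch still materializes the full $L \times L$ mask for clarity, so it does not realize the efficiency gain itself; a practical implementation would compute scores only for the allowed positions.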

2. Computational Complexity and Scaling Properties

The primary motivation for restricted self-attention is to reduce quadratic compute and memory overhead. The complexity characteristics are as follows:

Attention Type	Time Complexity	Memory Complexity	Description
Full (unrestricted)	$O(L^2)$	$O(L^2)$	All-to-all dependencies
Restricted (window)	$O(Lw)$	$O(Lw)$	Window size $w = 2k+1$
Dilated	$O(L(w + L/c))$	$O(L(w + L/c))$	Summarized distant context, one summary per chunk of size $c$

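For example, in ASR, where the sequence length $L$ is typically much larger than the window size $w$, full attention costs on the order of $L^2$ operations while restricted attention costs on the order of $Lw$, a reduction by a factor of roughly $L/w$ (Moritz et al., 2021, Sharma et al., 2021).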

Theoretical lower bounds establish that restricted (e.g., sliding window) attention has best possible complexity $O(Lw)$ relative to window size $w$, and no generic sub-linear workaround exists unless strong complexity conjectures fail (Keles et al., 2022).
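To make the scaling concrete, the number of (query, key) pairs scored under each scheme can be counted directly. The sketch below is illustrative only: the helper names are hypothetical, and the sequence length and window half-width are made-up values, not figures from the cited papers.

```python
def attended_pairs_full(L):
    """Every query scores every key."""
    return L * L

def attended_pairs_window(L, k):
    """Exact count of pairs with |i - j| <= k, including truncated edge windows."""
    return sum(min(L - 1, i + k) - max(0, i - k) + 1 for i in range(L))

# Illustrative values only (assumed, not taken from the source).
L, k = 1000, 10
full_pairs = attended_pairs_full(L)
local_pairs = attended_pairs_window(L, k)
reduction = full_pairs / local_pairs
```

Away from the sequence edges each query scores $2k+1$ keys, so the windowed count approaches $L(2k+1)$ and the reduction factor approaches $L/(2k+1)$ as $L$ grows.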

3. Extensions: Dilation, Dynamic Masking, and Memory-Augmented Variants

Restricted self-attention often sacrifices access to global context. Several extensions address this limitation:

Dilated Self-Attention: Summarizes distant regions (via subsampling, mean-pooling, or attention-based pooling) and appends them as coarse features to the local window (Moritz et al., 2021). This augments local high-resolution context with low-resolution global summaries at only marginal cost. For input length $L$, a local window of $w$ positions, and chunk size $c$, each query attends to $w$ local keys plus about $L/c$ summaries, for a total cost of $O(L(w + L/c))$.
Hybrid Memory-Augmented Attention: Integrates restricted local attention with a lightweight recurrent memory (e.g., an LSTM pathway) that propagates long-range dependencies at a per-timestep cost independent of sequence length, achieving a global receptive field with minimal overhead (Luo et al., 2021).
Relevancy Screening or Adaptive Buffering: Maintains a buffer of short-term states plus a selection of long-term states determined by relevance, supporting theoretically controlled gradient propagation at lower cost than full attention in sequential models (Kerg et al., 2020).
Dual-Stream Attention: Separates causal and non-causal streams to prevent cumulative growth of look-ahead in deep stacks, fixing overall latency and improving frame-synchronous processing (Moritz et al., 2021).

These mechanisms enable restricted attention to recover much of the modeling power of global attention with negligible accuracy loss and substantial efficiency gain.
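The dilated variant can be sketched as a local window whose key and value sets are extended with mean-pooled chunk summaries. This is a simplified illustration under stated assumptions, not the formulation of Moritz et al. (2021): the names `mean_pool_chunks` and `dilated_attention` are hypothetical, only mean-pooling is shown, and summaries of every chunk are appended, including chunks that overlap the local window.

```python
import math

def mean_pool_chunks(X, chunk):
    """Average consecutive rows of X in groups of `chunk` (last group may be shorter)."""
    pooled = []
    for start in range(0, len(X), chunk):
        block = X[start:start + chunk]
        pooled.append([sum(col) / len(block) for col in zip(*block)])
    return pooled

def softmax_weighted_sum(q, keys, values):
    """Scaled dot-product attention for a single query over the given keys."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d_k) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[c] for w, v in zip(weights, values)) / total
            for c in range(len(values[0]))]

def dilated_attention(Q, K, V, k, chunk):
    """Each query sees its local window plus one coarse summary per chunk."""
    K_sum = mean_pool_chunks(K, chunk)
    V_sum = mean_pool_chunks(V, chunk)
    out = []
    for i, q in enumerate(Q):
        lo, hi = max(0, i - k), min(len(K), i + k + 1)
        keys = K[lo:hi] + K_sum      # high-resolution local context + summaries
        values = V[lo:hi] + V_sum
        out.append(softmax_weighted_sum(q, keys, values))
    return out

# Toy usage: 6 frames, window half-width 1, chunks of 2 frames.
X = [[float(i), 1.0] for i in range(6)]
dilated_out = dilated_attention(X, X, X, k=1, chunk=2)
```

Each query here scores at most $2k+1$ local keys plus one summary per chunk, which is where the extra term in the dilated cost comes from.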

4. Empirical Results and Application Domains

Restricted self-attention is widely adopted in large-scale speech recognition, summarization, and generative image models, among others:

Speech Recognition / Summarization: On Wall Street Journal and LibriSpeech, restricting self-attention to local windows yields only a small absolute WER loss compared to full self-attention while substantially reducing computation. Dilated self-attention recovers essentially all of the lost performance at a fraction of the cost of full attention (Moritz et al., 2021, Sharma et al., 2021, Luo et al., 2021).
End-to-End Summarization: Enables direct document- or audio-level summarization (e.g., up to 100s of audio) that was previously infeasible with full attention due to hardware constraints. Models with restricted attention outperform strong cascaded baselines and substantially reduce parameter count (Sharma et al., 2021).
Image Reconstruction: Linearized or restricted SA modules (e.g., channel-attention equivalents) achieve sample quality comparable or identical to full SA on image benchmarks while reducing runtime and memory (Wang et al., 2019).
Streaming and Online Inference: Restricted attention is compatible with stream processing in speech and LLMs, with explicit trade-offs between latency and context length (Moritz et al., 2021).
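The latency side of this trade-off follows from simple arithmetic: when layers with a fixed per-layer window are stacked without further measures, the reachable context, and with it the look-ahead, grows linearly with depth. The sketch below illustrates that accumulation; the function names, layer count, window sizes, and frame duration are hypothetical values chosen for the example.

```python
def receptive_field(num_layers, past, lookahead):
    """Frames reachable from the top layer as (total past, total look-ahead)."""
    return num_layers * past, num_layers * lookahead

def lookahead_latency_ms(num_layers, lookahead, frame_ms):
    """Algorithmic delay contributed by accumulated look-ahead frames."""
    return num_layers * lookahead * frame_ms

# Hypothetical configuration: 12 layers, 8 past and 2 future frames per layer,
# 40 ms per encoder frame.
total_past, total_future = receptive_field(12, past=8, lookahead=2)
delay_ms = lookahead_latency_ms(12, lookahead=2, frame_ms=40)
```

Under these assumed values the top layer depends on 24 future frames, or 960 ms of audio, which shows why deep stacks need the look-ahead to be bounded explicitly.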

5. Theoretical Foundations and Limitations

Restricted attention presents fundamental trade-offs:

Contextual Expressivity: By excluding distant positions, restricted attention limits model capacity for long-range dependencies. Empirical and theoretical findings show that fully attentive models maintain stronger gradient signals and can encode deeper context chains, while purely local attention risks vanishing gradient problems unless recurrent or memory mechanisms are present (Kerg et al., 2020).
Error Guarantees: Complexity lower bounds show there is no general-purpose subquadratic algorithm for self-attention with the exponential or softmax kernel, even allowing for elementwise additive or multiplicative approximation; the bounds for sliding-window and local attention are tight up to lower-order factors under the Strong Exponential Time Hypothesis (SETH) (Keles et al., 2022).
Design Trade-offs: Increasing window size improves accuracy but increases compute and memory linearly. Dilated or hybrid schemes trade latency and granularity for global coverage. Non-adaptive schemes can miss rare but important long-range dependencies (Moritz et al., 2021, Sharma et al., 2021).

6. Specialized Constructions: Exclusion, Hardmax, and Logical Inference

Several recent restricted attention mechanisms encode additional structure:

Exclusive Self-Attention (XSA): Removes the component of the self-attention output parallel to the token’s own value vector, ensuring that attention focuses solely on contextual, not self, information (Zhai, 10 Mar 2026). This yields consistent gains in language modeling tasks across model sizes, particularly with long contexts.
Self-Attention for Logical Inference: By replacing softmax with hardmax and identity keys, restricted attention layers emulate step-wise symbolic proof systems for definite logic programs, demonstrating how restricted self-attention subsumes not only continuous sequence modeling but also discrete logical inference (Thuy et al., 2024).
Sparsification by Relevancy/Saliency: Screening mechanisms select a small buffer of states to attend over (e.g., via top-$k$ relevance), provably preserving gradient propagation in long sequences while remaining resource-efficient (Kerg et al., 2020).

These approaches further emphasize the flexibility of attention mechanisms under domain- or theory-driven constraints and provide new directions for tailored architectural inductive biases.
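The exclusion step described for XSA amounts to an orthogonal projection: the attention output for a token has its component along that token's own value vector subtracted. The sketch below shows only this projection, under assumptions: the function names and the `eps` guard for near-zero value vectors are hypothetical, and any normalization or per-head handling used in the cited work is not modeled.

```python
def remove_parallel_component(y, v, eps=1e-12):
    """Return y minus its projection onto v, so the result is orthogonal to v."""
    vv = sum(a * a for a in v)
    if vv < eps:
        return list(y)  # assumed guard: nothing to project out for a near-zero v
    coeff = sum(a * b for a, b in zip(y, v)) / vv
    return [a - coeff * b for a, b in zip(y, v)]

def exclusive_outputs(Y, V):
    """Apply the projection token by token: Y[i] is the attention output for token i."""
    return [remove_parallel_component(y, v) for y, v in zip(Y, V)]

# Toy usage: the first output loses its component along [1, 0].
Y = [[3.0, 4.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 2.0]]
X_out = exclusive_outputs(Y, V)
```

After the projection, each output row has zero inner product with the corresponding value vector, so whatever remains comes from directions contributed by other tokens.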

7. Open Problems, Future Research, and Evolving Landscape

Adaptive and Learnable Sparsity: Future directions include learning window size or dilation adaptively per layer, incorporating task-conditioned global slots (as in Longformer), or dynamically adjusting attention patterns (Sharma et al., 2021).
Entropy and Localization: Recent work highlights attention localization, where excessive sparsity can induce collapsed, low-entropy attention, starving multi-hop dependencies. Techniques such as belief propagation refinement explicitly inject multi-hop information, mitigating attention collapse in small models (Lee et al., 9 Sep 2025).
Hybridization and Multimodal Integration: Combining restricted attention with recurrence, memory, and architectural duality (dual streams) offers resource- and latency-limited models the means to achieve both efficiency and high-fidelity sequence modeling (Luo et al., 2021, Moritz et al., 2021).
Domain-Specific Designs: Restricted attention can be tuned for particular domains, e.g., speech, vision, or logical reasoning, providing a flexible apparatus for balancing hardware, inference, and learning constraints.
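Attention collapse of the kind described above can be monitored by measuring the entropy of each attention row. The following is a generic diagnostic sketch, not a method taken from the cited work; the function names and the normalization by $\log n$ are assumptions made for illustration.

```python
import math

def row_entropy(weights):
    """Shannon entropy in nats of one attention distribution (zero weights are skipped)."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)

def normalized_entropy(weights):
    """Entropy divided by log(n): 1.0 for uniform attention, 0.0 for one-hot attention."""
    n = len(weights)
    return row_entropy(weights) / math.log(n) if n > 1 else 0.0

uniform = normalized_entropy([0.25, 0.25, 0.25, 0.25])
collapsed = normalized_entropy([1.0, 0.0, 0.0, 0.0])
```

Values close to zero across many heads and layers would indicate that attention has localized onto single positions.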

Restricted self-attention remains foundational in scaling attention-based models to long sequences and resource-constrained settings, with an ongoing research focus on extracting maximal expressivity and robustness under severe computational constraints. The design, analysis, and application of restricted attention variants continue to drive advances in large-scale sequence modeling and specialized architectures across modalities (Moritz et al., 2021, Sharma et al., 2021, Luo et al., 2021, Kerg et al., 2020, Zhai, 10 Mar 2026, Keles et al., 2022, Thuy et al., 2024, Wang et al., 2019, Lee et al., 9 Sep 2025).