Time-Restricted Self-Attention (TRSA)

Updated 26 March 2026

Time-Restricted Self-Attention (TRSA) is a self-attention paradigm that confines each token’s attention to a symmetric local window, thereby reducing quadratic complexity.
It incorporates dilated and hierarchical extensions to efficiently combine local focus with long-range contextual dependencies.
TRSA has proven effective in applications like ASR and speech summarization, achieving near-full accuracy with significant reductions in computation and latency.

Time-Restricted Self-Attention (TRSA) is a self-attention paradigm for neural sequence models that constrains the set of key/value positions available to each query token to a bounded, typically symmetric window in the sequence, optionally coupled with mechanisms to capture long-range dependencies at lower resolution. TRSA was developed to address the prohibitive quadratic complexity of full self-attention in domains such as speech recognition and summarization with long input sequences, outperforming comparable alternatives in resource utilization and latency while achieving competitive accuracy (Moritz et al., 2021, Sharma et al., 2021, Moritz et al., 2021).

1. Mathematical Formulation and Core Principles

TRSA modifies the canonical multi-head self-attention operation. Let $X=(x_1,\ldots,x_N)$ denote the sequence of input vectors, with $x_t\in\mathbb{R}^{d_{model}}$ . Standard multi-head self-attention computes, for each layer and head,

$Q = X W^Q;\quad K = X W^K;\quad V = X W^V$

with $W^Q,W^K,W^V\in\mathbb{R}^{d_{model}\times d_k}$ , $d_k = d_{model}/H$ for $H$ heads.

In full attention, the output for query position $t$ is

$y_t = \sum_{j=1}^N \mathrm{softmax}_j\left(\frac{q_t \cdot k_j}{\sqrt{d_k}}\right)\, v_j.$

In TRSA, attention is restricted to a local window: for each $t$ , only positions $t' \in [t-w,\, t+w]$ (window size $R=2w+1$ ) contribute:

$y_t = \sum_{t'=t-w}^{t+w} \bar{\alpha}_{t,t'}\, v_{t'},\quad \bar{\alpha}_{t,t'} = \frac{\exp(q_t\cdot k_{t'}/\sqrt{d_k})}{\sum_{s=t-w}^{t+w}\exp(q_t\cdot k_s/\sqrt{d_k})}.$
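The windowed sum above can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not an implementation from the cited papers: the function name `windowed_attention` is hypothetical, positions are 0-based, and the window is assumed to be truncated at the sequence boundaries.

```python
import math

def windowed_attention(Q, K, V, w):
    """Single-head attention restricted to positions t-w .. t+w (0-based).

    Q and K are lists of N vectors of length d_k; V is a list of N vectors.
    Assumption: the window is clipped at the start and end of the sequence.
    """
    N, d_k = len(Q), len(Q[0])
    outputs = []
    for t in range(N):
        lo, hi = max(0, t - w), min(N - 1, t + w)
        # Scaled dot-product logits for the positions inside the window only.
        logits = [sum(q * k for q, k in zip(Q[t], K[s])) / math.sqrt(d_k)
                  for s in range(lo, hi + 1)]
        # Softmax over the window, shifted by the max for numerical stability.
        m = max(logits)
        weights = [math.exp(x - m) for x in logits]
        z = sum(weights)
        # Weighted sum of the value vectors inside the window.
        y = [sum(wt / z * V[lo + j][c] for j, wt in enumerate(weights))
             for c in range(len(V[0]))]
        outputs.append(y)
    return outputs
```

With `w = 0` each output equals its own value vector, and with `w >= N - 1` the function reduces to full attention, which gives two quick sanity checks on the restriction.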

Optionally, a dilation factor $D$ is used so that only every $D$ th key/value is available within the window, further reducing complexity:

$I_t = \left\{ i\,|\,\max(1, t-w)\leq i\leq \min(N, t+w),\, i\equiv t \pmod D\right\}.$

These restrictions are applied as a mask to the attention logits prior to softmax normalization.
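As an illustration of the index set $I_t$ and the corresponding mask, the sketch below builds both with 1-based positions to match the formula. The helper names `allowed_positions` and `additive_mask` are hypothetical, and the additive $-\infty$ convention is one common way to apply such a mask to logits, assumed here rather than taken from the cited papers.

```python
def allowed_positions(t, N, w, D=1):
    """Return I_t for a 1-based query position t: window clipped to [1, N],
    keeping only positions congruent to t modulo the dilation D."""
    lo, hi = max(1, t - w), min(N, t + w)
    return [i for i in range(lo, hi + 1) if (i - t) % D == 0]

def additive_mask(N, w, D=1):
    """N x N matrix with 0.0 where attention is allowed and -inf elsewhere.
    Adding it to the logits before softmax zeroes out the masked weights."""
    neg_inf = float("-inf")
    mask = [[neg_inf] * N for _ in range(N)]
    for t in range(1, N + 1):
        for i in allowed_positions(t, N, w, D):
            mask[t - 1][i - 1] = 0.0
    return mask
```

For example, `allowed_positions(5, 10, 4, 2)` keeps positions 1, 3, 5, 7, and 9: the window spans 1 to 9, and the dilation retains every second position aligned with the query.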

2. Long-Range Context: Dilated/Hierarchical Extensions

TRSA inherently limits context length. To address this, dilated self-attention augments the local window with a set of summary tokens encoding distant context at lower resolution (Moritz et al., 2021). The input sequence is segmented into non-overlapping chunks of length $M$ , producing $L=\lceil N/M \rceil$ chunks per head. For each chunk $l$ and head $i$ :

Keys and values within the chunk, $C^K_{i,l}$ and $C^V_{i,l}$, are summarized to $\Delta^K_{i,l}$ and $\Delta^V_{i,l}$ using one of:
- Subsampling: Take the first token of the chunk.
- Mean Pooling: Average all tokens in the chunk.
- Attention-Based Pooling (AP): Apply trainable attention heads over the chunk, optionally followed by a two-layer feed-forward processing module.

The summary tokens $\{\Delta^K_{i,l}\},\,\{\Delta^V_{i,l}\}$ for all $l$ are concatenated to the local window, so each query attends to both its local neighborhood and global summaries.
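The two parameter-free summarization methods and the concatenation step can be sketched as follows. This is a simplified illustration for one head: the names `chunk_summaries` and `keys_for_query` are hypothetical, positions are 0-based, the last chunk is assumed to be allowed to be shorter than $M$, and attention-based pooling is omitted because it requires trained parameters.

```python
def chunk_summaries(X, M, method="mean"):
    """Summarize non-overlapping chunks of length M into one vector each.
    Assumption: the final chunk may contain fewer than M vectors."""
    summaries = []
    for start in range(0, len(X), M):
        chunk = X[start:start + M]
        if method == "subsample":
            summaries.append(list(chunk[0]))      # first token of the chunk
        elif method == "mean":
            dim = len(chunk[0])
            summaries.append([sum(v[c] for v in chunk) / len(chunk)
                              for c in range(dim)])
        else:
            raise ValueError("method must be 'subsample' or 'mean'")
    return summaries

def keys_for_query(K, t, w, M, method="mean"):
    """Keys visible to 0-based query t: local window, then all chunk summaries."""
    lo, hi = max(0, t - w), min(len(K) - 1, t + w)
    return K[lo:hi + 1] + chunk_summaries(K, M, method)
```

The same construction would be applied to the values, so that each query sees at most $R + L$ key/value pairs instead of $N$.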

3. Computational Complexity

For input length $N$ , window size $R$ , and model dimension $d$ :

- Full self-attention: $\mathcal{O}(N^2 d)$.
- TRSA (windowed): $\mathcal{O}(N R d)$ (linear in $N$ if $R$ is fixed).
- TRSA + dilation: $\mathcal{O}(N(R + L)d)$, with summarization overhead dependent on the method (e.g., AP induces $\mathcal{O}(N B d)$ for $B$ heads).
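To make the three cost expressions concrete, the sketch below counts query-key score computations per head, ignoring the factor $d$ and the truncation of windows at the sequence edges. The function name `score_counts` and the example sizes are hypothetical.

```python
import math

def score_counts(N, R, M=None):
    """Approximate number of query-key scores per head.

    Returns (full, windowed, windowed_plus_summaries); the last entry is None
    when no chunk length M is given. Edge truncation of windows is ignored.
    """
    full = N * N
    windowed = N * R
    with_summaries = N * (R + math.ceil(N / M)) if M else None
    return full, windowed, with_summaries

# Hypothetical sizes: 1000 frames, window R = 41, chunk length M = 20.
full, windowed, with_summaries = score_counts(1000, 41, 20)
```

Note that $L=\lceil N/M \rceil$ grows with $N$, so for a fixed chunk length the summary term contributes roughly $N^2/M$ scores: smaller than full attention by a factor of $M$, but not linear in $N$.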

In empirical settings (e.g., $N\approx 10,000$ in speech), TRSA yields up to $250\times$ reduction in attention dot-products and memory relative to full attention (Sharma et al., 2021).
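As a rough consistency check on the order of magnitude, assume a window of $R\approx 40$ (the value quoted for ASR and summarization elsewhere in this article; the exact configuration behind the reported figure is not stated here). The ratio of full to windowed score computations is then

$\frac{N^2 d}{N R d} = \frac{N}{R} \approx \frac{10{,}000}{40} = 250.$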

4. Integration into Neural Architectures

TRSA can be substituted into Transformer or Conformer encoders by replacing the self-attention modules with their windowed counterparts. Encoder-decoder attention and decoder self-attention (for output sequences of manageable length) often remain full (Sharma et al., 2021). The window size $w$ (or $R$) and the dilation $D$ are treated as hyperparameters. In ASR and long-form summarization, experiments have shown that moderate window sizes ($R\approx 40$) and high dilation ($D=5$ to $55$) balance performance and efficiency best. The same $(w,D)$ settings may be shared across all encoder layers (Sharma et al., 2021).

Streaming implementations maintain unidirectional or limited look-ahead windows across all layers for causality, with predictable, constant latency proportional to the per-layer look-ahead and the number of layers (Moritz et al., 2021).
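The latency statement can be made concrete with a small calculation. The sketch below assumes every layer is allowed the same number of future frames and that look-ahead accumulates additively across layers. The function name, the 12-layer depth, the 2-frame look-ahead, and the 40 ms frame shift are hypothetical values chosen for illustration, not settings from the cited papers.

```python
def algorithmic_latency_ms(num_layers, lookahead_frames, frame_shift_ms):
    """Total look-ahead delay when each layer may see `lookahead_frames`
    future frames: the delays add up across stacked layers."""
    total_lookahead_frames = num_layers * lookahead_frames
    return total_lookahead_frames * frame_shift_ms

# Hypothetical configuration: 12 layers, 2 future frames each, 40 ms per frame.
latency = algorithmic_latency_ms(12, 2, 40)  # 24 frames of look-ahead
```

Because the result depends only on fixed hyperparameters, it is the same for every frame, which is what makes the latency constant and predictable.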

5. Empirical Performance and Trade-Offs

Major empirical results demonstrate that TRSA, with or without dilation, nearly matches the accuracy of full attention at a fraction (10–20%) of the computational cost:

- In ASR on LibriSpeech and WSJ, dilated TRSA models yield WER within $0.2$–$0.3\%$ absolute of full attention, at $6\times$–$10\times$ speedup (Moritz et al., 2021).
- End-to-end speech summarization using TRSA achieves $\approx 3.6$ absolute ROUGE-L improvement over cascaded ASR+summarizer baselines, and enables processing of sequences too long for full attention, with $250\times$ computation and memory savings (Sharma et al., 2021).
- In streaming ASR, TRSA achieves near-offline WER while ensuring constant algorithmic latency and supporting frame-synchronous processing. Compared to chunk-based attention, TRSA enables exact control over context window and latency (Moritz et al., 2021).

Key ablation findings include:

- Smaller windows incur WER/ROUGE penalties that are mostly recoverable by dilation.
- Attention-based pooling for summary tokens dominates mean-pooling and subsampling; adding post-processing further enhances robustness.
- Increasing attention heads for summary tokens offers diminishing returns.
- Chunk size $M$ trades off context coarseness and efficiency; for ASR, $M\approx 20$ yields a favorable balance (Moritz et al., 2021).

6. Relation to Other Attention Mechanisms

TRSA is distinct from:

- Chunk-based Self-Attention (CSA): CSA divides the sequence into overlapping chunks and applies attention locally, waiting for chunk completion before output, leading to non-uniform latency and repeated computation for overlapped frames. TRSA operates in a strictly sliding-window, frame-synchronous fashion, offering fixed latency and no recomputation (Moritz et al., 2021).
- Dual Causal/Non-Causal Attention: This variant limits total context expansion over layers by controlling the growth of accessible past and future frames, offering slightly improved performance over pure TRSA, but at increased implementation complexity (Moritz et al., 2021).
- Temporal Attention (not to be confused with TRSA): Temporal attention introduces time-specific embeddings into the attention calculation, but does not restrict attention by window or latency (Rosin et al., 2022).
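The latency contrast between chunk-based and sliding-window processing can be illustrated with a toy calculation. This sketch deliberately simplifies CSA to non-overlapping chunks and a single layer, and both function names are hypothetical. It shows only how long each frame waits for the future context it needs, not the full behavior of either method.

```python
def chunk_wait_frames(t, chunk_size):
    """Frames that 0-based frame t must wait until its chunk is complete
    (simplification: non-overlapping chunks)."""
    return chunk_size - 1 - (t % chunk_size)

def sliding_wait_frames(t, lookahead):
    """Frames that frame t must wait with a fixed look-ahead window."""
    return lookahead

# Hypothetical sizes: chunks of 4 frames versus a look-ahead of 2 frames.
chunk_waits = [chunk_wait_frames(t, 4) for t in range(8)]      # varies per frame
sliding_waits = [sliding_wait_frames(t, 2) for t in range(8)]  # constant
```

Under these assumptions the chunk-based wait cycles between 3 and 0 frames depending on where a frame falls in its chunk, while the sliding-window wait is the same for every frame.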

7. Applications and Future Directions

TRSA is primarily established in large-scale ASR, speech summarization, and spoken language understanding on long or streaming inputs (Moritz et al., 2021, Sharma et al., 2021, Moritz et al., 2021). Typical deployments include:

- Speech recognizers and summarizers on datasets like LibriSpeech, WSJ, How-2, HKUST, and Switchboard.
- Streaming and low-latency inference scenarios due to deterministic, layer-wise control over look-behind and look-ahead.

A plausible implication is that TRSA's locality and efficiency make it adaptable to other sequential and time-series domains where quadratic attention costs are prohibitive and bounded latency is required.

References:

"Capturing Multi-Resolution Context by Dilated Self-Attention" (Moritz et al., 2021)
"Speech Summarization using Restricted Self-Attention" (Sharma et al., 2021)
"Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition" (Moritz et al., 2021)
"Temporal Attention for Language Models" (Rosin et al., 2022)
