
Time-Restricted Self-Attention (TRSA)

Updated 26 March 2026
  • Time-Restricted Self-Attention (TRSA) is a self-attention paradigm that confines each token’s attention to a bounded, typically symmetric local window, reducing the cost of attention from quadratic to linear in sequence length for a fixed window size.
  • Dilated and hierarchical extensions combine this local focus with long-range context captured at lower resolution.
  • TRSA has proven effective in ASR and speech summarization, reaching accuracy close to that of full attention with substantial reductions in computation and latency.

Time-Restricted Self-Attention (TRSA) is a self-attention paradigm for neural sequence models that constrains the set of key/value positions available to each query token to a bounded, typically symmetric window in the sequence, optionally coupled with mechanisms to capture long-range dependencies at lower resolution. TRSA was developed to address the prohibitive quadratic complexity of full self-attention in domains such as speech recognition and summarization with long input sequences, outperforming comparable alternatives in resource utilization and latency while achieving competitive accuracy (Moritz et al., 2021, Sharma et al., 2021, Moritz et al., 2021).

1. Mathematical Formulation and Core Principles

TRSA modifies the canonical multi-head self-attention operation. Let $X=(x_1,\ldots,x_N)$ denote the sequence of input vectors, with $x_t\in\mathbb{R}^{d_{model}}$. Standard multi-head self-attention computes, for each layer and head,

$$Q = X W^Q;\quad K = X W^K;\quad V = X W^V$$

with $W^Q,W^K,W^V\in\mathbb{R}^{d_{model}\times d_k}$, $d_k = d_{model}/H$ for $H$ heads.

In full attention, the output for query position $t$ is

$$y_t = \sum_{j=1}^N \mathrm{softmax}_j\left(\frac{q_t \cdot k_j}{\sqrt{d_k}}\right)\, v_j.$$

In TRSA, attention is restricted to a local window: for each $t$, only positions $t' \in [t-w,\, t+w]$ (window size $R=2w+1$) contribute:

$$y_t = \sum_{t'=t-w}^{t+w} \bar{\alpha}_{t,t'}\, v_{t'},\quad \bar{\alpha}_{t,t'} = \frac{\exp(q_t\cdot k_{t'}/\sqrt{d_k})}{\sum_{s=t-w}^{t+w}\exp(q_t\cdot k_s/\sqrt{d_k})}.$$

Optionally, a dilation factor $D$ is used so that only every $D$-th key/value is available within the window, further reducing complexity:

$$I_t = \left\{ i \mid \max(1, t-w)\leq i\leq \min(N, t+w),\ i\equiv t \pmod D\right\}.$$

These restrictions are applied as a mask to the attention logits prior to softmax normalization.
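The masking step can be sketched in NumPy. This is a minimal single-head illustration under stated assumptions, not the cited authors' implementation: the function names `trsa_mask` and `trsa_attention` are hypothetical, positions are 0-based, and the dense $N\times N$ logit matrix is built only for clarity, so the sketch does not itself realize the complexity savings discussed later.

```python
import numpy as np

def trsa_mask(n, w, dilation=1):
    """Boolean (n, n) mask; entry [t, i] is True if query t may attend to key i."""
    t = np.arange(n)[:, None]           # query positions (0-based)
    i = np.arange(n)[None, :]           # key positions (0-based)
    in_window = np.abs(i - t) <= w      # t - w <= i <= t + w
    on_grid = (i - t) % dilation == 0   # i ≡ t (mod D)
    return in_window & on_grid

def trsa_attention(q, k, v, w, dilation=1):
    """Single-head windowed attention; q, k, v have shape (n, d_k)."""
    n, d_k = q.shape
    logits = q @ k.T / np.sqrt(d_k)
    # Mask the logits before softmax: disallowed positions get -inf.
    logits = np.where(trsa_mask(n, w, dilation), logits, -np.inf)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Illustrative usage with random inputs (sizes are arbitrary).
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(8, 4)) for _ in range(3))
y, weights = trsa_attention(q, k, v, w=2)
```

Because each query always sees its own position, no row of the mask is empty and the softmax stays well defined. An efficient implementation would gather only the $R$ keys per query instead of masking a full matrix.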

2. Long-Range Context: Dilated/Hierarchical Extensions

TRSA inherently limits context length. To address this, dilated self-attention augments the local window with a set of summary tokens encoding distant context at lower resolution (Moritz et al., 2021). The input sequence is segmented into non-overlapping chunks of length $M$, producing $L=\lceil N/M \rceil$ chunks per head. For each chunk $l$ and head $i$:

  • Keys and values within the chunk, $C^K_{i,l}$ and $C^V_{i,l}$, are summarized to $\Delta^K_{i,l}$ and $\Delta^V_{i,l}$ using one of:
    • Subsampling: Take the first token of the chunk.
    • Mean Pooling: Average all tokens in the chunk.
    • Attention-Based Pooling (AP): Apply trainable attention heads over the chunk, optionally followed by a two-layer feed-forward processing module.

The summary tokens $\{\Delta^K_{i,l}\}$ and $\{\Delta^V_{i,l}\}$ for all $l$ are concatenated to the local window, so each query attends to both its local neighborhood and the global summaries.
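The chunk summarization and concatenation can be sketched as follows. This is an illustrative single-head NumPy version covering only the subsampling and mean-pooling options; attention-based pooling is omitted because it needs trained parameters. The names `summarize_chunks` and `dilated_attention` are hypothetical, and the dense masking is again chosen for readability rather than efficiency.

```python
import numpy as np

def summarize_chunks(x, m, method="mean"):
    """Reduce each length-m chunk of x (shape (n, d)) to one summary vector."""
    n, d = x.shape
    num_chunks = -(-n // m)  # L = ceil(n / m); the last chunk may be shorter
    summaries = np.empty((num_chunks, d))
    for l in range(num_chunks):
        chunk = x[l * m:(l + 1) * m]
        summaries[l] = chunk[0] if method == "subsample" else chunk.mean(axis=0)
    return summaries

def dilated_attention(q, k, v, w, m, method="mean"):
    """Each query attends to its local window plus all L chunk summaries."""
    n, d_k = q.shape
    # Append the summary keys/values after the N frame-level keys/values.
    k_ext = np.concatenate([k, summarize_chunks(k, m, method)])
    v_ext = np.concatenate([v, summarize_chunks(v, m, method)])
    num_chunks = k_ext.shape[0] - n
    pos = np.arange(n)
    local = np.abs(pos[:, None] - pos[None, :]) <= w
    # Summary tokens are visible to every query.
    mask = np.concatenate([local, np.ones((n, num_chunks), dtype=bool)], axis=1)
    logits = np.where(mask, q @ k_ext.T / np.sqrt(d_k), -np.inf)
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v_ext, weights
```

The returned weight matrix has $N$ rows and $N+L$ columns; the last $L$ columns hold the attention paid to the summaries, which share a single softmax with the local window.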

3. Computational Complexity

For input length $N$, window size $R$, and model dimension $d$:

  • Full self-attention: $\mathcal{O}(N^2 d)$.
  • TRSA (windowed): $\mathcal{O}(N R d)$ (linear in $N$ if $R$ is fixed).
  • TRSA + dilation: $\mathcal{O}(N(R + L)d)$, with summarization overhead dependent on the method (e.g., AP induces $\mathcal{O}(N B d)$ for $B$ heads).

In empirical settings (e.g., $N\approx 10{,}000$ in speech), TRSA yields up to a $250\times$ reduction in attention dot-products and memory relative to full attention (Sharma et al., 2021).
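A back-of-the-envelope count makes these orders of growth concrete. The sketch below counts query-key dot products per head and layer, ignoring boundary effects and constant factors; the helper `score_count` is hypothetical. Pairing $N\approx 10{,}000$ with the window size $R\approx 40$ quoted in Section 4 is an assumption made here for illustration, and it happens to reproduce a $250\times$ ratio, though the cited work may have derived its figure differently.

```python
import math

def score_count(n, r=None, m=None):
    """Approximate number of query-key dot products per head and layer."""
    if r is None:
        return n * n                     # full attention: N^2
    per_query = r                        # local window of R keys
    if m is not None:
        per_query += math.ceil(n / m)    # plus L = ceil(N / M) summary tokens
    return n * per_query

n, r = 10_000, 40                        # assumed pairing, for illustration only
full = score_count(n)
windowed = score_count(n, r)
print(full // windowed)  # 250
```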

4. Integration into Neural Architectures

TRSA can be integrated into Transformer or Conformer encoders by replacing the self-attention modules with their windowed counterparts. Encoder-decoder attention and decoder self-attention often remain full, since output sequences are of manageable length (Sharma et al., 2021). The window size $w$ (or $R$) and the dilation $D$ are treated as hyperparameters. In ASR and long-form summarization, experiments indicate that moderate window sizes ($R\approx 40$) combined with high dilation ($D = 5$ to $55$) give a good balance between performance and efficiency. The same window and dilation settings may be shared across all encoder layers (Sharma et al., 2021).

Streaming implementations use unidirectional or limited look-ahead windows in all layers to preserve causality, giving predictable, constant latency proportional to the per-layer look-ahead and the number of layers (Moritz et al., 2021).
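The dependence of latency on depth follows from stacking: if each layer may look a fixed number of frames ahead, the output at frame $t$ after several layers depends on inputs up to $t$ plus the sum of the per-layer look-aheads. A minimal sketch, assuming the same look-ahead in every layer; the function names and the example values (2 frames, 12 layers, 40 ms frame shift) are hypothetical and not taken from the cited papers:

```python
def total_lookahead_frames(lookahead_per_layer, num_layers):
    """Future frames needed before the top layer can emit frame t."""
    return lookahead_per_layer * num_layers

def algorithmic_latency_ms(lookahead_per_layer, num_layers, frame_shift_ms):
    """Algorithmic latency implied by the accumulated look-ahead."""
    return total_lookahead_frames(lookahead_per_layer, num_layers) * frame_shift_ms

# Hypothetical configuration: 2 look-ahead frames per layer, 12 layers, 40 ms frames.
print(algorithmic_latency_ms(2, 12, 40.0))  # 960.0
```

Setting the look-ahead to zero in every layer gives a fully causal encoder with no algorithmic delay from attention.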

5. Empirical Performance and Trade-Offs

The main empirical results show that TRSA, with or without dilation, nearly matches the accuracy of full attention at a fraction (10–20%) of the computational cost:

  • In ASR on LibriSpeech and WSJ, dilated TRSA models yield WER within $0.2$–$0.3\%$ absolute of full attention, at a $6\times$–$10\times$ speedup (Moritz et al., 2021).
  • End-to-end speech summarization using TRSA achieves an absolute ROUGE-L improvement of $\approx 3.6$ over cascaded ASR+summarizer baselines, and enables processing of sequences too long for full attention, with $250\times$ computation and memory savings (Sharma et al., 2021).
  • In streaming ASR, TRSA achieves near-offline WER while ensuring constant algorithmic latency and supporting frame-synchronous processing. Compared to chunk-based attention, TRSA enables exact control over context window and latency (Moritz et al., 2021).

Key ablation findings include:

  • Smaller windows incur WER/ROUGE penalties that are mostly recoverable by dilation.
  • Attention-based pooling for summary tokens dominates mean-pooling and subsampling; adding post-processing further enhances robustness.
  • Increasing attention heads for summary tokens offers diminishing returns.
  • Chunk size $M$ trades off context coarseness and efficiency; for ASR, $M\approx 20$ yields a favorable balance (Moritz et al., 2021).

6. Comparison with Related Attention Mechanisms

TRSA is distinct from:

  • Chunk-based Self-Attention (CSA): CSA divides the sequence into overlapping chunks and applies attention locally, waiting for chunk completion before output, leading to non-uniform latency and repeated computation for overlapped frames. TRSA operates in a strictly sliding-window, frame-synchronous fashion, offering fixed latency and no recomputation (Moritz et al., 2021).
  • Dual Causal/Non-Causal Attention: This variant limits total context expansion over layers by controlling the growth of accessible past and future frames, offering slightly improved performance over pure TRSA, but at increased implementation complexity (Moritz et al., 2021).
  • Temporal Attention (not to be confused with TRSA): Temporal attention introduces time-specific embeddings into the attention calculation, but does not restrict attention by window or latency (Rosin et al., 2022).
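The latency contrast with chunk-based attention can be illustrated with a simplified model. The sketch assumes non-overlapping chunks and a single layer, which is coarser than the overlapping-chunk scheme described above, and the helper names are hypothetical. It only shows why waiting for chunk completion makes the delay depend on a frame's position, whereas a sliding window delays every frame equally.

```python
def chunk_wait(t, chunk_size):
    """Frames that 0-based frame t waits until its chunk is complete."""
    return chunk_size - 1 - (t % chunk_size)

def sliding_wait(t, lookahead):
    """Frames that frame t waits under a fixed look-ahead window."""
    return lookahead

print([chunk_wait(t, 4) for t in range(8)])    # [3, 2, 1, 0, 3, 2, 1, 0]
print([sliding_wait(t, 2) for t in range(8)])  # [2, 2, 2, 2, 2, 2, 2, 2]
```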

7. Applications and Future Directions

TRSA is primarily established in large-scale ASR, speech summarization, and spoken language understanding on long or streaming inputs (Moritz et al., 2021, Sharma et al., 2021, Moritz et al., 2021). Typical deployments include:

  • Speech recognizers and summarizers on datasets like LibriSpeech, WSJ, How-2, HKUST, and Switchboard.
  • Streaming and low-latency inference scenarios due to deterministic, layer-wise control over look-behind and look-ahead.

A plausible implication is that TRSA's locality and efficiency make it adaptable to other sequential and time-series domains where quadratic attention costs are prohibitive and bounded latency is required.

