Time-Restricted Self-Attention (TRSA)
- Time-Restricted Self-Attention (TRSA) is a self-attention paradigm that confines each token’s attention to a symmetric local window, which reduces the quadratic cost of full attention to one that is linear in sequence length for a fixed window.
- It incorporates dilated and hierarchical extensions to efficiently combine local focus with long-range contextual dependencies.
- TRSA has proven effective in applications like ASR and speech summarization, achieving near-full accuracy with significant reductions in computation and latency.
Time-Restricted Self-Attention (TRSA) is a self-attention paradigm for neural sequence models that constrains the set of key/value positions available to each query token to a bounded, typically symmetric window in the sequence, optionally coupled with mechanisms to capture long-range dependencies at lower resolution. TRSA was developed to address the prohibitive quadratic complexity of full self-attention in domains such as speech recognition and summarization with long input sequences, outperforming comparable alternatives in resource utilization and latency while achieving competitive accuracy (Moritz et al., 2021, Sharma et al., 2021, Moritz et al., 2021).
1. Mathematical Formulation and Core Principles
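TRSA modifies the canonical multi-head self-attention operation. Let $X = (x_1, \dots, x_T)$ denote the sequence of input vectors, with $x_t \in \mathbb{R}^{d}$. Standard multi-head self-attention computes, for each layer and head $i$,
$$Q^{(i)} = X W_Q^{(i)}, \quad K^{(i)} = X W_K^{(i)}, \quad V^{(i)} = X W_V^{(i)},$$
with $W_Q^{(i)}, W_K^{(i)}, W_V^{(i)} \in \mathbb{R}^{d \times d_k}$ and $d_k = d/h$, for $h$ heads.
In full attention, the output for query position $t$ is (head index omitted)
$$o_t = \sum_{s=1}^{T} \alpha_{t,s}\, v_s, \qquad \alpha_{t,s} = \frac{\exp\!\left(q_t^\top k_s / \sqrt{d_k}\right)}{\sum_{s'=1}^{T} \exp\!\left(q_t^\top k_{s'} / \sqrt{d_k}\right)}.$$
In TRSA, attention is restricted to a local window: for each $t$, only positions $s \in \mathcal{W}_t = \{\, s : 1 \le s \le T,\ |s - t| \le w \,\}$ (window size $W = 2w + 1$) contribute:
$$o_t = \sum_{s \in \mathcal{W}_t} \alpha_{t,s}\, v_s, \qquad \alpha_{t,s} = \frac{\exp\!\left(q_t^\top k_s / \sqrt{d_k}\right)}{\sum_{s' \in \mathcal{W}_t} \exp\!\left(q_t^\top k_{s'} / \sqrt{d_k}\right)}.$$
Optionally, a dilation factor $r$ is used so that only every $r$th key/value is available within the window, further reducing complexity:
$$\mathcal{W}_t^{(r)} = \{\, s \in \mathcal{W}_t : (s - t) \bmod r = 0 \,\}.$$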
These restrictions are applied as a mask to the attention logits prior to softmax normalization.
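The masking step can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the implementation from the cited papers. The function names `trsa_mask` and `restricted_attention`, the half-width parameter `w` (window size `2 * w + 1`), and the dilation parameter `r` are assumptions introduced here.

```python
import numpy as np

def trsa_mask(T, w, r=1):
    """Boolean (T, T) mask: entry [t, s] is True if query t may attend to key s.

    w is the half-width of the symmetric window and r the dilation factor;
    both names are illustrative assumptions.
    """
    idx = np.arange(T)
    offset = idx[None, :] - idx[:, None]   # s - t for every (t, s) pair
    in_window = np.abs(offset) <= w        # symmetric local window
    on_grid = (offset % r) == 0            # keep every r-th position
    return in_window & on_grid

def restricted_attention(Q, K, V, mask):
    """Single-head attention with masked logits set to -inf before softmax."""
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)
    logits = np.where(mask, logits, -np.inf)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)                               # exp(-inf) == 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy usage with random inputs (sizes are arbitrary).
rng = np.random.default_rng(0)
T, d_k = 8, 4
Q, K, V = (rng.standard_normal((T, d_k)) for _ in range(3))
mask = trsa_mask(T, w=2)
out, weights = restricted_attention(Q, K, V, mask)
```

Because the offset zero always satisfies both conditions, every query keeps at least its own position, so no softmax row is empty. A dense mask like this still computes all `T * T` logits; it shows the semantics only, and an efficient implementation would compute just the in-window scores.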
2. Long-Range Context: Dilated/Hierarchical Extensions
TRSA inherently limits context length. To address this, dilated self-attention augments the local window with a set of summary tokens that encode distant context at lower resolution (Moritz et al., 2021). The input sequence is segmented into non-overlapping chunks of length $M$, producing $\lceil T/M \rceil$ chunks per head, where $T$ is the sequence length. For each chunk $c$ and head $i$:
- Keys and values within the chunk, $K_c^{(i)}$ and $V_c^{(i)}$, are summarized to a single pair $\bar{k}_c^{(i)}$, $\bar{v}_c^{(i)}$, using one of:
  - Subsampling: Take the first token of the chunk.
  - Mean Pooling: Average all tokens in the chunk.
  - Attention-Based Pooling (AP): Apply trainable attention heads over the chunk, optionally followed by a two-layer feed-forward processing module.
The summary tokens for all chunks $c$ are concatenated to the local window, so each query attends to both its local neighborhood and global summaries.
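The chunk summarization and concatenation steps can be sketched as follows. This is a simplified single-head, offline illustration covering only the two parameter-free summarizers (subsampling and mean pooling); attention-based pooling is left out because it needs trained weights. The names `summarize_chunks` and `dilated_trsa`, the half-width `w`, and the chunk length `M` are assumptions made for this sketch, and letting every query see the summaries of all chunks is a simplification for the non-streaming case.

```python
import numpy as np

def summarize_chunks(K, V, M, method="mean"):
    """Return one summary key and value per non-overlapping chunk of length M."""
    T = K.shape[0]
    n_chunks = -(-T // M)  # ceil(T / M); the last chunk may be shorter
    k_sum, v_sum = [], []
    for c in range(n_chunks):
        k_c, v_c = K[c * M:(c + 1) * M], V[c * M:(c + 1) * M]
        if method == "subsample":      # take the first token of the chunk
            k_sum.append(k_c[0])
            v_sum.append(v_c[0])
        elif method == "mean":         # average all tokens of the chunk
            k_sum.append(k_c.mean(axis=0))
            v_sum.append(v_c.mean(axis=0))
        else:
            raise ValueError(f"unknown method: {method}")
    return np.stack(k_sum), np.stack(v_sum)

def dilated_trsa(Q, K, V, w, M, method="mean"):
    """Each query attends to its local window plus all chunk summaries."""
    T, d_k = Q.shape
    K_sum, V_sum = summarize_chunks(K, V, M, method)
    out = np.empty_like(V, dtype=float)
    for t in range(T):
        lo, hi = max(0, t - w), min(T, t + w + 1)
        keys = np.concatenate([K[lo:hi], K_sum])     # local keys + summaries
        values = np.concatenate([V[lo:hi], V_sum])
        logits = keys @ Q[t] / np.sqrt(d_k)
        weights = np.exp(logits - logits.max())      # stable softmax
        weights = weights / weights.sum()
        out[t] = weights @ values
    return out

# Toy usage: 10 frames, chunks of 4 frames, half-width 1.
K_demo = np.arange(20.0).reshape(10, 2)
Q_demo = np.random.default_rng(0).standard_normal((10, 2))
y = dilated_trsa(Q_demo, K_demo, np.ones((10, 3)), w=1, M=4)
```

With 10 frames and `M = 4`, each query sees at most 3 local positions plus 3 summary tokens, rather than all 10 frames at full resolution.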
3. Computational Complexity
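For input length $T$, window size $W$, and model dimension $d$:
- Full self-attention: $O(T^2 d)$.
- TRSA (windowed): $O(T W d)$ (linear in $T$ if $W$ is fixed).
- TRSA + dilation: $O\big(T\,(W + T/M)\,d\big)$ when each query also attends to one summary token per chunk of length $M$, plus a summarization overhead that depends on the method (e.g., AP adds a cost that grows with the number of pooling heads).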
In empirical speech settings, TRSA substantially reduces the number of attention dot-products and the memory footprint relative to full attention (Sharma et al., 2021).
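To make the scaling concrete, the number of query-key dot products per head can be counted directly. The helper `attention_scores` and the sizes used below (1000 frames, half-width 10, chunk length 20) are hypothetical values chosen for illustration; they are not figures from the cited papers.

```python
import math

def attention_scores(T, w=None, M=None):
    """Count query-key dot products per head for a length-T sequence.

    w=None -> full attention; otherwise a symmetric window of half-width w.
    M=None -> no summary tokens; otherwise one summary per chunk of length M,
    visible to every query (an offline simplification).
    """
    if w is None:
        return T * T
    # Window of query t covers positions max(0, t-w) .. min(T-1, t+w).
    local = sum(min(T - 1, t + w) - max(0, t - w) + 1 for t in range(T))
    summaries = T * math.ceil(T / M) if M else 0
    return local + summaries

full = attention_scores(1000)
windowed = attention_scores(1000, w=10)
with_summaries = attention_scores(1000, w=10, M=20)
print(full, windowed, with_summaries)
```

The windowed count grows linearly in `T` for fixed `w`, while the summary term `T * ceil(T / M)` remains quadratic in `T` but is scaled down by the chunk length.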
4. Integration into Neural Architectures
TRSA can be introduced into Transformer or Conformer encoders by replacing the self-attention modules with their windowed counterparts. Encoder-decoder attention and decoder self-attention (for output sequences of manageable length) often remain full (Sharma et al., 2021). The window size and the dilation factor are treated as hyperparameters. In ASR and long-form summarization, experiments have shown that moderate window sizes combined with high dilation factors give a good balance between performance and efficiency. The same settings may be shared across all encoder layers (Sharma et al., 2021).
Streaming implementations maintain unidirectional or limited look-ahead windows across all layers for causality, which gives a predictable, constant latency proportional to the per-layer look-ahead and the number of layers (Moritz et al., 2021).
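The way look-ahead accumulates over stacked layers can be written out explicitly: if each layer looks a fixed number of frames ahead of the layer below it, the top layer depends on input frames that many times the layer count into the future. The sketch below assumes the same window in every layer. The function names and the example values (12 layers, 1 frame of look-ahead, 15 frames of look-back, 40 ms frame shift) are hypothetical and not taken from the cited papers.

```python
def stacked_context(look_back, look_ahead, num_layers):
    """Past and future input frames visible at the output of the top layer."""
    return look_back * num_layers, look_ahead * num_layers

def algorithmic_latency_ms(look_ahead, num_layers, frame_shift_ms):
    """Delay caused purely by waiting for future frames, in milliseconds."""
    return look_ahead * num_layers * frame_shift_ms

# Hypothetical encoder: 12 layers, 15 frames back and 1 frame ahead per layer.
past, future = stacked_context(look_back=15, look_ahead=1, num_layers=12)
latency = algorithmic_latency_ms(look_ahead=1, num_layers=12, frame_shift_ms=40)
print(past, future, latency)
```

Setting the per-layer look-ahead to zero makes the encoder fully causal, with no algorithmic latency beyond the current frame.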
5. Empirical Performance and Trade-Offs
The main empirical results show that TRSA, with or without dilation, nearly matches the accuracy of full attention at a fraction (10–20%) of the computational cost:
- In ASR on LibriSpeech and WSJ, dilated TRSA models yield WER close to that of full attention (the reported gap starts at $0.2$ absolute) while running faster (Moritz et al., 2021).
- End-to-end speech summarization using TRSA improves absolute ROUGE-L over cascaded ASR+summarizer baselines and enables processing of sequences too long for full attention, with savings in computation and memory (Sharma et al., 2021).
- In streaming ASR, TRSA achieves near-offline WER while ensuring constant algorithmic latency and supporting frame-synchronous processing. Compared to chunk-based attention, TRSA enables exact control over context window and latency (Moritz et al., 2021).
Key ablation findings include:
- Smaller windows incur WER/ROUGE penalties that are mostly recoverable by dilation.
- Attention-based pooling for summary tokens dominates mean-pooling and subsampling; adding post-processing further enhances robustness.
- Increasing attention heads for summary tokens offers diminishing returns.
- Chunk size trades off context coarseness and efficiency; for ASR, yields a favorable balance (Moritz et al., 2021).
6. Comparison with Related Methods
TRSA is distinct from:
- Chunk-based Self-Attention (CSA): CSA divides the sequence into overlapping chunks and applies attention locally, waiting for chunk completion before output, leading to non-uniform latency and repeated computation for overlapped frames. TRSA operates in a strictly sliding-window, frame-synchronous fashion, offering fixed latency and no recomputation (Moritz et al., 2021).
- Dual Causal/Non-Causal Attention: This variant limits total context expansion over layers by controlling the growth of accessible past and future frames, offering slightly improved performance over pure TRSA, but at increased implementation complexity (Moritz et al., 2021).
- Temporal Attention (not to be confused with TRSA): Temporal attention introduces time-specific embeddings into the attention calculation, but does not restrict attention by window or latency (Rosin et al., 2022).
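The latency contrast between chunk-based attention and TRSA in the list above can be illustrated by computing how many future frames each position has to wait for. This is a simplified sketch: it assumes non-overlapping chunks for the chunk-based case (the method described above uses overlapping chunks), and the names `csa_look_ahead` and `trsa_look_ahead` are introduced here for illustration only.

```python
def csa_look_ahead(T, chunk):
    """Future frames each position waits for when output is emitted per chunk."""
    # Position t must wait until the last frame of its own chunk has arrived.
    return [min(T, (t // chunk + 1) * chunk) - 1 - t for t in range(T)]

def trsa_look_ahead(T, w):
    """Future frames each position waits for with a sliding window of half-width w."""
    # Every position waits for w frames, except near the end of the sequence.
    return [min(T - 1, t + w) - t for t in range(T)]

csa = csa_look_ahead(8, chunk=4)
trsa = trsa_look_ahead(8, w=2)
print(csa)
print(trsa)
```

In the chunk-based case the wait depends on where a frame falls inside its chunk, while the sliding window gives every frame the same wait until the sequence ends.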
7. Applications and Future Directions
TRSA is primarily established in large-scale ASR, speech summarization, and spoken language understanding on long or streaming inputs (Moritz et al., 2021, Sharma et al., 2021, Moritz et al., 2021). Typical deployments include:
- Speech recognizers and summarizers on datasets like LibriSpeech, WSJ, How-2, HKUST, and Switchboard.
- Streaming and low-latency inference scenarios due to deterministic, layer-wise control over look-behind and look-ahead.
A plausible implication is that TRSA's locality and efficiency make it adaptable to other sequential and time-series domains where quadratic attention costs are prohibitive and bounded latency is required.
References:
- "Capturing Multi-Resolution Context by Dilated Self-Attention" (Moritz et al., 2021)
- "Speech Summarization using Restricted Self-Attention" (Sharma et al., 2021)
- "Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition" (Moritz et al., 2021)
- "Temporal Attention for Language Models" (Rosin et al., 2022)