
Semi-Local Attention Mechanism

Updated 10 March 2026
  • Semi-Local Attention (SLA) is a hybrid attention mechanism integrating fixed sliding-window and residual linear streams to capture both local and global contexts.
  • It efficiently scales with sequence length by reducing computational and memory costs compared to full-attention models.
  • Kernel optimizations enable SLA to run effectively on modern accelerators, making it practical for large-scale transformer architectures.

Semi-Local Attention (SLA) is a hybrid attention mechanism that integrates a fixed-size sliding-window (local) attention with a residual linear-attention stream designed to capture contextual information from tokens outside the window. Introduced in the context of the RATTENTION model, SLA addresses intrinsic limitations of local-global attention architectures by enabling efficient context aggregation that scales favorably with sequence length, while maintaining or exceeding the accuracy of full-attention models at drastically reduced computational and memory costs (Wang et al., 18 Jun 2025).

1. Mathematical Structure of Semi-Local Attention

SLA fuses two computations: standard sliding-window attention (“SWA”) and a residual linear attention (“RLA”) stream.

1.1 Sliding-Window Attention

Given queries, keys, and values $Q, K, V \in \mathbb{R}^{n \times d}$, for each position $i \in [1, n]$ compute attention coefficients $\alpha_{i,j}$ over the window $j \in [\max(1, i-w+1), i]$:

$$\alpha_{i,j} = \frac{\exp(Q_i K_j^\top / \sqrt{d})}{\sum_{k=\max(1,i-w+1)}^{i} \exp(Q_i K_k^\top / \sqrt{d})}$$

The local attention output is:

$$y_i^{\mathrm{swa}} = \sum_{j=\max(1,i-w+1)}^{i} \alpha_{i,j} V_j$$
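As an illustrative sketch, the windowed softmax above can be written directly in NumPy (single head, 0-indexed positions; the function name is our own, not from the paper):

```python
import numpy as np

def sliding_window_attention(Q, K, V, w):
    """Causal sliding-window attention: position i attends to [max(0, i-w+1), i]."""
    n, d = Q.shape
    Y = np.zeros_like(V)
    for i in range(n):
        lo = max(0, i - w + 1)
        scores = Q[i] @ K[lo:i + 1].T / np.sqrt(d)  # scores over the window
        alpha = np.exp(scores - scores.max())       # numerically stable softmax
        alpha /= alpha.sum()
        Y[i] = alpha @ V[lo:i + 1]
    return Y
```

The explicit loop mirrors the per-position definition; production kernels instead compute banded score blocks in parallel.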

1.2 Residual Linear Attention

A feature map $\phi: \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$ satisfying $\langle \phi(x), \phi(x') \rangle \approx \exp(x^\top x' / \sqrt{d})$ is employed. In practice, $\phi(x) = \exp(x)$ (elementwise), with RMSNorm. The recurrent state $S_t \in \mathbb{R}^{d' \times d}$ is updated sequentially:

$$S_t = S_{t-1} + \phi(K_t)^\top V_t, \quad S_0 = 0$$

At position $i$, the RLA stream reads from $S_{i-w}$:

$$y_i^{\mathrm{rla}} = \phi(Q_i)\, S_{i-w}$$

For $i \le w$, $S_{i-w} = 0$.
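A minimal sketch of the RLA stream, assuming $\phi = \exp$ and $d' = d$ (the function name and the choice to store every intermediate state are ours, for clarity):

```python
import numpy as np

def residual_linear_attention(Q, K, V, w, phi=np.exp):
    """RLA stream: position i reads S_{i-w}, which summarizes exactly the
    tokens that have already fallen outside the sliding window."""
    n, d = Q.shape
    S = np.zeros((d, d))
    states = [S.copy()]                        # states[t] holds S_t, with S_0 = 0
    for t in range(n):
        S = S + np.outer(phi(K[t]), V[t])      # S_t = S_{t-1} + phi(K_t)^T V_t
        states.append(S.copy())
    Y = np.zeros_like(V)
    for i in range(n):
        if i + 1 > w:                          # 1-indexed i <= w reads S_{i-w} = 0
            Y[i] = phi(Q[i]) @ states[i + 1 - w]
    return Y
```

Keeping every $S_t$ costs $O(nd^2)$ memory here; a real implementation streams a lag-$w$ buffer and keeps only a constant number of states.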

1.3 Fusion

Both streams are fused:

$$\tilde{y}_i = \mathrm{RMSNorm}(y_i^{\mathrm{swa}}) + \mathrm{RMSNorm}(y_i^{\mathrm{rla}}) \in \mathbb{R}^d$$

$\tilde{y}_i$ is projected via $W_O \in \mathbb{R}^{d \times d}$ before the standard Transformer residual and feed-forward blocks.
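The fusion step is a pair of norms and a sum; a sketch, assuming an RMSNorm without a learned gain (both function names are ours):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm without a learned gain, applied along the feature dimension
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def fuse(y_swa, y_rla, W_O):
    # y_tilde = RMSNorm(y_swa) + RMSNorm(y_rla), then the output projection
    return (rms_norm(y_swa) + rms_norm(y_rla)) @ W_O
```

Normalizing each stream separately before summing keeps the local and residual contributions on a comparable scale regardless of window size.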

2. Layerwise Algorithmic Workflow

A single Transformer layer with SLA operates as follows:

  1. Linear projection: $Q = X W_Q$, $K = X W_K$, $V = X W_V$ for layer input $X \in \mathbb{R}^{n \times d}$.
  2. Iteration (across $i = 1, \ldots, n$):
    • Compute sliding-window attention over $[\max(1, i-w+1), i]$.
    • Update $S \gets S + \phi(K_i)^\top V_i$.
    • Read the residual stream $y_i^{\mathrm{rla}} = \phi(Q_i)\, S_{i-w}$ ($0$ if $i \le w$).
    • Fuse into $\tilde{y}_i$ as above.
  3. Output: $Y = \tilde{Y} W_O + X$.
  4. Feed-forward: $Z = \mathrm{SwiGLU}(\mathrm{LN}(Y))\, W_{FF} + Y$.
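Steps 1–3 can be sketched end to end as a single-head NumPy layer (feed-forward omitted; RMSNorm without learned gain; all names are ours, and storing every RLA state is for clarity only):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def sla_layer(X, W_Q, W_K, W_V, W_O, w, phi=np.exp):
    """One SLA pass: projections, per-position SWA + RLA, fusion, residual."""
    n, d = X.shape
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    S = np.zeros((d, d))                  # recurrent RLA state
    past = [S.copy()]                     # past[t] holds S_t
    Y_tilde = np.zeros_like(X)
    for i in range(n):
        # sliding-window attention over the last w positions
        lo = max(0, i - w + 1)
        s = Q[i] @ K[lo:i + 1].T / np.sqrt(d)
        a = np.exp(s - s.max())
        a /= a.sum()
        y_swa = a @ V[lo:i + 1]
        # RLA state update, then delayed read from S_{i-w}
        S = S + np.outer(phi(K[i]), V[i])
        past.append(S.copy())
        y_rla = phi(Q[i]) @ past[i + 1 - w] if i + 1 > w else np.zeros(d)
        Y_tilde[i] = rms_norm(y_swa) + rms_norm(y_rla)
    return Y_tilde @ W_O + X              # output projection + residual
```

Note that the state is updated before the read, but the read targets the state from $w$ steps earlier, so the RLA stream covers exactly the tokens outside the current window.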

Architecturally, RATTENTION interleaves three SLA layers with one full-attention layer.

3. Computational Complexity Analysis

The table below summarizes comparative asymptotic costs, where $n$ = sequence length, $d$ = hidden size, $w$ = window size, and $d' = d$:

  Mechanism              Time Complexity       Memory Complexity
  Full Attention         $O(n^2 d)$            $O(n^2)$
  Sliding-Window (SWA)   $O(nwd)$              $O(nw)$
  SLA (SWA + RLA)        $O(nwd + nd^2)$       $O(nw + d^2 + Cd)$
  • The local part matches SWA cost; the linear part adds $O(nd^2)$ extra time and $O(d^2)$ memory.
  • SLA provides a strict efficiency gain when $w \ll n$ and $d \ll n$; $S$ is a constant-size recurrent state.
  • SLA sits between local and global attention, offering a trade-off unattainable by either independently.
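To make the asymptotics concrete, a quick back-of-the-envelope comparison (the numbers are illustrative, not from the paper):

```python
# Illustrative cost ratio of full attention vs. SLA, using the table's asymptotics.
n, d, w = 32_768, 128, 512        # sequence length, head dim, window size (assumed)
full = n * n * d                  # O(n^2 d) score/value work for full attention
sla = n * w * d + n * d * d       # O(nwd + nd^2) for SLA
print(full / sla)                 # simplifies to n / (w + d) = 51.2
```

The ratio reduces to $n/(w+d)$, so the advantage grows linearly with sequence length while $w$ and $d$ stay fixed.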

4. Kernel and Implementation Considerations

RATTENTION employs two kernel optimizations to ensure SLA’s practicality on accelerators (e.g., TPU, GPU).
