
Semi-Local Attention Mechanism

Updated 10 March 2026
  • Semi-Local Attention (SLA) is a hybrid attention mechanism integrating fixed sliding-window and residual linear streams to capture both local and global contexts.
  • It efficiently scales with sequence length by reducing computational and memory costs compared to full-attention models.
  • Kernel optimizations enable SLA to run effectively on modern accelerators, making it practical for large-scale transformer architectures.

Semi-Local Attention (SLA) is a hybrid attention mechanism that integrates a fixed-size sliding-window (local) attention with a residual linear-attention stream designed to capture contextual information from tokens outside the window. Introduced in the context of the RATTENTION model, SLA addresses intrinsic limitations of local-global attention architectures by enabling efficient context aggregation that scales favorably with sequence length, while maintaining or exceeding the accuracy of full-attention models at drastically reduced computational and memory costs (Wang et al., 18 Jun 2025).

1. Mathematical Structure of Semi-Local Attention

SLA fuses two computations: standard sliding-window attention (“SWA”) and a residual linear attention (“RLA”) stream.

1.1 Sliding-Window Attention

Given queries, keys, and values $Q, K, V \in \mathbb{R}^{n \times d}$, for each position $i \in [1, n]$ compute attention coefficients $\alpha_{i,j}$ over the window $j \in [\max(1, i-w+1), i]$:

$$\alpha_{i,j} = \frac{\exp(Q_i K_j^\top / \sqrt{d})}{\sum_{k=\max(1,i-w+1)}^{i} \exp(Q_i K_k^\top / \sqrt{d})}$$

The local attention output is:

$$y_i^{\mathrm{swa}} = \sum_{j=\max(1,i-w+1)}^{i} \alpha_{i,j} V_j$$
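As an illustrative sketch, the windowed softmax above can be written directly in NumPy (single head, 0-indexed positions; the function name is our own, not from the paper):

```python
import numpy as np

def sliding_window_attention(Q, K, V, w):
    """Causal sliding-window attention: position i attends to [max(0, i-w+1), i]."""
    n, d = Q.shape
    Y = np.zeros_like(V)
    for i in range(n):
        lo = max(0, i - w + 1)
        scores = Q[i] @ K[lo:i + 1].T / np.sqrt(d)  # scores over the window
        alpha = np.exp(scores - scores.max())       # numerically stable softmax
        alpha /= alpha.sum()
        Y[i] = alpha @ V[lo:i + 1]
    return Y
```

The explicit loop mirrors the per-position definition; production kernels instead compute banded score blocks in parallel.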

1.2 Residual Linear Attention

A feature map $\phi: \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$ satisfying $\langle \phi(x), \phi(x') \rangle \approx \exp(x^\top x' / \sqrt{d})$ is employed. In practice, $\phi(x) = \exp(x)$ (elementwise), with RMSNorm. The recurrent state $S_t \in \mathbb{R}^{d' \times d}$ is updated sequentially:

$$S_t = S_{t-1} + \phi(K_t)^\top V_t, \quad S_0 = 0$$

At position $i$, the RLA stream reads from $S_{i-w}$:

$$y_i^{\mathrm{rla}} = \phi(Q_i)\, S_{i-w}$$

For $i \le w$, $S_{i-w} = 0$.
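A minimal sketch of the RLA stream, assuming $\phi = \exp$ and $d' = d$ (the function name and the choice to store every intermediate state are ours, for clarity):

```python
import numpy as np

def residual_linear_attention(Q, K, V, w, phi=np.exp):
    """RLA stream: position i reads S_{i-w}, which summarizes exactly the
    tokens that have already fallen outside the sliding window."""
    n, d = Q.shape
    S = np.zeros((d, d))
    states = [S.copy()]                        # states[t] holds S_t, with S_0 = 0
    for t in range(n):
        S = S + np.outer(phi(K[t]), V[t])      # S_t = S_{t-1} + phi(K_t)^T V_t
        states.append(S.copy())
    Y = np.zeros_like(V)
    for i in range(n):
        if i + 1 > w:                          # 1-indexed i <= w reads S_{i-w} = 0
            Y[i] = phi(Q[i]) @ states[i + 1 - w]
    return Y
```

Keeping every $S_t$ costs $O(nd^2)$ memory here; a real implementation streams a lag-$w$ buffer and keeps only a constant number of states.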

1.3 Fusion

Both streams are fused:

$$\tilde{y}_i = \mathrm{RMSNorm}(y_i^{\mathrm{swa}}) + \mathrm{RMSNorm}(y_i^{\mathrm{rla}}) \in \mathbb{R}^d$$

$\tilde{y}_i$ is projected via $W_O \in \mathbb{R}^{d \times d}$ before the standard Transformer residual and feed-forward blocks.
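The fusion step is a pair of norms and a sum; a sketch, assuming an RMSNorm without a learned gain (both function names are ours):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm without a learned gain, applied along the feature dimension
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def fuse(y_swa, y_rla, W_O):
    # y_tilde = RMSNorm(y_swa) + RMSNorm(y_rla), then the output projection
    return (rms_norm(y_swa) + rms_norm(y_rla)) @ W_O
```

Normalizing each stream separately before summing keeps the local and residual contributions on a comparable scale regardless of window size.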

2. Layerwise Algorithmic Workflow

A single Transformer layer with SLA operates as follows:

  1. Linear projection: $Q = X W_Q$, $K = X W_K$, $V = X W_V$ for layer input $X \in \mathbb{R}^{n \times d}$.
  2. Iteration (across $i = 1, \ldots, n$):
    • Compute sliding-window attention over $[\max(1, i-w+1), i]$.
    • Update $S \gets S + \phi(K_i)^\top V_i$.
    • Read the residual stream $y_i^{\mathrm{rla}} = \phi(Q_i)\, S_{i-w}$ ($0$ if $i \le w$).
    • Fuse into $\tilde{y}_i$ as above.
  3. Output: $Y = \tilde{Y} W_O + X$.
  4. Feed-forward: $Z = \mathrm{SwiGLU}(\mathrm{LN}(Y))\, W_{FF} + Y$.
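Steps 1–3 can be sketched end to end as a single-head NumPy layer (feed-forward omitted; RMSNorm without learned gain; all names are ours, and storing every RLA state is for clarity only):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def sla_layer(X, W_Q, W_K, W_V, W_O, w, phi=np.exp):
    """One SLA pass: projections, per-position SWA + RLA, fusion, residual."""
    n, d = X.shape
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    S = np.zeros((d, d))                  # recurrent RLA state
    past = [S.copy()]                     # past[t] holds S_t
    Y_tilde = np.zeros_like(X)
    for i in range(n):
        # sliding-window attention over the last w positions
        lo = max(0, i - w + 1)
        s = Q[i] @ K[lo:i + 1].T / np.sqrt(d)
        a = np.exp(s - s.max())
        a /= a.sum()
        y_swa = a @ V[lo:i + 1]
        # RLA state update, then delayed read from S_{i-w}
        S = S + np.outer(phi(K[i]), V[i])
        past.append(S.copy())
        y_rla = phi(Q[i]) @ past[i + 1 - w] if i + 1 > w else np.zeros(d)
        Y_tilde[i] = rms_norm(y_swa) + rms_norm(y_rla)
    return Y_tilde @ W_O + X              # output projection + residual
```

Note that the state is updated before the read, but the read targets the state from $w$ steps earlier, so the RLA stream covers exactly the tokens outside the current window.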

Architecturally, RATTENTION interleaves three SLA layers with one full-attention layer.

3. Computational Complexity Analysis

The table below summarizes comparative asymptotic costs, where $n$ = sequence length, $d$ = hidden size, $w$ = window size, and $d' = d$:

  Mechanism              Time Complexity       Memory Complexity
  Full Attention         $O(n^2 d)$            $O(n^2)$
  Sliding-Window (SWA)   $O(nwd)$              $O(nw)$
  SLA (SWA + RLA)        $O(nwd + nd^2)$       $O(nw + d^2 + Cd)$
  • The local part matches SWA cost; the linear part adds $O(nd^2)$ extra time and $O(d^2)$ memory.
  • SLA provides a strict efficiency gain when $w \ll n$ and $d \ll n$; $S$ is a constant-size recurrent state.
  • SLA sits between local and global attention, offering a trade-off unattainable by either independently.
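To make the asymptotics concrete, a quick back-of-the-envelope comparison (the numbers are illustrative, not from the paper):

```python
# Illustrative cost ratio of full attention vs. SLA, using the table's asymptotics.
n, d, w = 32_768, 128, 512        # sequence length, head dim, window size (assumed)
full = n * n * d                  # O(n^2 d) score/value work for full attention
sla = n * w * d + n * d * d       # O(nwd + nd^2) for SLA
print(full / sla)                 # simplifies to n / (w + d) = 51.2
```

The ratio reduces to $n/(w+d)$, so the advantage grows linearly with sequence length while $w$ and $d$ stay fixed.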

4. Kernel and Implementation Considerations

RATTENTION employs two kernel optimizations to ensure SLA’s practicality on accelerators (e.g., TPU, GPU).
