Semi-Local Attention Mechanism
- Semi-Local Attention (SLA) is a hybrid attention mechanism integrating fixed sliding-window and residual linear streams to capture both local and global contexts.
- It efficiently scales with sequence length by reducing computational and memory costs compared to full-attention models.
- Kernel optimizations enable SLA to run effectively on modern accelerators, making it practical for large-scale transformer architectures.
Semi-Local Attention (SLA) is a hybrid attention mechanism that integrates a fixed-size sliding-window (local) attention with a residual linear-attention stream designed to capture contextual information from tokens outside the window. Introduced in the context of the RATTENTION model, SLA addresses intrinsic limitations of local-global attention architectures by enabling efficient context aggregation that scales favorably with sequence length, while maintaining or exceeding the accuracy of full-attention models at drastically reduced computational and memory costs (Wang et al., 18 Jun 2025).
1. Mathematical Structure of Semi-Local Attention
SLA fuses two computations: standard sliding-window attention (“SWA”) and a residual linear attention (“RLA”) stream.
1.1 Sliding-Window Attention
Given queries, keys, and values $Q, K, V \in \mathbb{R}^{T \times d}$, for each position $t$, compute attention coefficients for the window $\mathcal{W}_t = \{\max(1,\, t-w+1), \dots, t\}$:
$$\alpha_{t,i} = \frac{\exp(q_t^\top k_i / \sqrt{d})}{\sum_{j \in \mathcal{W}_t} \exp(q_t^\top k_j / \sqrt{d})}, \quad i \in \mathcal{W}_t.$$
The local attention output:
$$o_t^{\mathrm{SWA}} = \sum_{i \in \mathcal{W}_t} \alpha_{t,i}\, v_i.$$
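The windowed computation can be sketched directly in NumPy. This is a minimal reference loop, not an optimized kernel; the function name `swa` and the per-position loop are illustrative choices, not the paper's implementation:

```python
import numpy as np

def swa(Q, K, V, w):
    """Sliding-window attention: position t attends only to keys in the
    causal window [t-w+1, t].  Q, K, V have shape (T, d)."""
    T, d = Q.shape
    out = np.zeros_like(V)
    for t in range(T):
        lo = max(0, t - w + 1)
        scores = Q[t] @ K[lo:t + 1].T / np.sqrt(d)   # window logits
        alpha = np.exp(scores - scores.max())        # stable softmax
        alpha /= alpha.sum()
        out[t] = alpha @ V[lo:t + 1]                 # convex combination of values
    return out
```

Each position only touches at most `w` keys, which is the source of the $O(Twd)$ cost discussed below.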
1.2 Residual Linear Attention
A feature map $\phi$ with non-negative outputs is employed; in practice, $\phi$ is applied elementwise, with RMSNorm for stabilization. The recurrent state $S_t \in \mathbb{R}^{d \times d}$ is updated sequentially as tokens exit the sliding window:
$$S_t = S_{t-1} + \phi(k_{t-w})\, v_{t-w}^\top \quad (t > w), \qquad S_t = 0 \quad (t \le w).$$
At position $t$, the RLA stream reads from $S_t$:
$$o_t^{\mathrm{RLA}} = S_t^\top \phi(q_t).$$
For $t \le w$, $o_t^{\mathrm{RLA}} = 0$.
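A minimal NumPy sketch of the recurrence, assuming the update/read equations above; the feature map here (`np.abs`) is a placeholder non-negative map, not the paper's exact choice, and RMSNorm is omitted:

```python
import numpy as np

def rla(Q, K, V, w, phi=np.abs):
    """Residual linear attention over tokens that have left the window.
    State S accumulates phi(k_{t-w}) v_{t-w}^T; the read is S^T phi(q_t)."""
    T, d = Q.shape
    S = np.zeros((d, d))              # constant-size recurrent state
    out = np.zeros_like(V)
    for t in range(T):
        if t >= w:                    # token t-w just fell out of the window
            S = S + np.outer(phi(K[t - w]), V[t - w])
        out[t] = S.T @ phi(Q[t])      # zero while the state is empty (t < w)
    return out
```

Note that the state has fixed size $d \times d$ regardless of sequence length, which is what gives the stream its constant-memory property.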
1.3 Fusion
Both streams are fused by summation:
$$o_t = o_t^{\mathrm{SWA}} + o_t^{\mathrm{RLA}},$$
and $o_t$ is projected via the output matrix $W_O$ before the standard Transformer residual and feed-forward blocks.
2. Layerwise Algorithmic Workflow
A single Transformer layer with SLA operates as follows:
- Linear Projection: $Q = XW_Q$, $K = XW_K$, $V = XW_V$ for layer input $X \in \mathbb{R}^{T \times d}$.
- Iteration (across $t = 1, \dots, T$):
  - Compute sliding-window attention $o_t^{\mathrm{SWA}}$ over $\{\max(1,\, t-w+1), \dots, t\}$.
  - Update $S_t = S_{t-1} + \phi(k_{t-w})\, v_{t-w}^\top$ (for $t > w$).
  - Read residual stream $o_t^{\mathrm{RLA}} = S_t^\top \phi(q_t)$ ($0$ if $t \le w$).
  - Fuse: $o_t = o_t^{\mathrm{SWA}} + o_t^{\mathrm{RLA}}$ as above.
- Output: $Y = X + O W_O$, where $O$ stacks the fused outputs $o_t$.
- Feed-forward: $Z = Y + \mathrm{FFN}(Y)$.
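The per-layer workflow can be sketched end-to-end as follows. This is a minimal NumPy reference (function name `sla_layer`, feature map `np.abs`, and single-head shapes are illustrative assumptions); RMSNorm and the feed-forward block are omitted for brevity:

```python
import numpy as np

def sla_layer(X, Wq, Wk, Wv, Wo, w, phi=np.abs):
    """One SLA layer pass: project, then per position fuse the windowed
    attention output with the linear-stream read, then residual-connect."""
    T, d = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # linear projections
    S = np.zeros((d, d))                       # RLA recurrent state
    O = np.zeros_like(V)
    for t in range(T):
        lo = max(0, t - w + 1)
        s = Q[t] @ K[lo:t + 1].T / np.sqrt(d)  # window logits
        a = np.exp(s - s.max())
        a /= a.sum()
        o_swa = a @ V[lo:t + 1]                # local stream
        if t >= w:                             # state update on eviction
            S = S + np.outer(phi(K[t - w]), V[t - w])
        o_rla = S.T @ phi(Q[t])                # zero until tokens leave the window
        O[t] = o_swa + o_rla                   # fusion
    return X + O @ Wo                          # output projection + residual
```

A practical implementation would batch the window computation and run the recurrence in chunks; the sequential loop here mirrors the itemized steps one-to-one.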
Architecturally, RATTENTION interleaves three SLA layers with one full-attention layer.
3. Computational Complexity Analysis
The table below summarizes comparative asymptotic costs, where $T$ = sequence length, $d$ = hidden size, $w$ = window size, $w \ll T$:

| Mechanism | Time Complexity | Memory Complexity |
|---|---|---|
| Full Attention | $O(T^2 d)$ | $O(Td)$ (KV cache) |
| Sliding-Window (SWA) | $O(Twd)$ | $O(wd)$ |
| SLA (SWA + RLA) | $O(Twd + Td^2)$ | $O(wd + d^2)$ |
- The local part matches SWA cost; the linear part adds $O(Td^2)$ extra time and $O(d^2)$ memory for the recurrent state.
- SLA provides a strict efficiency gain when $w \ll T$ and $d \ll T$; $S_t \in \mathbb{R}^{d \times d}$ is a constant-size recurrent state, so memory does not grow with sequence length.
- SLA sits between local and global attention, offering a trade-off unattainable by either independently.
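To make the asymptotics concrete, a back-of-envelope FLOP comparison with illustrative sizes (these constants are chosen for the example, not reported in the paper):

```python
# Illustrative sizes: 32K context, hidden size 4096, window 512.
T, d, w = 32_768, 4_096, 512

full_cost = T * T * d            # O(T^2 d): full attention
swa_cost  = T * w * d            # O(T w d): sliding window only
sla_cost  = T * w * d + T * d * d  # SWA plus the linear-stream update/read

print(f"full vs. SWA: {full_cost / swa_cost:.0f}x")  # ratio is T / w
print(f"full vs. SLA: {full_cost / sla_cost:.1f}x")  # ratio is T / (w + d)
```

The example also shows a practical caveat implied by the table: when $d > w$, the $O(Td^2)$ linear-stream term, not the window, dominates SLA's runtime.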
4. Kernel and Implementation Considerations
RATTENTION employs two kernel optimizations to ensure SLA’s practicality on accelerators (e.g., TPU, GPU):