
Selective & Sliding Tile Attention (SSTA)

Updated 26 November 2025
  • Selective and Sliding Tile Attention (SSTA) is a family of sparse attention mechanisms that partitions inputs into tiles to capture key global and local features.
  • SSTA combines selective attention to salient blocks with sliding-window attention for continuity, providing nearly linear scalability in transformers.
  • Empirical results show SSTA enables significant speedups and memory savings in video diffusion transformers and long-context language models while maintaining high quality.

Selective and Sliding Tile Attention (SSTA) refers to a family of sparse attention mechanisms designed to achieve high computational efficiency and memory savings when scaling transformers to long video or text contexts. SSTA unifies two principles: sparse selection of salient blocks (tiles) to retain global context and a local static sliding-window to preserve short-range continuity. Foundational to SSTA is the empirical insight that, for both video and text, full attention matrices exhibit a strong tile/block structure with most significant attention mass concentrated along the block-diagonal and in a handful of global reference regions. SSTA mechanisms have been incorporated in large-scale video diffusion transformers (DiTs) and long-context LLMs to achieve close-to-linear scaling, substantial wallclock acceleration, and reduced memory, while maintaining or even exceeding the quality of dense-attention baselines (Ding et al., 10 Feb 2025, Wu et al., 24 Nov 2025, Hu et al., 2 Nov 2025).

1. Motivating Redundancy Patterns in Full Attention

In transformers operating over spatiotemporal (e.g., video) or long textual data, attention matrices naturally take the form of large, regularly partitioned blocks ("tiles"). For a video with $T$ frames, each of size $H \times W$, the $N \times N$ attention matrix ($N = T\cdot H\cdot W$) splits into $T\times T$ tiles of size $(H\cdot W)\times (H\cdot W)$. Empirical analysis demonstrates that, after training, the heaviest attention weights aggregate along the main diagonal (tiles corresponding to within-frame interactions) and in a few global reference frames, while off-diagonal tiles are highly repetitive and low-importance. In text, similar block or sequence-local structure emerges, especially under architectures such as NSA which expose local and global context separately (Ding et al., 10 Feb 2025, Hu et al., 2 Nov 2025). This suggests that the useful computation in dense attention is concentrated in a small, structured subset of tiles.
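One way to inspect this tile structure is to sum an attention matrix into a $T \times T$ grid of per-tile mass. The sketch below is illustrative only: the function name `tile_attention_mass` and the synthetic logits (boosted inside each frame) are assumptions made for this example, not data from the cited papers.

```python
import numpy as np

def tile_attention_mass(attn, num_frames):
    """Sum an (N, N) attention matrix into a (T, T) grid of per-tile mass.

    Assumes tokens are ordered frame by frame, so N = T * (H * W) and
    tile (a, b) holds attention from frame a's queries to frame b's keys.
    """
    n = attn.shape[0]
    assert attn.shape == (n, n) and n % num_frames == 0
    per_frame = n // num_frames            # H * W tokens per frame
    tiles = attn.reshape(num_frames, per_frame, num_frames, per_frame)
    return tiles.sum(axis=(1, 3))          # shape (T, T)

# Toy input: T = 4 frames of H * W = 6 tokens, with logits boosted
# inside each frame so that the block diagonal dominates.
rng = np.random.default_rng(0)
T, HW = 4, 6
frame_id = np.repeat(np.arange(T), HW)
logits = rng.normal(size=(T * HW, T * HW))
logits += 4.0 * (frame_id[:, None] == frame_id[None, :])
attn = np.exp(logits - logits.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)    # row-wise softmax

mass = tile_attention_mass(attn, T)
diag_share = np.trace(mass) / mass.sum()   # fraction of mass on diagonal tiles
```

Applied to attention maps from a trained model, a high `diag_share` together with a few heavy columns would correspond to the diagonal-plus-reference-frame pattern described above.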

2. Formal Definitions and Core SSTA Variants

Tile Partitioning

Tokens are mapped to 3D spatiotemporal coordinates (for video) or 1D sequence indices (for text). The full token set is partitioned into non-overlapping blocks ("tiles"):

  • Video: Tiles span contiguous frames or spatial subregions: $(t_\mathrm{tile}, h_\mathrm{tile}, w_\mathrm{tile})$.
  • Text: Tiles are contiguous subsequences of length $B$.
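The partitioning can be sketched as a mapping from token coordinates to a tile index. The helper names `video_tile_index` and `text_tile_index` and the row-major tile ordering are assumptions for illustration; the cited works may order tiles differently.

```python
def video_tile_index(t, h, w, tile, grid):
    """Map a token at (frame t, row h, column w) to a flat tile index.

    tile = (t_tile, h_tile, w_tile) is the tile shape and grid = (T, H, W)
    is the token grid. Assumes each grid dimension is divisible by the
    corresponding tile dimension, and numbers tiles in row-major order.
    """
    (tt, ht, wt), (T, H, W) = tile, grid
    assert T % tt == 0 and H % ht == 0 and W % wt == 0
    tiles_h, tiles_w = H // ht, W // wt
    return ((t // tt) * tiles_h + (h // ht)) * tiles_w + (w // wt)

def text_tile_index(position, block_size):
    """Map a 1D sequence position to its tile (block) index."""
    return position // block_size

first = video_tile_index(0, 0, 0, tile=(2, 16, 16), grid=(4, 32, 32))
last = video_tile_index(3, 31, 31, tile=(2, 16, 16), grid=(4, 32, 32))
text_tile = text_tile_index(35, block_size=16)
```

With a $(4, 32, 32)$ token grid and $(2, 16, 16)$ tiles there are $2 \cdot 2 \cdot 2 = 8$ tiles, so the first and last tokens fall in tiles 0 and 7.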

Selective Tile Attention (STA)

STA restricts attention to a sparse subset of tiles:

  • All main-diagonal tiles (self and within-frame/within-block).
  • A fixed set of $k$ "global reference" blocks, either pre-selected or dynamically chosen by a scoring function.

Writing $t_i$ for the tile index of token $i$ and $G$ for the set of global reference tiles, the mask for query $i$ and key $j$ is:

$$M_\mathrm{select}(i, j) = \begin{cases} 0 & \text{if } t_j = t_i \text{ or } t_j \in G \\ -\infty & \text{otherwise} \end{cases}$$

The sparse attention is:

$$\text{Attention}_\mathrm{sel}(Q, K, V) = \mathrm{softmax}\left( \frac{QK^\top}{\sqrt{d}} + M_\mathrm{select} \right)V$$

(Ding et al., 10 Feb 2025)
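A minimal dense reference for the selective mask and the masked softmax can be sketched in NumPy. The function names and the toy sizes are assumptions for this example. Because it materializes the full score matrix, it illustrates the semantics only and yields none of the speedup of a block-sparse kernel.

```python
import numpy as np

def selective_mask(tile_ids, global_tiles):
    """Additive mask: 0 where key j is in query i's tile or in G, else -inf."""
    same_tile = tile_ids[:, None] == tile_ids[None, :]
    in_global = np.isin(tile_ids, list(global_tiles))[None, :]
    return np.where(same_tile | in_global, 0.0, -np.inf)

def masked_attention(q, k, v, mask):
    """softmax(q k^T / sqrt(d) + mask) v, computed densely for clarity."""
    scores = q @ k.T / np.sqrt(q.shape[-1]) + mask
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)                       # exp(-inf) = 0 for masked keys
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
tile_ids = np.repeat(np.arange(4), 3)              # 4 tiles of 3 tokens each
q, k, v = (rng.normal(size=(12, 8)) for _ in range(3))
mask = selective_mask(tile_ids, global_tiles={0})
out, weights = masked_attention(q, k, v, mask)
```

Every query keeps its own tile, so no row of the mask is entirely $-\infty$ and the softmax is always well defined. In this toy setup a query in tile 1 puts zero weight on tile 2 but nonzero weight on the global tile 0.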

Sliding Tile Attention (SlTA)

To capture local context missed by global selection, SlTA augments STA by enabling attention within a sliding temporal or spatial window (e.g., $w$ adjacent frames or blocks). The corresponding mask is

$$M_\mathrm{slide}(i, j) = \begin{cases} 0 & \text{if } t_j = t_i \text{ or } |t_j - t_i| \leq \lfloor w/2 \rfloor \text{ or } t_j \in G \\ -\infty & \text{otherwise} \end{cases}$$

(Ding et al., 10 Feb 2025, Wu et al., 24 Nov 2025, Hu et al., 2 Nov 2025)
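The sliding mask can be sketched the same way for a one-dimensional tile index. The function name `sliding_mask` is an assumption for this example. For video, the window test would be applied per axis on 3D tile coordinates, which this sketch does not cover.

```python
import numpy as np

def sliding_mask(tile_ids, window, global_tiles=()):
    """Additive mask: 0 if |t_j - t_i| <= window // 2 or t_j in G, else -inf."""
    distance = np.abs(tile_ids[:, None] - tile_ids[None, :])
    in_window = distance <= window // 2    # distance 0 covers the t_j == t_i case
    in_global = np.isin(tile_ids, list(global_tiles))[None, :]
    return np.where(in_window | in_global, 0.0, -np.inf)

tile_ids = np.repeat(np.arange(6), 2)      # 6 tiles of 2 tokens each
mask = sliding_mask(tile_ids, window=3, global_tiles=(0,))

# Tiles visible to token 8, which lives in tile 4.
visible_tiles = sorted(set(tile_ids[np.isfinite(mask[8])].tolist()))
```

With `window=3` the radius is $\lfloor 3/2 \rfloor = 1$, so a query in tile 4 sees tiles 3, 4, and 5, plus the global tile 0.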

3. Block-Level and Hybrid SSTA Implementations

SSTA can be unified and generalized beyond simple spatiotemporal tiles:

  • Block Partition and Pooling: For arbitrary tensors, tokens are partitioned into blocks of size $N$; each block is pooled (e.g., adaptive average) to a $d$-dim vector.
  • Block Importance Scoring: Inter-block importance is computed via pooled features and redundancy measures; query blocks select their top-$k$ key blocks.
  • Hybrid Local+Global Mask: The final mask is the intersection (logical and) of learned selective (top-$k$) and static sliding window masks:

$$M_\mathrm{comb} = M_\mathrm{sel} \land M_\mathrm{sta}$$

Efficient block-sparse kernels (e.g., flex_block_attention, CUDA-optimized) are used for implementation; SSTA is modular and can drop into standard transformer blocks (Wu et al., 24 Nov 2025).
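A hand-sized example may help make the mask combination concrete. The 4-block selective mask below is hypothetical, chosen only to show how the logical AND prunes selected blocks that fall outside the sliding window.

```python
import numpy as np

# Hypothetical 4-block example: rows are query blocks, columns are key blocks.
m_sel = np.array([[1, 0, 1, 0],
                  [1, 1, 0, 0],
                  [0, 1, 1, 1],
                  [1, 0, 0, 1]], dtype=bool)       # top-k selection per query block

idx = np.arange(4)
m_sta = np.abs(idx[:, None] - idx[None, :]) <= 1   # sliding window of radius 1

m_comb = m_sel & m_sta                             # logical AND, as in the formula above
kept_fraction = m_comb.sum() / m_comb.size         # share of block pairs still computed
```

Here 7 of the 16 block pairs survive, so a block-sparse kernel would skip the remaining 9.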

4. Algorithmic Structure and Pseudocode

A typical forward pass for SSTA:

Q_b, K_b, V_b = split_into_blocks(Q, K, V, tile)    # shape [h, B, N, d]
Q_bar = adaptive_avg_pool(Q_b, out_size=1)          # [h, B, d]
K_bar = adaptive_avg_pool(K_b, out_size=1)          # [h, B, d]
S_s = einsum('h i d, h j d -> h i j', Q_bar, K_bar) # block similarity
R   = compute_redundancy(K_b)                       # intra-block redundancy
S_i = λ * S_s - β * R                               # block importance scores
idx = topk(S_i, k, dim=-1)                          # indices of selected blocks
M_sel = index_to_mask(idx, B)                       # select mask [h, B, B]
M_sta = make_sliding_mask(B, tile, window)          # local mask [B, B]
M = M_sel & M_sta.unsqueeze(0)                      # block mask [h, B, B]
O = flex_block_attention(Q, K, V, block_mask=M)
(Wu et al., 24 Nov 2025)
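The pseudocode relies on helpers that are not defined here. A dense, single-head NumPy reference can be sketched under explicit assumptions: the redundancy measure (mean cosine similarity between a key block's tokens and its pooled vector), the default weights `lam` and `beta`, and the step that forces the diagonal block on are all choices made for this example, not the authors' definitions. The diagonal is kept, following the STA definition, so that no query block is left without keys.

```python
import numpy as np

def unit(x):
    """L2-normalize along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def ssta_forward(q, k, v, block, top_k, window, lam=1.0, beta=0.5):
    """Dense single-head reference; q, k, v have shape (L, d), L % block == 0."""
    length, dim = q.shape
    num_blocks = length // block
    q_bar = q.reshape(num_blocks, block, dim).mean(axis=1)   # pooled queries (B, d)
    k_blocks = k.reshape(num_blocks, block, dim)
    k_bar = k_blocks.mean(axis=1)                            # pooled keys (B, d)

    similarity = q_bar @ k_bar.T                             # block similarity (B, B)
    # ASSUMPTION: redundancy of a key block = mean cosine similarity between
    # its tokens and its pooled vector. The source does not define this term.
    redundancy = (unit(k_blocks) * unit(k_bar)[:, None, :]).sum(-1).mean(-1)
    importance = lam * similarity - beta * redundancy[None, :]

    top = np.argsort(-importance, axis=-1)[:, :top_k]        # top-k key blocks per row
    m_sel = np.zeros((num_blocks, num_blocks), dtype=bool)
    np.put_along_axis(m_sel, top, True, axis=-1)

    idx = np.arange(num_blocks)
    m_sta = np.abs(idx[:, None] - idx[None, :]) <= window // 2
    # ASSUMPTION: keep the diagonal so every query block sees its own keys.
    block_mask = (m_sel & m_sta) | np.eye(num_blocks, dtype=bool)

    # Expand the block mask to token level and run masked softmax attention.
    token_mask = np.repeat(np.repeat(block_mask, block, 0), block, 1)
    scores = np.where(token_mask, q @ k.T / np.sqrt(dim), -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, block_mask

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(32, 16)) for _ in range(3))
out, block_mask = ssta_forward(q, k, v, block=4, top_k=3, window=3)
```

This reference still computes the full score matrix, so it is useful for checking a sparse kernel's output rather than for speed.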

Language SSTA architectures alternate tile (sliding-window/MLA) and global (compression+selection/GLA) layers, further optimizing to halve the KV-cache footprint (Hu et al., 2 Nov 2025).

5. Complexity and Memory Analysis

| Mechanism | Compute Complexity | Memory (KV-cache) Reduction |
|---|---|---|
| Full Attention | $O(N^2)$ or $O(L^2)$ | None |
| STA | $O(N\,k)$ or $O(BNk)$ | Linear in tokens ($k \ll N$) |
| SlTA | $O(N\,w)$ or $O(BNw)$ | Linear in tokens ($w \ll N$) |
| SSTA (Hybrid) | $O(N(k+w))$ or $O(BN(k+\lvert W_\mathrm{sta}\rvert))$ | Up to $2\times$ memory reduction via alternation |

Dense attention incurs quadratic compute/memory bottlenecks. SSTA mechanisms achieve nearly-linear complexity with respect to total context length, enabling scaling to hundreds of frames or 8K+ token contexts. Halving of KV-cache is achieved by alternating tile/global layers or separating mask storage (Ding et al., 10 Feb 2025, Hu et al., 2 Nov 2025, Wu et al., 24 Nov 2025).
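The scaling difference can be illustrated by counting query-key pairs. The sketch below treats $k$ and $w$ as tile counts and multiplies by the tile size to get token pairs. It is an upper bound that assumes selected and local tiles do not overlap and ignores truncation at sequence edges. The function names and the sample sizes are assumptions for this example.

```python
def dense_pairs(num_tokens):
    """Query-key pairs scored by full attention: N * N."""
    return num_tokens * num_tokens

def sparse_pairs(num_tokens, tile_size, k, window):
    """Upper bound on pairs when each query sees k selected plus `window` local tiles."""
    return num_tokens * (k + window) * tile_size

n, tile = 65_536, 64                       # hypothetical context length and tile size
dense = dense_pairs(n)
sparse = sparse_pairs(n, tile, k=8, window=3)
ratio = dense / sparse                     # how many times fewer pairs are scored
```

For these sample values the sparse count is roughly 93 times smaller than the dense one. Doubling `n` doubles the sparse count but quadruples the dense count, which is the near-linear behaviour summarized in the table.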

6. Hyperparameters, Implementation, and Empirical Results

Notable Choices (as reported across benchmarks):

  • Tile/block size: $(2~\mathrm{frames}, 16, 16)$ for video, giving $N = 2 \cdot 16 \cdot 16 = 512$ tokens per block; block size $B = 16$–$32$ for text.
  • Sliding window radius: $(1,1,1)$ for video (i.e., $3\times3\times3 = 27$ blocks); window $s = 512$ tokens for text.
  • Top-$k$ selection: $k = 32$ per head for video models; text models select $K_\mathrm{sel} = 64$–$128$ per global layer.
  • Pooling: Adaptive average (video); block-reduction or small MLP (text).
  • Training: SSTA weights often initialized/distilled from dense-attention checkpoints to minimize quality degradation (Wu et al., 24 Nov 2025).
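The video settings above can be collected in a small configuration object that also derives the window size and tokens per tile. The class name `VideoSSTAConfig` and its field names are hypothetical and do not correspond to any released API.

```python
from dataclasses import dataclass
from math import prod

@dataclass(frozen=True)
class VideoSSTAConfig:
    """Hypothetical container for the video hyperparameters listed above."""
    tile: tuple = (2, 16, 16)            # (frames, height, width) per tile
    window_radius: tuple = (1, 1, 1)     # radius in tiles along each axis
    top_k: int = 32                      # selected key blocks per head

    @property
    def window_blocks(self):
        """Blocks inside the sliding window: product of (2 * r + 1) per axis."""
        return prod(2 * r + 1 for r in self.window_radius)

    @property
    def tokens_per_tile(self):
        """Tokens in one tile: frames * height * width."""
        return prod(self.tile)

cfg = VideoSSTAConfig()
window_blocks = cfg.window_blocks        # 27 for radius (1, 1, 1)
tokens_per_tile = cfg.tokens_per_tile    # 512 for a (2, 16, 16) tile
```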

Performance Highlights

On 29-frame, 720p video generation, Efficient-vDiT with SSTA improves inference speed by $7.8\times$ (from 9 min to under 1.3 min) with under 1% drop in VBench score (Ding et al., 10 Feb 2025). HunyuanVideo 1.5 achieves $1.87\times$ step speedup and reduces GPU memory by up to 20% on 121–241 frame 720p T2V, enabling single-4090 inference for 121-frame, 720p outputs (13.6GB peak) (Wu et al., 24 Nov 2025). For language, SSTA halves KV-cache memory, matches or exceeds full- and branchwise NSA on common-sense reasoning, retrieval, and long-context understanding tasks over Llama-like backbones (Hu et al., 2 Nov 2025).

7. Applications, Practical Impact, and Comparative Efficacy

SSTA applies broadly to transformer architectures that process ultra-long contexts, specifically the video diffusion transformers and long-context language models discussed above.

A plausible implication is that SSTA forms a principled, generalizable approach to sparse attention, blending adaptivity and locality, and is now integrated in state-of-the-art open-source systems.
