Selective & Sliding Tile Attention (SSTA)

Updated 26 November 2025

Selective and Sliding Tile Attention (SSTA) is a family of sparse attention mechanisms that partitions inputs into tiles to capture key global and local features.
SSTA combines selective attention to salient blocks with sliding-window attention for continuity, providing nearly linear scalability in transformers.
Empirical results show SSTA enables significant speedups and memory savings in video diffusion transformers and long-context language models while maintaining high quality.

Selective and Sliding Tile Attention (SSTA) refers to a family of sparse attention mechanisms designed to reduce compute and memory when scaling transformers to long video or text contexts. SSTA unifies two principles: sparse selection of salient blocks (tiles) to retain global context, and a static local sliding window to preserve short-range continuity. It rests on the empirical observation that, for both video and text, full attention matrices exhibit a strong tile/block structure, with most of the attention mass concentrated along the block diagonal and in a handful of global reference regions. SSTA mechanisms have been incorporated in large-scale video diffusion transformers (DiTs) and long-context LLMs to achieve close-to-linear scaling, substantial wall-clock acceleration, and reduced memory, while maintaining or even exceeding the quality of dense-attention baselines (Ding et al., 10 Feb 2025, Wu et al., 24 Nov 2025, Hu et al., 2 Nov 2025).

1. Motivating Redundancy Patterns in Full Attention

In transformers operating over spatiotemporal (e.g., video) or long textual data, attention matrices naturally decompose into large, regularly partitioned blocks, or "tiles." For a video with $T$ frames, each of size $H \times W$, the $N \times N$ attention matrix ($N = T\cdot H\cdot W$) splits into $T\times T$ tiles of size $(H\cdot W)\times (H\cdot W)$. Empirical analysis shows that, after training, the heaviest attention weights aggregate along the main diagonal (tiles corresponding to within-frame interactions) and in a few global reference frames, while off-diagonal tiles are highly repetitive and of low importance. In text, a similar block or sequence-local structure emerges, especially under architectures such as Native Sparse Attention (NSA), which expose local and global context separately (Ding et al., 10 Feb 2025, Hu et al., 2 Nov 2025). This suggests that the useful work in dense attention is concentrated in a small, structured subset of tiles.
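The tile arithmetic above can be checked with a small sketch. It assumes tokens are flattened frame by frame (frame-major order), which the text does not state explicitly; the function names and the sizes are illustrative only.

```python
def frame_of(token, h, w):
    """Frame index of a flattened token, assuming frame-major ordering."""
    return token // (h * w)


def block_diagonal_fraction(t, h, w):
    """Fraction of the N x N attention entries lying in the T diagonal tiles."""
    n = t * h * w
    diagonal_entries = t * (h * w) ** 2   # T tiles of (H*W) x (H*W) entries
    return diagonal_entries / (n * n)     # simplifies to 1 / T


# Illustrative sizes, not taken from the cited papers.
T, H, W = 8, 4, 4
N = T * H * W
print(N, frame_of(37, H, W), block_diagonal_fraction(T, H, W))
```

The within-frame tiles cover only a $1/T$ share of the full matrix, which is why restricting attention to them plus a few reference frames removes most of the dense computation.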

2. Formal Definitions and Core SSTA Variants

Tile Partitioning

Tokens are mapped to 3D spatiotemporal coordinates (for video) or 1D sequence indices (for text). The full token set is partitioned into non-overlapping blocks ("tiles"):

Video: Tiles span contiguous frames or spatial subregions of size $(t_\mathrm{tile}, h_\mathrm{tile}, w_\mathrm{tile})$.
Text: Tiles are contiguous subsequences of length $B$.
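As a minimal sketch of the partitioning just described, the helpers below map a token to its tile. The function names, the row-major tile numbering, and the assumption that tile sizes divide the grid evenly are illustrative choices, not details from the cited papers.

```python
from collections import Counter


def video_tile_id(t, h, w, grid, tile):
    """Flat tile id of token (t, h, w).

    grid = (T, H, W) token grid; tile = (t_tile, h_tile, w_tile).
    Assumes each tile size divides the corresponding grid size.
    """
    _, H, W = grid
    tt, th, tw = tile
    tiles_h, tiles_w = H // th, W // tw
    return ((t // tt) * tiles_h + (h // th)) * tiles_w + (w // tw)


def text_tile_id(i, block):
    """Tile id of sequence position i for contiguous tiles of length `block`."""
    return i // block


grid, tile = (4, 8, 8), (2, 4, 4)   # illustrative sizes
counts = Counter(
    video_tile_id(t, h, w, grid, tile)
    for t in range(grid[0])
    for h in range(grid[1])
    for w in range(grid[2])
)
print(len(counts), set(counts.values()))
```

Every token lands in exactly one tile, and all tiles hold the same number of tokens, which is what lets block-sparse kernels treat each tile as a unit.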

Selective Tile Attention (STA)

STA restricts attention to a sparse subset of tiles:

All main-diagonal tiles (self and within-frame/within-block).
A set $G$ of $k$ "global reference" tiles, either pre-selected or chosen dynamically by a scoring function.

Writing $t_i$ for the tile index of token $i$, the mask for query $i$ and key $j$ is:

$M_\mathrm{select}(i, j) = \begin{cases} 0 & \text{if } t_j = t_i \text{ or } t_j \in G \\ -\infty & \text{otherwise} \end{cases}$

The sparse attention is:

$\text{Attention}_\mathrm{sel}(Q, K, V) = \mathrm{softmax}\left( \frac{QK^\top}{\sqrt{d}} + M_\mathrm{select} \right)V$

(Ding et al., 10 Feb 2025)
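The selective mask and the masked softmax can be sketched in plain Python as follows. This is a single-head, list-based illustration under the definitions above, not the implementation from the cited work; `select_mask` and `masked_attention` are hypothetical names.

```python
import math

NEG_INF = float("-inf")


def select_mask(tile_of, global_tiles):
    """Additive mask: 0 if the key shares the query's tile or sits in a global tile."""
    n = len(tile_of)
    return [
        [0.0 if tile_of[j] == tile_of[i] or tile_of[j] in global_tiles else NEG_INF
         for j in range(n)]
        for i in range(n)
    ]


def masked_attention(Q, K, V, mask):
    """softmax(Q K^T / sqrt(d) + mask) V on lists of row vectors."""
    d = len(Q[0])
    out = []
    for q, mask_row in zip(Q, mask):
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) + m
                  for k, m in zip(K, mask_row)]
        top = max(logits)                              # finite: own tile is always visible
        weights = [math.exp(x - top) for x in logits]  # exp(-inf) == 0.0 for masked keys
        total = sum(weights)
        out.append([sum(wt * v[c] for wt, v in zip(weights, V)) / total
                    for c in range(len(V[0]))])
    return out


tile_of = [0, 0, 1, 1, 2, 2]            # three tiles of two tokens each
M = select_mask(tile_of, global_tiles={0})
Q = [[0.0, 0.0] for _ in tile_of]       # zero queries and keys give uniform weights
K = [[0.0, 0.0] for _ in tile_of]
V = [[1.0], [1.0], [100.0], [100.0], [3.0], [3.0]]
out = masked_attention(Q, K, V, M)
print(out[4])
```

With zero queries and keys the weights are uniform over visible keys, so the token in tile 2 averages only its own tile and the global tile 0; the large values in tile 1 do not leak into its output.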

Sliding Tile Attention (SlTA)

To capture local context missed by global selection, SlTA augments STA by enabling attention within a sliding temporal or spatial window (e.g., $w$ adjacent frames or blocks). The corresponding mask is

$M_\mathrm{slide}(i, j) = \begin{cases} 0 & \text{if } t_j = t_i \text{ or } |t_j - t_i| \leq \lfloor w/2 \rfloor \text{ or } t_j \in G \\ -\infty & \text{otherwise} \end{cases}$

(Ding et al., 10 Feb 2025, Wu et al., 24 Nov 2025, Hu et al., 2 Nov 2025)
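A minimal sketch of $M_\mathrm{slide}$ for one-dimensional tile indices follows. The helper names are hypothetical, and the example sizes are chosen only to make the visible set easy to read.

```python
NEG_INF = float("-inf")


def slide_mask(tile_of, window, global_tiles=frozenset()):
    """Additive mask: 0 if the key tile is within floor(window / 2) tiles of the
    query tile or is a global tile. The same-tile case is covered by distance 0."""
    half = window // 2
    n = len(tile_of)
    return [
        [0.0 if abs(tile_of[j] - tile_of[i]) <= half or tile_of[j] in global_tiles
         else NEG_INF
         for j in range(n)]
        for i in range(n)
    ]


def visible_keys(mask_row):
    """Indices of keys a query may attend to."""
    return [j for j, m in enumerate(mask_row) if m == 0.0]


tile_of = [i // 2 for i in range(12)]   # six tiles of two tokens each
mask = slide_mask(tile_of, window=3, global_tiles={0})
print(visible_keys(mask[8]))
```

Token 8 sits in tile 4, so with $w = 3$ it sees tiles 3 to 5 plus the global tile 0, while the number of visible keys per query stays bounded as the sequence grows.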

3. Block-Level and Hybrid SSTA Implementations

SSTA can be unified and generalized beyond simple spatiotemporal tiles:

Block Partition and Pooling: For arbitrary tensors, tokens are partitioned into blocks of size $N$; each block is pooled (e.g., adaptive average) to a $d$-dimensional vector.
Block Importance Scoring: Inter-block importance is computed via pooled features and redundancy measures; query blocks select their top-$k$ key blocks.
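Hybrid Local+Global Mask: The final mask is the union (logical OR) of the learned selective (top-$k$) mask and the static sliding-window mask, so a key block is visible if it is either selected or inside the local window. This matches the "or" conditions in $M_\mathrm{slide}$ and the $O(N(k+w))$ cost in Section 5:

$M_\mathrm{comb} = M_\mathrm{sel} \lor M_\mathrm{sta}$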

Efficient block-sparse kernels (e.g., flex_block_attention, CUDA-optimized) are used for implementation; SSTA is modular and can drop into standard transformer blocks (Wu et al., 24 Nov 2025).
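The block-level steps can be sketched on plain lists as below. Two assumptions are made: the redundancy term is omitted, so scores are just dot products of mean-pooled blocks, and the two masks are merged with a logical OR, the reading consistent with $M_\mathrm{slide}$ and the $O(N(k+w))$ cost in Section 5 (swap `or` for `and` in `combine` for a strict intersection). All function names are hypothetical.

```python
def mean_pool_blocks(X, block):
    """Average each run of `block` consecutive token vectors into one vector."""
    d = len(X[0])
    return [[sum(row[c] for row in X[s:s + block]) / block for c in range(d)]
            for s in range(0, len(X), block)]


def block_scores(Qp, Kp):
    """Dot-product similarity between pooled query and key blocks."""
    return [[sum(a * b for a, b in zip(q, k)) for k in Kp] for q in Qp]


def topk_mask(scores, k):
    """Boolean mask keeping the k highest-scoring key blocks per query block."""
    mask = []
    for row in scores:
        keep = set(sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k])
        mask.append([j in keep for j in range(len(row))])
    return mask


def sliding_mask(num_blocks, radius):
    """Boolean mask keeping key blocks within `radius` of the query block."""
    return [[abs(i - j) <= radius for j in range(num_blocks)]
            for i in range(num_blocks)]


def combine(m_sel, m_sta):
    """Elementwise OR: a block is visible if selected or local."""
    return [[a or b for a, b in zip(ra, rb)] for ra, rb in zip(m_sel, m_sta)]


X = [[1.0], [1.0], [2.0], [2.0], [3.0], [3.0], [4.0], [4.0]]  # toy Q = K
pooled = mean_pool_blocks(X, block=2)
m_sel = topk_mask(block_scores(pooled, pooled), k=1)
m_comb = combine(m_sel, sliding_mask(len(pooled), radius=1))
print(m_comb[0])
```

In this toy input every query block selects the last key block, so block 0 ends up seeing its local neighbours and the distant block 3, while block 2 remains invisible to it.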

4. Algorithmic Structure and Pseudocode

A typical forward pass for SSTA:

Q_b, K_b, V_b = split_into_blocks(Q, K, V, tile)      # each [h, B, N, d]: heads, blocks, tokens per block, dim
Q_bar = adaptive_avg_pool(Q_b, out_size=1)            # pooled query blocks [h, B, d]
K_bar = adaptive_avg_pool(K_b, out_size=1)            # pooled key blocks   [h, B, d]
S_s = einsum('h i d, h j d -> h i j', Q_bar, K_bar)   # block similarity [h, B, B]
R   = compute_redundancy(K_b)                         # intra-block redundancy
S_i = λ * S_s - β * R                                 # block importance scores
idx = topk(S_i, k, dim=-1)                            # indices of the k selected key blocks
M_sel = index_to_mask(idx, B)                         # selective mask [h, B, B]
M_sta = make_sliding_mask(B, tile, window)            # local mask [B, B]
M = M_sel | M_sta.unsqueeze(0)                        # block mask [h, B, B]: selected or local
O = flex_block_attention(Q, K, V, block_mask=M)       # block-sparse attention output

(Wu et al., 24 Nov 2025)
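For testing a fused kernel against something simple, a slow reference of the final step can help. The sketch below is a single-head, pure-Python stand-in that assumes 1D contiguous blocks and a boolean block mask with at least one visible block per row; `block_sparse_attention` is a hypothetical name and is not the `flex_block_attention` kernel mentioned above.

```python
import math


def block_sparse_attention(Q, K, V, block_mask, block):
    """Each query attends only to keys in blocks marked True in its block's row.

    Masked blocks are skipped entirely, which is where the savings come from.
    Assumes every row of block_mask has at least one True entry.
    """
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        visible = [j
                   for b, keep in enumerate(block_mask[i // block]) if keep
                   for j in range(b * block, (b + 1) * block)]
        logits = [sum(a * c for a, c in zip(q, K[j])) / math.sqrt(d) for j in visible]
        top = max(logits)
        weights = [math.exp(x - top) for x in logits]
        total = sum(weights)
        out.append([sum(wt * V[j][c] for wt, j in zip(weights, visible)) / total
                    for c in range(len(V[0]))])
    return out


Q = [[0.0, 0.0] for _ in range(4)]      # zero logits give uniform weights
K = [[0.0, 0.0] for _ in range(4)]
V = [[1.0], [3.0], [5.0], [7.0]]
block_mask = [[True, False],            # query block 0 sees key block 0 only
              [True, True]]             # query block 1 sees both key blocks
out = block_sparse_attention(Q, K, V, block_mask, block=2)
print(out)
```

Queries in block 0 average only the first two values, while queries in block 1 average all four, matching what an additive $-\infty$ mask over the dense matrix would produce.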

Language SSTA architectures alternate tile layers (sliding-window, MLA) and global layers (compression plus selection, GLA), an arrangement reported to halve the KV-cache footprint (Hu et al., 2 Nov 2025).

5. Complexity and Memory Analysis

Mechanism	Compute Complexity	Memory (KV-cache) Reduction
Full Attention	$O(N^2)$ or $O(L^2)$	None
STA	$O(N\,k)$ or $O(BNk)$	Linear in tokens ( $k\ll N$ )
SlTA	$O(N\,w)$ or $O(BNw)$	Linear in tokens ( $w\ll N$ )
SSTA (Hybrid)	$O(N(k+w))$ or $O(BN(k+|W_\mathrm{sta}|))$	Up to $2\times$ memory reduction via alternation

Dense attention incurs quadratic compute and memory costs. SSTA mechanisms achieve nearly linear complexity in the total context length, enabling scaling to hundreds of frames or 8K+ token contexts. The KV-cache is halved by alternating tile/global layers or separating mask storage (Ding et al., 10 Feb 2025, Hu et al., 2 Nov 2025, Wu et al., 24 Nov 2025).
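The scaling claim can be made concrete by counting query-key pairs. The sketch below treats $k$ and $w$ as numbers of key blocks per query and gives an upper bound, since selected and local blocks may overlap. The block size, $k$, and window are the video values quoted in Section 6; the context length is an illustrative assumption.

```python
def full_pairs(n):
    """Query-key pairs scored by dense attention over n tokens."""
    return n * n


def ssta_pairs_upper_bound(n, block, k, w):
    """Upper bound on scored pairs when each query sees at most k + w key blocks."""
    per_query = min(n, (k + w) * block)
    return n * per_query


block, k, w = 512, 32, 27        # tokens per block, top-k blocks, 3x3x3 window
n = 256 * block                  # illustrative context of 256 blocks
ratio = full_pairs(n) / ssta_pairs_upper_bound(n, block, k, w)
print(round(ratio, 2))
```

Doubling `n` quadruples `full_pairs` but only doubles the sparse bound once the context exceeds $(k+w)$ blocks, which is the sense in which the cost is nearly linear.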

6. Hyperparameters, Implementation, and Empirical Results

Notable Choices (as reported across benchmarks):

Tile/block size: video uses $(2~\mathrm{frames}, 16, 16)$, i.e., $N = 2\cdot16\cdot16 = 512$ tokens per block; text uses block size $B=16$–$32$.
Sliding window radius: video uses $(1,1,1)$ blocks (i.e., a $3\times3\times3 = 27$-block neighborhood); text uses a window of $s=512$ tokens.
Top-$k$ selection: video models use $k=32$ per head; text models select $K_\mathrm{sel}=64$–$128$ per global layer.
Pooling: Adaptive average (video); block-reduction or small MLP (text).
Training: SSTA weights often initialized/distilled from dense-attention checkpoints to minimize quality degradation (Wu et al., 24 Nov 2025).
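The video tile and window numbers above can be checked with a short sketch. It enumerates the blocks within a given radius on a 3D block grid and clips at the borders; the grid size and the function name are illustrative assumptions.

```python
from itertools import product


def window_blocks(center, grid, radius=(1, 1, 1)):
    """Block coordinates within `radius` of `center` on a (T, H, W) block grid,
    clipped at the grid borders."""
    ranges = [range(max(0, c - r), min(g, c + r + 1))
              for c, g, r in zip(center, grid, radius)]
    return list(product(*ranges))


tokens_per_tile = 2 * 16 * 16            # (2 frames, 16, 16) tile
grid = (4, 5, 5)                         # illustrative block grid
interior = window_blocks((1, 2, 2), grid)
corner = window_blocks((0, 0, 0), grid)
print(tokens_per_tile, len(interior), len(corner))
```

An interior block sees the full 27-block neighborhood, while blocks on the border see fewer, so 27 is the maximum local window per query block rather than a fixed count.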

Performance Highlights

Efficient-vDiT: On 29-frame, 720p video generation, Efficient-vDiT with SSTA improves inference speed by $7.8\times$ (from 9 min to under 1.3 min) with under 1% drop in VBench score (Ding et al., 10 Feb 2025).
HunyuanVideo 1.5: Achieves a $1.87\times$ per-step speedup and reduces GPU memory by up to 20% on 121–241 frame 720p text-to-video (T2V) generation, enabling single-4090 inference for 121-frame, 720p outputs (13.6 GB peak) (Wu et al., 24 Nov 2025).
Language: SSTA halves KV-cache memory and matches or exceeds full attention and branch-wise NSA on common-sense reasoning, retrieval, and long-context understanding tasks over Llama-like backbones (Hu et al., 2 Nov 2025).

7. Applications, Practical Impact, and Comparative Efficacy

SSTA has wide applicability in transformer architectures processing ultra-long contexts, specifically:

Video diffusion transformers (Efficient-vDiT, HunyuanVideo 1.5) (Ding et al., 10 Feb 2025, Wu et al., 24 Nov 2025)
Long-context LLMs with explicit alternation between tile/global layers (NSA and SSTA-enhanced) (Hu et al., 2 Nov 2025)
Any large-scale model requiring efficient, scalable attention with minimal parameter/engineering overhead.

A plausible implication is that SSTA forms a principled, generalizable approach to sparse attention, blending adaptivity and locality, and is now integrated in state-of-the-art open-source systems.

References

"Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile" (Ding et al., 10 Feb 2025)
"HunyuanVideo 1.5 Technical Report" (Wu et al., 24 Nov 2025)
"Optimizing Native Sparse Attention with Latent Attention and Local Global Alternating Strategies" (Hu et al., 2 Nov 2025)
