Selective & Sliding Tile Attention (SSTA)
- Selective and Sliding Tile Attention (SSTA) is a family of sparse attention mechanisms that partitions inputs into tiles to capture key global and local features.
- SSTA combines selective attention to salient blocks with sliding-window attention for continuity, providing nearly linear scalability in transformers.
- Empirical results show SSTA enables significant speedups and memory savings in video diffusion transformers and long-context language models while maintaining high quality.
Selective and Sliding Tile Attention (SSTA) refers to a family of sparse attention mechanisms designed to achieve high computational efficiency and memory savings when scaling transformers to long video or text contexts. SSTA unifies two principles: sparse selection of salient blocks (tiles) to retain global context, and a static local sliding window to preserve short-range continuity. SSTA rests on the empirical observation that, for both video and text, full attention matrices exhibit a strong tile/block structure, with most of the significant attention mass concentrated along the block diagonal and in a handful of global reference regions. SSTA mechanisms have been incorporated in large-scale video diffusion transformers (DiTs) and long-context LLMs to achieve close-to-linear scaling, substantial wall-clock acceleration, and reduced memory, while maintaining or even exceeding the quality of dense-attention baselines (Ding et al., 10 Feb 2025, Wu et al., 24 Nov 2025, Hu et al., 2 Nov 2025).
1. Motivating Redundancy Patterns in Full Attention
In transformers operating over spatiotemporal (e.g., video) or long textual data, attention matrices naturally take the form of large, regularly partitioned blocks, or "tiles." For a video with $F$ frames, each of $L$ tokens, the attention matrix ($FL \times FL$) splits into $F \times F$ tiles of size $L \times L$. Empirical analysis shows that, after training, the heaviest attention weights concentrate along the main diagonal (tiles corresponding to within-frame interactions) and in a few global reference frames, while off-diagonal tiles are highly repetitive and of low importance. In text, similar block or sequence-local structure emerges, especially under architectures such as NSA, which expose local and global context separately (Ding et al., 10 Feb 2025, Hu et al., 2 Nov 2025). This suggests that the result of dense full attention is dominated by a small, structured subset of tiles.
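To make the tile structure concrete, the following minimal sketch measures how much attention mass falls inside the main-diagonal tiles of a given attention matrix. The function name `diagonal_tile_mass` and the toy matrix are illustrative assumptions, not taken from the cited papers.

```python
def diagonal_tile_mass(attn, tile_len):
    """Fraction of total attention mass lying in main-diagonal tiles.

    attn: square matrix (list of rows) of non-negative attention weights.
    tile_len: tokens per tile (e.g., tokens per frame); must divide len(attn).
    """
    n = len(attn)
    if n % tile_len != 0:
        raise ValueError("tile_len must divide the sequence length")
    total = 0.0
    on_diag = 0.0
    for i, row in enumerate(attn):
        for j, w in enumerate(row):
            total += w
            # Query i and key j share a diagonal tile when their tile ids match.
            if i // tile_len == j // tile_len:
                on_diag += w
    return on_diag / total


# Toy example (hypothetical values): 4 tokens, 2 tiles of 2 tokens each.
toy = [
    [0.60, 0.30, 0.05, 0.05],
    [0.30, 0.60, 0.05, 0.05],
    [0.05, 0.05, 0.50, 0.40],
    [0.05, 0.05, 0.40, 0.50],
]
share = diagonal_tile_mass(toy, tile_len=2)
print(f"diagonal-tile share of attention mass: {share:.2f}")
```

A share close to 1 on real attention maps would indicate that little mass is lost by skipping off-diagonal tiles, which is the pattern the section describes.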
2. Formal Definitions and Core SSTA Variants
Tile Partitioning
Tokens are mapped to 3D spatiotemporal coordinates (for video) or 1D sequence indices (for text). The full token set is partitioned into non-overlapping blocks ("tiles"):
- Video: Tiles span contiguous frames or spatial subregions.
- Text: Tiles are contiguous subsequences of a fixed length.
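A tile partition amounts to a mapping from token coordinates to a tile index. The sketch below shows one possible mapping for both cases; the function names, the row-major ordering of video tiles, and the requirement that tile sizes divide the grid evenly are assumptions made for illustration.

```python
def text_tile_id(pos, tile_len):
    """Tile index of a 1D token position for contiguous tiles of tile_len tokens."""
    return pos // tile_len


def video_tile_id(t, y, x, grid, tile):
    """Tile index of a token at (t, y, x) in a (T, H, W) token grid.

    grid: (T, H, W) size of the token grid.
    tile: (tt, th, tw) tile size along each axis; assumed to divide the grid.
    Tiles are numbered in row-major order over (time, height, width).
    """
    T, H, W = grid
    tt, th, tw = tile
    if T % tt or H % th or W % tw:
        raise ValueError("tile size must divide the grid along every axis")
    tiles_h, tiles_w = H // th, W // tw
    return (t // tt) * tiles_h * tiles_w + (y // th) * tiles_w + (x // tw)


# Usage: position 37 with 16-token tiles lands in tile 2.
print(text_tile_id(37, 16))
# Usage: a 4x4x4 grid cut into 2x2x2 tiles gives 8 tiles, indexed 0..7.
print(video_tile_id(3, 3, 3, grid=(4, 4, 4), tile=(2, 2, 2)))
```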
Selective Tile Attention (STA)
STA restricts attention to a sparse subset of tiles:
- All main-diagonal tiles (self and within-frame/within-block).
- A fixed set of "global reference" blocks, either pre-selected or dynamically chosen by a scoring function.
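Writing $\operatorname{tile}(\cdot)$ for the tile index of a token and $\mathcal{G}$ for the set of global reference tiles, the mask for query $i$ and key $j$ can be written as:
$$M^{\mathrm{STA}}_{ij} = \begin{cases} 1, & \operatorname{tile}(i) = \operatorname{tile}(j) \ \text{or} \ \operatorname{tile}(j) \in \mathcal{G}, \\ 0, & \text{otherwise.} \end{cases}$$
The sparse attention is then standard softmax attention restricted to the unmasked pairs:
$$\operatorname{Attn}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d}} + \log M^{\mathrm{STA}}\right) V,$$
with the convention $\log 0 = -\infty$, so that masked pairs receive zero weight.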
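The two selection rules listed above (same-tile pairs plus a set of global reference tiles) can be sketched as a token-level mask, together with a dense reference implementation of masked attention. This is an illustration only: `sta_mask` and `masked_attention` are hypothetical names, the global tile set is passed in by hand, and the reference loop still costs O(n^2), so it shows the semantics rather than the speedup of a block-sparse kernel.

```python
import math


def sta_mask(tile_ids, global_tiles):
    """Token-level STA mask: mask[i][j] is True if query i may attend key j."""
    g = set(global_tiles)
    return [[ti == tj or tj in g for tj in tile_ids] for ti in tile_ids]


def masked_attention(q, k, v, mask):
    """Softmax attention over allowed keys only (dense reference)."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # Scaled dot-product scores, computed for unmasked keys only.
        scores = {
            j: sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
            for j, kj in enumerate(k)
            if mask[i][j]
        }
        top = max(scores.values())  # subtract the max for numerical stability
        weights = {j: math.exp(s - top) for j, s in scores.items()}
        z = sum(weights.values())
        out.append([
            sum(w * v[j][c] for j, w in weights.items()) / z
            for c in range(len(v[0]))
        ])
    return out


# Six tokens in three tiles; tile 0 is the (assumed) global reference tile.
tile_ids = [0, 0, 1, 1, 2, 2]
mask = sta_mask(tile_ids, global_tiles={0})
q = k = [[0.0, 0.0] for _ in tile_ids]          # uniform scores
v = [[float(j)] for j in range(len(tile_ids))]  # value j at key j
out = masked_attention(q, k, v, mask)
```

With uniform scores, each output is the mean of the values at the allowed keys, which makes it easy to check which keys a query actually sees.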
Sliding Tile Attention (SlTA)
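To capture local context missed by global selection, SlTA augments STA by enabling attention within a sliding temporal or spatial window (e.g., adjacent frames or blocks). With $\operatorname{tile}(\cdot)$ denoting a token's tile index and $w$ the window radius in tiles, the corresponding mask can be written as
$$M^{\mathrm{SlTA}}_{ij} = \max\!\left(M^{\mathrm{STA}}_{ij},\ \mathbb{1}\!\left[\,|\operatorname{tile}(i) - \operatorname{tile}(j)| \le w\,\right]\right),$$
where $M^{\mathrm{STA}}$ is the selective mask defined above.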
(Ding et al., 10 Feb 2025, Wu et al., 24 Nov 2025, Hu et al., 2 Nov 2025)
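As a minimal sketch of the sliding-tile augmentation, the function below allows a key if it passes the selective rule or lies within a window of neighbouring tiles. It assumes a 1D tile index and a window radius measured in tiles; for video, a per-axis window over (time, height, width) tile coordinates would be the natural extension. The name `slta_mask` is illustrative.

```python
def slta_mask(tile_ids, global_tiles, window):
    """Token-level mask: selective rule OR sliding window over tile indices.

    tile_ids: tile index of each token (1D tile indexing assumed).
    global_tiles: tile indices every query may attend to.
    window: radius in tiles; key tile must satisfy |ti - tj| <= window.
    """
    g = set(global_tiles)
    return [
        [ti == tj or tj in g or abs(ti - tj) <= window for tj in tile_ids]
        for ti in tile_ids
    ]


# Five tiles with one token each, tile 0 global, radius-1 window.
mask = slta_mask([0, 1, 2, 3, 4], global_tiles={0}, window=1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The printed pattern shows a band around the diagonal plus a filled first column for the global tile.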
3. Block-Level and Hybrid SSTA Implementations
SSTA can be unified and generalized beyond simple spatiotemporal tiles:
- Block Partition and Pooling: For arbitrary tensors, tokens are partitioned into $B$ blocks of $N$ tokens each; each block is pooled (e.g., adaptive average) to a $d$-dim vector.
- Block Importance Scoring: Inter-block importance is computed via pooled features and redundancy measures; query blocks select their top-$k$ key blocks.
- Hybrid Local+Global Mask: The final mask is the intersection (logical AND) of the learned selective (top-$k$) and static sliding-window masks: $M = M_{\mathrm{sel}} \land M_{\mathrm{sta}}$.
Efficient block-sparse kernels (e.g., flex_block_attention, CUDA-optimized) are used for implementation; SSTA is modular and can drop into standard transformer blocks (Wu et al., 24 Nov 2025).
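The three steps listed above can be sketched at block level in plain Python. This is a simplified illustration under stated assumptions: blocks are mean-pooled, importance is the dot product of pooled vectors only (the redundancy term is omitted because its form is not given here), and all function names are hypothetical.

```python
def mean_pool(blocks):
    """Pool each block (a list of d-dim token vectors) to one d-dim vector."""
    return [[sum(col) / len(block) for col in zip(*block)] for block in blocks]


def topk_block_mask(q_blocks, k_blocks, k):
    """For each query block, mark its k highest-scoring key blocks."""
    q_bar, k_bar = mean_pool(q_blocks), mean_pool(k_blocks)
    n = len(k_bar)
    mask = []
    for qv in q_bar:
        scores = [sum(a * b for a, b in zip(qv, kv)) for kv in k_bar]
        keep = set(sorted(range(n), key=lambda j: scores[j], reverse=True)[:k])
        mask.append([j in keep for j in range(n)])
    return mask


def sliding_block_mask(n, window):
    """Static local mask: block i may attend block j if |i - j| <= window."""
    return [[abs(i - j) <= window for j in range(n)] for i in range(n)]


def hybrid_mask(selective, local):
    """Intersection (logical AND) of the selective and sliding masks."""
    return [[a and b for a, b in zip(rs, rl)] for rs, rl in zip(selective, local)]


# Three blocks of two 2-dim tokens each (toy values).
blocks = [
    [[2.0, 0.0], [4.0, 0.0]],
    [[0.0, 1.0], [0.0, 3.0]],
    [[1.0, 1.0], [1.0, 3.0]],
]
m_sel = topk_block_mask(blocks, blocks, k=2)
m_sta = sliding_block_mask(len(blocks), window=1)
m = hybrid_mask(m_sel, m_sta)
```

Because the masks are intersected, a block that is selected by top-k but lies outside the window is dropped: here block 0 selects block 2, yet the final mask keeps only block 0 for that query.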
4. Algorithmic Structure and Pseudocode
A typical forward pass for SSTA:
```python
Q_b, K_b, V_b = split_into_blocks(Q, K, V, tile)  # shape [h, B, N, d]
Q̄ = adaptive_avg_pool(Q_b, out_size=1)  # [h, B, d]
K̄ = adaptive_avg_pool(K_b, out_size=1)  # [h, B, d]
S_s = einsum('h i d, h j d -> h i j', Q̄, K̄)  # block similarity
R = compute_redundancy(K_b)  # intra-block redundancy
S_i = λ * S_s - β * R  # block importance scores
idx = topk(S_i, k, dim=-1)  # indices of selected blocks
M_sel = index_to_mask(idx, B)  # select mask [h, B, B]
M_sta = make_sliding_mask(B, tile, window)  # local mask [B, B]
M = M_sel & M_sta.unsqueeze(0)  # block mask [h, B, B]
O = flex_block_attention(Q, K, V, block_mask=M)
```
Language SSTA architectures alternate tile (sliding-window/MLA) and global (compression+selection/GLA) layers, further optimizing to halve the KV-cache footprint (Hu et al., 2 Nov 2025).
5. Complexity and Memory Analysis
| Mechanism | Compute Complexity | Memory (KV-cache) Reduction |
|---|---|---|
| Full Attention | or | None |
| STA | or | Linear in tokens () |
| SlTA | or | Linear in tokens () |
| SSTA (Hybrid) | or | Up to 2× memory reduction via alternation |
Dense attention incurs compute and memory costs that grow quadratically with context length. SSTA mechanisms achieve nearly linear complexity with respect to total context length, enabling scaling to hundreds of frames or 8K+ token contexts. The KV-cache is halved by alternating tile/global layers or by separating mask storage (Ding et al., 10 Feb 2025, Hu et al., 2 Nov 2025, Wu et al., 24 Nov 2025).
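The scaling contrast can be checked with a back-of-the-envelope count of attended query-key pairs. The sketch below assumes a 1D tile index, a fixed window radius, and a fixed number of global tiles taken to be the first tiles of the sequence; the configuration values are illustrative and not drawn from the cited papers.

```python
def dense_pairs(num_tiles, tile_len):
    """Query-key pairs scored by full attention."""
    n = num_tiles * tile_len
    return n * n


def sparse_pairs(num_tiles, tile_len, window, num_global):
    """Query-key pairs scored when each query tile attends only tiles within
    `window` of itself plus the first `num_global` tiles (assumed layout)."""
    tile_pairs = 0
    for i in range(num_tiles):
        for j in range(num_tiles):
            if abs(i - j) <= window or j < num_global:
                tile_pairs += 1
    return tile_pairs * tile_len * tile_len


# Doubling the number of tiles: dense cost grows 4x, sparse cost about 2x.
for tiles in (8, 16):
    d = dense_pairs(tiles, tile_len=4)
    s = sparse_pairs(tiles, tile_len=4, window=1, num_global=1)
    print(f"tiles={tiles:2d} dense={d:5d} sparse={s:4d} ratio={d / s:.1f}")
```

With the window and the number of global tiles held fixed, the sparse count grows in proportion to the number of tiles, which is the nearly linear behaviour described above.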
6. Hyperparameters, Implementation, and Empirical Results
Notable Choices (as reported across benchmarks):
- Tile/block size: Video— for . Text—block size –$32$.
- Sliding window radius: Video— (i.e., blocks). Text—window tokens.
- Top-$k$ selection: Video models per head. Text models select –$128$ per global layer.
- Pooling: Adaptive average (video); block-reduction or small MLP (text).
- Training: SSTA weights often initialized/distilled from dense-attention checkpoints to minimize quality degradation (Wu et al., 24 Nov 2025).
Performance Highlights
On 29-frame, 720p video generation, Efficient-vDiT with SSTA improves inference speed by 7.8× (from 9 min to under 1.3 min) with under 1% drop in VBench score (Ding et al., 10 Feb 2025). HunyuanVideo 1.5 achieves step speedup and reduces GPU memory by up to 20% on 121–241 frame 720p T2V, enabling single-4090 inference for 121-frame, 720p outputs (13.6GB peak) (Wu et al., 24 Nov 2025). For language, SSTA halves KV-cache memory, matches or exceeds full- and branchwise NSA on common-sense reasoning, retrieval, and long-context understanding tasks over Llama-like backbones (Hu et al., 2 Nov 2025).
7. Applications, Practical Impact, and Comparative Efficacy
SSTA has wide applicability in transformer architectures processing ultra-long contexts, specifically:
- Video diffusion transformers (Efficient-vDiT, HunyuanVideo 1.5) (Ding et al., 10 Feb 2025, Wu et al., 24 Nov 2025)
- Long-context LLMs with explicit alternation between tile/global layers (NSA and SSTA-enhanced) (Hu et al., 2 Nov 2025)
- Any large-scale model requiring efficient, scalable attention with minimal parameter/engineering overhead.
A plausible implication is that SSTA forms a principled, generalizable approach to sparse attention, blending adaptivity and locality, and is now integrated in state-of-the-art open-source systems.
References
- "Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile" (Ding et al., 10 Feb 2025)
- "HunyuanVideo 1.5 Technical Report" (Wu et al., 24 Nov 2025)
- "Optimizing Native Sparse Attention with Latent Attention and Local Global Alternating Strategies" (Hu et al., 2 Nov 2025)