
AnnSA: Sparse Self-Attention in Video Diffusion

Updated 3 February 2026
  • AnnSA is an approximate nearest neighbor self-attention mechanism that sparsifies computations by selecting semantically relevant keys from the cache.
  • It employs efficient ANN techniques like LSH and product quantization to reduce time and memory complexity from quadratic to sublinear scaling in autoregressive video diffusion models.
  • The mechanism achieves up to 10.8× speedup and near-constant GPU memory during long video rollouts while maintaining high visual fidelity.

AnnSA (Approximate Nearest Neighbor Self-Attention) is a sparse attention mechanism designed to accelerate long-form autoregressive video diffusion and neural world model rollouts by limiting self-attention computations to a dynamically selected, semantically relevant subset of cached keys. AnnSA leverages lightweight, training-free approximate nearest neighbor (ANN) index structures to perform fast sublinear retrieval of temporally relevant context, substantially reducing computational and memory costs while maintaining visual fidelity over long sequences (Samuel et al., 2 Feb 2026).

1. Motivation and Context

In state-of-the-art autoregressive video diffusion models, each generated video frame is conditioned on all previous frames via self-attention: at every generation step, queries for the current frame attend to keys and values from previous frames, requiring computation and cache growth that scale quadratically or worse with sequence length. Given $T$ generated frames and $N$ spatial tokens per frame, standard self-attention requires $O((TN)^2 d)$ multiply-adds and $O(TNd)$ memory per layer, leading to severe scaling bottlenecks that restrict the feasible temporal context window and limit long-range consistency. AnnSA emerges as a direct response to these challenges by sparsifying the self-attention computation through fast, approximate neighbor selection, providing a framework that is compatible with existing video diffusion backbones and does not require retraining or modification of core weights (Samuel et al., 2 Feb 2026).

2. Core Algorithm and Data Structures

AnnSA replaces dense (all-to-all) self-attention with a sparse variant. For each query $q_i$ at the current timestep, only the $K \ll TN$ nearest keys from the accumulated key–value (KV) cache are included in the attention calculation. The process consists of:

  • Distance metric: Use the $\ell_2$ norm $\|q_i - k_j\|_2$ or the negative dot product $-q_i^\top k_j$ to quantify similarity between the current query $q_i$ and cached keys $k_j$.
  • Approximate nearest neighbor search: Efficient sublinear retrieval of the top-$K$ neighbors $\mathcal{N}(q_i)$ using either
    • Locality-sensitive hashing (LSH): Project vectors into a low-dimensional subspace, hash them into buckets via random hyperplanes, and select nearest bucket contents.
    • Product quantization (PQ): Quantize vectors into low-bit (e.g., 8-bit) codes, optionally indexing via HNSW or inverted lists.
  • Sparse attention computation: For each $q_i$, form the set $\mathcal{N}(q_i)$ of $K$ candidates and compute the attention logits and outputs restricted to these:

$$\ell_{ij} = \frac{q_i^\top k_j}{\sqrt{d}}, \quad \alpha_{ij} = \frac{\exp(\ell_{ij})}{\sum_{j' \in \mathcal{N}(q_i)} \exp(\ell_{ij'})}, \quad o_i = \sum_{j \in \mathcal{N}(q_i)} \alpha_{ij} v_j$$

The ANN index is updated incrementally as new frames are generated.
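The retrieval-then-attend procedure above can be sketched in a few lines of numpy. This is an illustrative single-table, single-query sketch, not the paper's implementation: the random-hyperplane hash, the bucket fallback, and all sizes (`512` cached keys, `64` dims, `8` hash bits) are assumptions for demonstration, and candidates are re-ranked by exact dot product within the bucket.

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_buckets(vectors, hyperplanes):
    """Hash vectors to integer bucket ids from random-hyperplane sign bits."""
    bits = (vectors @ hyperplanes.T) > 0               # (n, B) sign pattern
    return bits.astype(np.int64) @ (1 << np.arange(hyperplanes.shape[0], dtype=np.int64))

def sparse_attention(q, keys, values, hyperplanes, K=8):
    """Attend over at most K cached keys drawn from the query's LSH bucket."""
    key_ids = lsh_buckets(keys, hyperplanes)
    q_id = lsh_buckets(q[None, :], hyperplanes)[0]
    cand = np.flatnonzero(key_ids == q_id)             # candidate set N(q)
    if cand.size == 0:                                 # fallback if bucket is empty
        cand = np.arange(len(keys))
    cand = cand[np.argsort(-(keys[cand] @ q))[:K]]     # keep K most similar
    d = q.shape[0]
    logits = keys[cand] @ q / np.sqrt(d)               # l_j = q^T k_j / sqrt(d)
    w = np.exp(logits - logits.max())
    alpha = w / w.sum()                                # softmax over N(q) only
    return alpha @ values[cand]                        # o = sum_j alpha_j v_j

# toy KV cache: TN = 512 cached tokens, head dim d = 64, B = 8 hash bits
keys = rng.standard_normal((512, 64))
values = rng.standard_normal((512, 64))
planes = rng.standard_normal((8, 64))
q = keys[3] + 0.01 * rng.standard_normal(64)           # query near cached key 3
out = sparse_attention(q, keys, values, planes, K=8)
```

In practice the bucket lookup replaces the $O(TN)$ scan over all keys, which is where the sublinear retrieval cost comes from; a production index would use multiple tables or PQ codes to raise recall.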

3. Complexity and Efficiency Analysis

The table below summarizes key scaling properties of dense self-attention versus AnnSA, as detailed in (Samuel et al., 2 Feb 2026):

| Attention Type | Time Complexity | Memory Complexity | Per-step Cost ($K \ll TN$) |
|---|---|---|---|
| Dense (full) | $O(T^2 N^2 d)$ (total) | $O(TNd)$ | $O(TNd)$ |
| AnnSA sparse (LSH/PQ) | $O(TNd \log T)$ (total) | $O(TNd + TNb)$ | $O(N(Kd + \mathrm{ANN}_{\mathrm{search}}))$ |
  • With practical values $K = 8$–$16$ (neighbor count), AnnSA reduces self-attention density to $\sim$27%, pruning 70%+ of computations.
  • Peak GPU memory remains nearly constant over long rollouts due to the pruning of inactive keys (via the companion TempCache mechanism), in contrast to linear cache growth in the dense case.
  • End-to-end system speedups of $5.1$–$10.8\times$ relative to dense FlashAttention-3 were reported for 3000-frame rollouts, with nearly identical PSNR, SSIM, and LPIPS metrics (Samuel et al., 2 Feb 2026).
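To make the table's asymptotics concrete, the snippet below evaluates the two total-time expressions at illustrative sizes (the frame counts, token count, and head dimension are assumptions, not values from the paper). Constants are dropped, so these ratios only illustrate how the gap scales with rollout length; they do not predict the measured $5.1$–$10.8\times$ end-to-end speedups, which include index-build and kernel overheads.

```python
import math

d = 128                                      # head dimension (assumed)
for T, N in [(300, 1024), (3000, 1024)]:     # frames x tokens/frame (assumed)
    dense_time = (T * N) ** 2 * d            # O(T^2 N^2 d) total multiply-adds
    sparse_time = T * N * d * math.log(T)    # O(T N d log T) total
    print(f"T={T}: asymptotic time ratio ~ {dense_time / sparse_time:,.0f}x")
```

The key observation is that the ratio itself grows roughly linearly in $TN$, so the longer the rollout, the larger the benefit of sparsification.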

4. Integration with Autoregressive Diffusion Backbones

AnnSA is agnostic to specific autoregressive video diffusion architectures and can be integrated as follows:

  1. For each output frame tt, compute the new keys KtK_t and values VtV_t.
  2. Append new (Kt,Vt)(K_t, V_t) to the cache and update the ANN index (LSH tables, PQ codes, etc.).
  3. Optionally prune near-duplicate or unused cache entries using TempCache, reducing memory further.
  4. During self-attention, select KK approximate nearest neighbors per query via the ANN index and perform sparse attention using a block-sparse kernel (e.g., FlashInfer).
  5. All operations are training-free: existing model weights remain unchanged, and no additional gradient-based learning is required for the attention mechanism.
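The five integration steps above can be sketched as a minimal rollout loop. This is a toy stand-in under stated assumptions: an exact top-$K$ scan plays the role of the LSH/PQ index lookup, TempCache pruning is marked but not implemented, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, K = 64, 16, 8                         # head dim, tokens/frame, neighbors (illustrative)
kv_keys = np.empty((0, d))
kv_vals = np.empty((0, d))

def attend_sparse(q, keys, vals, k):
    # Exact top-k by dot product stands in for the approximate ANN lookup.
    idx = np.argsort(-(keys @ q))[:k]
    logits = keys[idx] @ q / np.sqrt(d)
    w = np.exp(logits - logits.max())
    return (w / w.sum()) @ vals[idx]

for t in range(4):                          # autoregressive rollout over 4 frames
    K_t = rng.standard_normal((N, d))       # 1. new keys for frame t
    V_t = rng.standard_normal((N, d))       #    ...and values
    kv_keys = np.vstack([kv_keys, K_t])     # 2. append to cache; the LSH/PQ index
    kv_vals = np.vstack([kv_vals, V_t])     #    would be updated incrementally here
    # 3. (optional) TempCache-style pruning of stale/duplicate entries would go here
    outs = np.stack([attend_sparse(q, kv_keys, kv_vals, K)   # 4. sparse attention
                     for q in K_t])
# 5. training-free: no model weights are touched anywhere in this loop
```

A real integration would replace the per-query Python loop with a block-sparse GPU kernel operating on the gathered neighbor indices.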

5. Hyperparameters and Trade-offs

The key tunable parameters in AnnSA, their influence, and recommended setting strategies are:

  • $K$ (neighbor count per query): Smaller $K$ increases speed/sparsity but may degrade recall if too few keys are selected; typical values $8$–$16$ yield $>90\%$ attention recall.
  • LSH parameters: Number of tables $L$, number of bits per hash $B$, and bucket width. More tables/bits yield higher recall but greater index overhead.
  • Quantization bit-width $b$: Lower $b$ yields faster search but may degrade match quality (e.g., $b=8$ achieves $0.80$ recall at $3\times$ faster throughput than $b=32$).
  • Similarity/distance thresholds: Tighter thresholds increase sparsity but risk missing true neighbors, impacting visual or temporal fidelity.

Ablation studies reveal negligible perceptual degradation until $K$ falls below $8$; aggressive pruning may impact fine detail and global consistency.
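The bit-width trade-off can be demonstrated directly: quantize the keys, retrieve the top-$K$ by dot product, and measure recall against full-precision retrieval. This toy uses uniform scalar quantization as a simple stand-in for product quantization, and the cache size, dimensionality, and bit-widths are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, K = 2048, 64, 16                     # cached keys, head dim, neighbors (assumed)
keys = rng.standard_normal((n, d)).astype(np.float32)
q = rng.standard_normal(d).astype(np.float32)

def quantize(x, bits):
    """Uniform scalar quantization to `bits` (toy stand-in for PQ codes)."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    codes = np.round((x - lo) / (hi - lo) * levels)   # encode to integer levels
    return codes / levels * (hi - lo) + lo            # dequantize for scoring

exact = set(np.argsort(-(keys @ q))[:K])              # full-precision top-K
for bits in (8, 4, 2):
    approx = set(np.argsort(-(quantize(keys, bits) @ q))[:K])
    recall = len(exact & approx) / K
    print(f"{bits}-bit keys: top-{K} recall = {recall:.2f}")
```

As in the ablations above, recall degrades gracefully as precision drops, and the retrieval cost per comparison shrinks with the code size.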

6. Empirical Results

Quantitative experiments using the LongVBench suite demonstrate:

  • AnnSA-LSH: Self-attention density $27.6\%$, attention recall $92.4\%$, PSNR $25.73$, SSIM $0.688$, LPIPS $0.142$, speedup $5.1\times$ over dense attention.
  • AnnSA-Quant: Similar results, with speedup $5.2\times$.
  • Full system (with TempCache, AnnCA): Speedup $10.7$–$10.8\times$, nearly constant GPU memory across long rollouts, and no visible quality loss compared to the dense baseline.

Qualitatively, generated videos remain temporally consistent and high-fidelity, with sparsity-induced artifacts becoming apparent only in extreme settings (very low $K$ or aggressive pruning) (Samuel et al., 2 Feb 2026).

7. Limitations and Applicability

  • Approximate neighbor selection: Aggressive sparsity or loose thresholds may omit crucial long-range context, leading to temporal drift or loss of global structure.
  • Index overhead: The cost of building/updating the ANN index and switching to a block-sparse attention kernel incurs constant overhead; for very short sequences, dense attention may remain preferable.
  • Worst-case scaling: If all KV pairs are highly distinct, little computational gain is possible (reduction to near-dense behavior).
  • Hyperparameter tuning: Empirical tuning is required per backbone and application domain; default settings offer robust performance for most autoregressive video diffusion tasks.
  • Training-free limitation: AnnSA is not a learned mechanism; its effectiveness depends on the statistical redundancy and local similarity present in learned KV representations.

AnnSA provides a scalable, architecture-agnostic mechanism for sparsifying self-attention in autoregressive video models, enabling high-throughput long-form generation, efficient inference, and tractable memory footprints without retraining or accuracy compromise (Samuel et al., 2 Feb 2026).
