
AnnSA: Sparse Self-Attention in Video Diffusion

Updated 3 February 2026
  • AnnSA is an approximate nearest neighbor self-attention mechanism that sparsifies computations by selecting semantically relevant keys from the cache.
  • It employs efficient ANN techniques like LSH and product quantization to reduce time and memory complexity from quadratic to sublinear scaling in autoregressive video diffusion models.
  • The mechanism achieves up to 10.8× speedup and near-constant GPU memory during long video rollouts while maintaining high visual fidelity.

AnnSA (Approximate Nearest Neighbor Self-Attention) is a sparse attention mechanism designed to accelerate long-form autoregressive video diffusion and neural world model rollouts by limiting self-attention computations to a dynamically selected, semantically relevant subset of cached keys. AnnSA leverages lightweight, training-free approximate nearest neighbor (ANN) index structures to perform fast sublinear retrieval of temporally relevant context, substantially reducing computational and memory costs while maintaining visual fidelity over long sequences (Samuel et al., 2 Feb 2026).

1. Motivation and Context

In state-of-the-art autoregressive video diffusion models, each generated video frame is conditioned on all previous frames via self-attention: at every generation step, queries for the current frame attend to keys and values from previous frames, requiring computation and cache growth that scale quadratically or worse with sequence length. Given $T$ generated frames and $N$ spatial tokens per frame, standard self-attention requires $O((TN)^2 d)$ multiply-adds and $O(TNd)$ memory per layer, leading to severe scaling bottlenecks that restrict the feasible temporal context window and limit long-range consistency. AnnSA emerges as a direct response to these challenges by sparsifying the self-attention computation through fast, approximate neighbor selection, providing a framework that is compatible with existing video diffusion backbones and does not require retraining or modification of core weights (Samuel et al., 2 Feb 2026).

2. Core Algorithm and Data Structures

AnnSA replaces dense (all-to-all) self-attention with a sparse variant. For each query $q_i$ at the current timestep, only the $K \ll TN$ nearest keys from the accumulated key–value (KV) cache are included in the attention calculation. The process consists of:

  • Distance metric: Use the $\ell_2$ norm $\|q_i - k_j\|_2$ or the negative dot product $-q_i^\top k_j$ to quantify similarity between the current query $q_i$ and cached keys $k_j$.
  • Approximate nearest neighbor search: Efficient sublinear retrieval of the top-$K$ neighbors $\mathcal{N}(q_i)$ using either
    • Locality-sensitive hashing (LSH): Project vectors into a low-dimensional subspace, hash them into buckets via random hyperplanes, and select nearest bucket contents.
    • Product quantization (PQ): Quantize vectors into low-bit (e.g., 8-bit) codes, optionally indexing via HNSW or inverted lists.
  • Sparse attention computation: For each $q_i$, form the set $\mathcal{N}(q_i)$ of $K$ candidates and compute the attention logits and outputs restricted to these:

$$\ell_{ij} = \frac{q_i^\top k_j}{\sqrt{d}}, \quad \alpha_{ij} = \frac{\exp(\ell_{ij})}{\sum_{j' \in \mathcal{N}(q_i)} \exp(\ell_{ij'})}, \quad o_i = \sum_{j \in \mathcal{N}(q_i)} \alpha_{ij} v_j$$

The ANN index is updated incrementally as new frames are generated.
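The retrieval-then-attend procedure above can be sketched in a few lines of numpy. This is an illustrative single-table, single-query sketch, not the paper's implementation: the random-hyperplane hash, the bucket fallback, and all sizes (`512` cached keys, `64` dims, `8` hash bits) are assumptions for demonstration, and candidates are re-ranked by exact dot product within the bucket.

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_buckets(vectors, hyperplanes):
    """Hash vectors to integer bucket ids from random-hyperplane sign bits."""
    bits = (vectors @ hyperplanes.T) > 0               # (n, B) sign pattern
    return bits.astype(np.int64) @ (1 << np.arange(hyperplanes.shape[0], dtype=np.int64))

def sparse_attention(q, keys, values, hyperplanes, K=8):
    """Attend over at most K cached keys drawn from the query's LSH bucket."""
    key_ids = lsh_buckets(keys, hyperplanes)
    q_id = lsh_buckets(q[None, :], hyperplanes)[0]
    cand = np.flatnonzero(key_ids == q_id)             # candidate set N(q)
    if cand.size == 0:                                 # fallback if bucket is empty
        cand = np.arange(len(keys))
    cand = cand[np.argsort(-(keys[cand] @ q))[:K]]     # keep K most similar
    d = q.shape[0]
    logits = keys[cand] @ q / np.sqrt(d)               # l_j = q^T k_j / sqrt(d)
    w = np.exp(logits - logits.max())
    alpha = w / w.sum()                                # softmax over N(q) only
    return alpha @ values[cand]                        # o = sum_j alpha_j v_j

# toy KV cache: TN = 512 cached tokens, head dim d = 64, B = 8 hash bits
keys = rng.standard_normal((512, 64))
values = rng.standard_normal((512, 64))
planes = rng.standard_normal((8, 64))
q = keys[3] + 0.01 * rng.standard_normal(64)           # query near cached key 3
out = sparse_attention(q, keys, values, planes, K=8)
```

In practice the bucket lookup replaces the $O(TN)$ scan over all keys, which is where the sublinear retrieval cost comes from; a production index would use multiple tables or PQ codes to raise recall.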

3. Complexity and Efficiency Analysis

The table below summarizes key scaling properties of dense self-attention versus AnnSA, as detailed in (Samuel et al., 2 Feb 2026):

| Attention Type | Time Complexity | Memory Complexity | Per-step Cost ($K \ll TN$) |
|---|---|---|---|
| Dense (full) | $O(T^2 N^2 d)$ (total) | $O(TNd)$ | $O(TNd)$ |
| AnnSA sparse (LSH/PQ) | $O(TNd \log T)$ (total) | $O(TNd + TNb)$ | $O(N(Kd + \mathrm{ANN}_{\mathrm{search}}))$ |
  • With practical values $K = 8$–$16$ (neighbor count), AnnSA reduces self-attention density to $\sim$27%, pruning 70%+ of computations.
  • Peak GPU memory remains nearly constant over long rollouts due to the pruning of inactive keys (via the companion TempCache mechanism), in contrast to linear cache growth in the dense case.
  • End-to-end system speedups of $5.1$–$10.8\times$ relative to dense FlashAttention-3 were reported for 3000-frame rollouts, with nearly identical PSNR, SSIM, and LPIPS metrics (Samuel et al., 2 Feb 2026).
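To make the table's asymptotics concrete, the snippet below evaluates the two total-time expressions at illustrative sizes (the frame counts, token count, and head dimension are assumptions, not values from the paper). Constants are dropped, so these ratios only illustrate how the gap scales with rollout length; they do not predict the measured $5.1$–$10.8\times$ end-to-end speedups, which include index-build and kernel overheads.

```python
import math

d = 128                                      # head dimension (assumed)
for T, N in [(300, 1024), (3000, 1024)]:     # frames x tokens/frame (assumed)
    dense_time = (T * N) ** 2 * d            # O(T^2 N^2 d) total multiply-adds
    sparse_time = T * N * d * math.log(T)    # O(T N d log T) total
    print(f"T={T}: asymptotic time ratio ~ {dense_time / sparse_time:,.0f}x")
```

The key observation is that the ratio itself grows roughly linearly in $TN$, so the longer the rollout, the larger the benefit of sparsification.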

4. Integration with Autoregressive Diffusion Backbones

AnnSA is agnostic to specific autoregressive video diffusion architectures and can be integrated as follows:

  1. For each output frame tt, compute the new keys KtK_t and values VtV_t.
  2. Append new (Kt,Vt)(K_t, V_t) to the cache and update the ANN index (LSH tables, PQ codes, etc.).
  3. Optionally prune near-duplicate or unused cache entries using TempCache, reducing memory further.
  4. During self-attention, select KK approximate nearest neighbors per query via the ANN index and perform sparse attention using a block-sparse kernel (e.g., FlashInfer).
  5. All operations are training-free: existing model weights remain unchanged, and no additional gradient-based learning is required for the attention mechanism.
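The five integration steps above can be sketched as a minimal rollout loop. This is a toy stand-in under stated assumptions: an exact top-$K$ scan plays the role of the LSH/PQ index lookup, TempCache pruning is marked but not implemented, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, K = 64, 16, 8                         # head dim, tokens/frame, neighbors (illustrative)
kv_keys = np.empty((0, d))
kv_vals = np.empty((0, d))

def attend_sparse(q, keys, vals, k):
    # Exact top-k by dot product stands in for the approximate ANN lookup.
    idx = np.argsort(-(keys @ q))[:k]
    logits = keys[idx] @ q / np.sqrt(d)
    w = np.exp(logits - logits.max())
    return (w / w.sum()) @ vals[idx]

for t in range(4):                          # autoregressive rollout over 4 frames
    K_t = rng.standard_normal((N, d))       # 1. new keys for frame t
    V_t = rng.standard_normal((N, d))       #    ...and values
    kv_keys = np.vstack([kv_keys, K_t])     # 2. append to cache; the LSH/PQ index
    kv_vals = np.vstack([kv_vals, V_t])     #    would be updated incrementally here
    # 3. (optional) TempCache-style pruning of stale/duplicate entries would go here
    outs = np.stack([attend_sparse(q, kv_keys, kv_vals, K)   # 4. sparse attention
                     for q in K_t])
# 5. training-free: no model weights are touched anywhere in this loop
```

A real integration would replace the per-query Python loop with a block-sparse GPU kernel operating on the gathered neighbor indices.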

5. Hyperparameters and Trade-offs

The key tunable parameters in AnnSA, their influence, and recommended setting strategies are:

  • $K$ (neighbor count per query): Smaller $K$ increases speed/sparsity but may degrade recall if too few keys are selected; typical values $8$–$16$ yield $>90\%$ attention recall.
  • LSH parameters: Number of tables $L$, number of bits per hash $B$, and bucket width. More tables/bits yield higher recall but greater index overhead.
  • Quantization bit-width $b$: Lower $b$ yields faster search but may degrade match quality (e.g., $b=8$ achieves $0.80$ recall at $3\times$ faster throughput than $b=32$).
  • Similarity/distance thresholds: Tighter thresholds increase sparsity but risk missing true neighbors, impacting visual or temporal fidelity.

Ablation studies reveal negligible perceptual degradation until $K$ falls below $8$; aggressive pruning may impact fine detail and global consistency.
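The bit-width trade-off can be demonstrated directly: quantize the keys, retrieve the top-$K$ by dot product, and measure recall against full-precision retrieval. This toy uses uniform scalar quantization as a simple stand-in for product quantization, and the cache size, dimensionality, and bit-widths are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, K = 2048, 64, 16                     # cached keys, head dim, neighbors (assumed)
keys = rng.standard_normal((n, d)).astype(np.float32)
q = rng.standard_normal(d).astype(np.float32)

def quantize(x, bits):
    """Uniform scalar quantization to `bits` (toy stand-in for PQ codes)."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    codes = np.round((x - lo) / (hi - lo) * levels)   # encode to integer levels
    return codes / levels * (hi - lo) + lo            # dequantize for scoring

exact = set(np.argsort(-(keys @ q))[:K])              # full-precision top-K
for bits in (8, 4, 2):
    approx = set(np.argsort(-(quantize(keys, bits) @ q))[:K])
    recall = len(exact & approx) / K
    print(f"{bits}-bit keys: top-{K} recall = {recall:.2f}")
```

As in the ablations above, recall degrades gracefully as precision drops, and the retrieval cost per comparison shrinks with the code size.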

6. Empirical Results

Quantitative experiments using the LongVBench suite demonstrate:

  • AnnSA-LSH: Self-attention density $27.6\%$, attention recall $92.4\%$, PSNR $25.73$, SSIM $0.688$, LPIPS $0.142$, speedup $5.1\times$ over dense attention.
  • AnnSA-Quant: Similar results, with speedup $5.2\times$.
  • Full system (with TempCache, AnnCA): Speedup $10.7$–$10.8\times$, nearly constant GPU memory across long rollouts, and no visible quality loss compared to the dense baseline.

Qualitatively, generated videos remain temporally consistent and high-fidelity, with sparsity-induced artifacts becoming apparent only in extreme settings (very low $K$ or aggressive pruning) (Samuel et al., 2 Feb 2026).

7. Limitations and Applicability

  • Approximate neighbor selection: Aggressive sparsity or loose thresholds may omit crucial long-range context, leading to temporal drift or loss of global structure.
  • Index overhead: The cost of building/updating the ANN index and switching to a block-sparse attention kernel incurs constant overhead; for very short sequences, dense attention may remain preferable.
  • Worst-case scaling: If all KV pairs are highly distinct, little computational gain is possible (reduction to near-dense behavior).
  • Hyperparameter tuning: Empirical tuning is required per backbone and application domain; default settings offer robust performance for most autoregressive video diffusion tasks.
  • Training-free limitation: AnnSA is not a learned mechanism; its effectiveness depends on the statistical redundancy and local similarity present in learned KV representations.

AnnSA provides a scalable, architecture-agnostic mechanism for sparsifying self-attention in autoregressive video models, enabling high-throughput long-form generation, efficient inference, and tractable memory footprints without retraining or accuracy compromise (Samuel et al., 2 Feb 2026).
