AnnCA: Accelerating Cross-Attention in Video Diffusion
- AnnCA is an approximate nearest-neighbor cross-attention method that prunes redundant prompt tokens to reduce computational load in video diffusion models.
- It builds online ANN indices using LSH or product quantization to select the most relevant tokens, effectively lowering cross-attention complexity from $O(N_q N_p d)$ to $O(N_q k d)$.
- Empirical results demonstrate up to 2.3× speedup with stable GPU memory usage across long video generation, while preserving near-identical visual quality.
AnnCA
AnnCA refers to Approximate Nearest-Neighbor Cross-Attention, a training-free method for accelerating cross-attention in transformer-based autoregressive video diffusion models by selectively pruning prompt tokens via approximate nearest neighbor (ANN) matching (Samuel et al., 2 Feb 2026). The method addresses the intensive compute and memory bottlenecks associated with dense cross-attention in streaming or long-form autoregressive generation, enabling large temporal context and efficient scaling without sacrificing output quality or requiring model retraining.
1. Motivation and Problem Setting
In autoregressive video diffusion and world modeling architectures, each generation step produces latent queries for the current frame that must attend to a (potentially large) set of prompt tokens encoding high-level context (e.g., text, image embeddings, or multimodal tokens) through cross-attention layers. Conventional methods use full cross-attention, with every query in the generated frame attending to all prompt tokens, leading to $O(N_q N_p d)$ compute and $O(N_p d)$ memory per frame ($N_q$: queries, $N_p$: prompt tokens, $d$: dimension). Empirically, only a subset of prompt tokens significantly influences any particular frame, resulting in extensive redundancy.
AnnCA is designed to address this redundancy by efficiently identifying and retaining only those prompt tokens most relevant to the latent queries for the current frame, thereby bounding computational complexity and memory growth during long rollouts.
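The redundancy claim above can be illustrated with a small numerical toy (all numbers here are illustrative, not from the paper): when cross-attention logits are peaked, a small top-$k$ subset of prompt tokens captures nearly all of each query's attention mass, which is exactly the slack AnnCA exploits.

```python
import numpy as np

# Illustrative toy: peaked attention logits concentrate mass on few tokens.
rng = np.random.default_rng(0)
N_q, N_p, d, k = 16, 512, 64, 32

Q = rng.standard_normal((N_q, d))
Kp = rng.standard_normal((N_p, d))

# The sharpening factor 4.0 mimics the peaked attention seen empirically.
logits = 4.0 * (Q @ Kp.T) / np.sqrt(d)
w = np.exp(logits - logits.max(axis=1, keepdims=True))
w /= w.sum(axis=1, keepdims=True)

# Mean fraction of attention mass held by each query's top-k prompt tokens.
coverage = np.sort(w, axis=1)[:, -k:].sum(axis=1).mean()
print(f"top-{k} of {N_p} tokens holds {coverage:.1%} of the attention mass")
```

Under this toy distribution the top 1/16 of prompt tokens carries almost all of the attention mass, so dropping the remainder changes the output negligibly.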
2. Algorithmic Structure
AnnCA applies an online ANN matching protocol to select frame-relevant prompt tokens for cross-attention computation:
- Index Construction:
- Build an ANN index (e.g., via locality-sensitive hashing [LSH] or product quantization [PQ]) on the keys for the full prompt token set.
- For LSH, prompt keys are projected onto random hyperplanes and bucketed by sign pattern; with PQ, keys are quantized into cells across $m$ subspaces.
- Query Matching:
- For each latent query ($N_q$ total for a frame), compute its hash code(s) or PQ cell.
- Probe the corresponding index bucket(s) for candidate nearest prompt tokens.
- Optionally re-rank candidates by exact inner product to select the top-$k$ neighbors per query.
- Token Accumulation:
- Collect the union of all candidate prompt tokens retrieved across all queries; deduplicate to obtain the final index subset $S$ of size $N_s$.
- Attention Execution:
- Extract $K_S$ and $V_S$ by selecting only the rows indexed by $S$.
- Replace full cross-attention with attention over the reduced $(K_S, V_S)$.
- Fallback:
- If no candidate is found for any query (rare), default to dense cross-attention.
Hyperparameters include the number of hash tables ($L$), hash bits ($b$), PQ subspaces ($m$), probe buckets per query ($p$), number of neighbors per query ($k$), and the distance metric (inner product/cosine).
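The selection protocol above can be sketched in numpy. This is a minimal illustrative implementation of the LSH variant (function names and defaults are assumptions, not the paper's API; the actual method runs as fused GPU kernels):

```python
import numpy as np

def annca_select(Q, Kp, n_tables=4, n_bits=8, k=8, seed=0):
    """Sketch of AnnCA-style LSH token selection (hypothetical implementation).

    Q  : (N_q, d) latent queries for the current frame
    Kp : (N_p, d) prompt keys
    Returns the deduplicated index subset S of retained prompt tokens.
    """
    rng = np.random.default_rng(seed)
    N_q, d = Q.shape
    N_p = Kp.shape[0]
    selected = set()
    for _ in range(n_tables):
        # Random-hyperplane LSH: sign pattern of projections -> bucket id.
        planes = rng.standard_normal((d, n_bits))
        key_codes = (Kp @ planes > 0) @ (1 << np.arange(n_bits))
        query_codes = (Q @ planes > 0) @ (1 << np.arange(n_bits))
        # Group prompt keys by bucket.
        buckets = {}
        for j, c in enumerate(key_codes):
            buckets.setdefault(int(c), []).append(j)
        for i, c in enumerate(query_codes):
            cand = buckets.get(int(c), [])
            if not cand:
                continue  # this table misses; other tables may still hit
            # Re-rank candidates by exact inner product, keep top-k.
            scores = Kp[cand] @ Q[i]
            top = np.argsort(scores)[-k:]
            selected.update(cand[t] for t in top)
    if not selected:
        return np.arange(N_p)  # fallback: dense cross-attention
    return np.array(sorted(selected))

def sparse_cross_attention(Q, Kp, Vp, S):
    """Cross-attention restricted to the retained prompt subset S."""
    d = Q.shape[1]
    logits = Q @ Kp[S].T / np.sqrt(d)
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ Vp[S]
```

Note the fallback path: if every table misses for every query, the selection degenerates to the full index set, reproducing dense cross-attention exactly.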
3. Mathematical Formulation and Complexity Analysis
Standard full cross-attention is defined by:

$$\mathrm{Attn}(Q, K_p, V_p) = \mathrm{softmax}\!\left(\frac{Q K_p^\top}{\sqrt{d}}\right) V_p,$$

with $Q \in \mathbb{R}^{N_q \times d}$ and $K_p, V_p \in \mathbb{R}^{N_p \times d}$ as above.
With AnnCA, only the subset $S \subseteq \{1, \dots, N_p\}$ of prompt tokens is used: extract $K_S = (K_p)_S$ and $V_S = (V_p)_S$, and compute:

$$\mathrm{Attn}(Q, K_S, V_S) = \mathrm{softmax}\!\left(\frac{Q K_S^\top}{\sqrt{d}}\right) V_S.$$
The complexity per frame is reduced from $O(N_q N_p d)$ to $O(N_q N_s d)$ for the sparse attention, with an additional $O(N_p d)$ cost for ANN index construction/retrieval. Empirical measurements show $N_s / N_p \approx 0.33$ (LSH)–$0.25$ (PQ), corresponding to a 3–4× reduction in cross-attention computation.
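A back-of-envelope FLOP comparison makes the reduction concrete (the dimensions below are illustrative placeholders, not taken from the paper; the 0.33 density matches the reported AnnCA-LSH figure):

```python
# Illustrative per-frame cost comparison: N_q queries, N_p prompt tokens, dim d.
N_q, N_p, d = 1560, 4096, 128
density = 0.33                # |S| / N_p, as reported for AnnCA-LSH
N_s = int(density * N_p)

dense_cost = N_q * N_p * d    # O(N_q N_p d): full cross-attention
sparse_cost = N_q * N_s * d   # O(N_q N_s d): attention over retained subset
index_cost = N_p * d          # O(N_p d): ANN index build/probe per frame

print(f"reduction: {dense_cost / (sparse_cost + index_cost):.2f}x")
```

The index-maintenance term is three orders of magnitude smaller than the attention term here, which is why the overall reduction tracks the density almost exactly.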
4. Empirical Performance and Memory Scaling
Extensive benchmarking on long-form video generation tasks (e.g., Rolling-Forcing LongVBench, 3000-frame inference) demonstrates that AnnCA:
- Reduces cross-attention density to ~30% of the dense baseline.
- Achieves a 2.2–2.3× reduction in end-to-end inference time (AnnCA-LSH, AnnCA-Quant).
- Holds peak GPU memory flat across long rollouts (~12 GB for 3000 frames with AnnCA+AnnSA+TempCache), whereas standard dense attention grows linearly (~20 GB for the FA3 backbone).
- Preserves near-identical visual quality (min-density, max-recall metrics ≥90%) (Samuel et al., 2 Feb 2026).
The following table summarizes key per-frame results:
| Method | Cross-Attn Density (%) | Max Recall (%) | Speedup (×) |
|---|---|---|---|
| Dense FA3 | 100 | 100 | 1.0 |
| AnnCA-LSH | 33.1 | 94.2 | 2.2 |
| AnnCA-Quant | 29.5 | 91.1 | 2.3 |
AnnCA serves as a primary contributor (~2–2.3×) to the total speedup, with further gains from TempCache (temporal cache compression for self-attention) and AnnSA (ANN-based sparse self-attention) in the unified pipeline.
5. Implementation and Integration
AnnCA is applied as an inference-time module and does not require model retraining or architectural modifications to the autoregressive backbone. In each generation step, the following process is executed for every transformer block:
- TempCache compresses historical self-attention key/value caches.
- AnnSA sparsifies intra-frame self-attention.
- AnnCA prunes prompt tokens for inter-frame (cross) attention using the above ANN-driven algorithm.
- Cross-attention is computed with the reduced prompt set.
All ANN indices can be constructed online per inference batch and optionally shared across layers for efficiency. The method is compatible with both FA3 and comparable autoregressive diffusion world models.
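The per-block pipeline can be sketched end to end. This is a toy stand-in under loudly stated assumptions: the TempCache compression is approximated by simple truncation, the AnnSA sparsification is omitted, and the AnnCA probe is replaced by an exact top-$k$ re-rank; none of the names below are the paper's API.

```python
import numpy as np

def softmax_attn(q, K, V):
    # Plain softmax attention used by both stand-in stages below.
    w = np.exp(q @ K.T / np.sqrt(q.shape[-1]))
    return (w / w.sum(-1, keepdims=True)) @ V

def generation_step(q, kv_cache, Kp, Vp, cache_budget=64, k=16):
    """One transformer-block step of the unified pipeline (illustrative)."""
    # TempCache stand-in: keep only the most recent cache_budget entries.
    K_hist, V_hist = kv_cache
    K_hist, V_hist = K_hist[-cache_budget:], V_hist[-cache_budget:]
    # AnnSA stand-in: self-attention over the compressed history.
    h = softmax_attn(q, K_hist, V_hist)
    # AnnCA: keep top-k prompt tokens per query by inner product (the
    # LSH/PQ probe is elided for brevity), then deduplicate into S.
    scores = h @ Kp.T
    S = np.unique(np.argsort(scores, axis=1)[:, -k:])
    # Cross-attention over the reduced prompt set only.
    return softmax_attn(h, Kp[S], Vp[S]), (K_hist, V_hist)
```

In a real deployment the selection index would be built once per step and, as noted above, optionally shared across layers rather than recomputed per block.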
6. Design Trade-offs and Considerations
Several operational trade-offs and open research directions are noted:
- Recall vs. Density: Aggressive pruning (low $k$) can increase speed but may slightly degrade fine details due to lower recall; the ANN index hyperparameters must be tuned to balance efficiency against quality.
- Index Maintenance: The overhead of building or updating the ANN index is small compared to the cost of dense attention, but may become significant with extremely large prompt sets.
- Extension Potential:
- Adaptive selection based on attention-mass feedback.
- Use of graph-based ANN (e.g., HNSW) for higher recall at constant $k$.
- Integration with dynamic or trainable prompt key projections for improved semantic clustering.
- Fusing ANN selection into single GPU kernels to reduce overhead.
- Exploration in multilingual or large multi-modal prompt regimes.
Empirical results indicate that even with fixed hyperparameters, AnnCA is robust and achieves its primary goal of bounding cross-attention growth while maintaining competitive fidelity, making it suitable for real-time, long-horizon, and interactive video generation scenarios.
7. Broader Impact and Relation to Adjacent Work
AnnCA represents a principled, training-free approach to scalable attention in sequential generative modeling, focusing on direct reduction of redundancy in cross-attention computations for temporally extended tasks. The core mechanisms—fast ANN index construction and semantically guided token selection—are applicable to other domains where selective context aggregation is required (e.g., large-context language modeling, memory-augmented reasoning, multi-modal retrieval).
AnnCA should not be conflated with similarly named neural or abnormal component analysis frameworks (NCA, ACA; Zhao, 2017; Valla et al., 2023), or with the neural correspondence analysis of Hsu et al. (2018), which also uses the acronym AnnCA. In the context of streaming video diffusion, AnnCA refers specifically to the approximate nearest-neighbor cross-attention sparsification module as described and evaluated in (Samuel et al., 2 Feb 2026).