TempCache: Temporal KV Cache Compression
- TempCache is a temporal key-value cache compression method that merges similar keys across video frames to bound memory and compute costs.
- It leverages approximate nearest neighbor search to identify semantic redundancies, enabling efficient integration with standard Transformer attention mechanisms.
- Empirical results demonstrate up to 10.8x speedup and stable GPU memory usage while preserving high visual quality metrics.
TempCache is a temporal key-value (KV) cache compression module introduced for efficient inference in autoregressive video diffusion models and world models. Its core function is to identify and merge semantically redundant attention keys across generated video frames, thereby bounding memory and compute costs as sequence length grows. TempCache is a training-free, kernel-agnostic technique and can be integrated as a preprocessing step for self- and cross-attention in existing Transformers, yielding substantial improvements in speed and resource utilization while maintaining visual quality (Samuel et al., 2 Feb 2026).
1. Motivation and Problem Addressed
In autoregressive video diffusion and sequential world modeling, the typical transformer-driven approach maintains a continually growing cache of keys and values from all previously generated frames. As generation proceeds, both per-frame latency (due to the increasing attention context) and memory requirements (to store the entire cache) scale roughly linearly with the rollout length $T$. This dynamic creates bottlenecks in long-horizon or real-time applications, limiting feasible context windows and harming consistency over extended sequences. TempCache directly targets this scaling bottleneck by compressing the cache in a temporally-aware manner, thus stabilizing GPU memory and inference speed regardless of $T$ (Samuel et al., 2 Feb 2026).
2. Temporal Correspondence Mechanism
TempCache exploits the empirical observation that, in video, the representations (key vectors) of the same semantic entity (e.g., a moving object or region) often exhibit strong similarity across nearby frames. This redundancy motivates the search for "correspondences"—matching newly generated keys in the current frame to near-duplicates in the historical cache. To recover these correspondences efficiently at inference time, TempCache uses lightweight Approximate Nearest Neighbor (ANN) search techniques (such as LSH or quantized vector search) to locate, for each incoming key $k$, its most similar existing key $\hat{k}_g$ in the compressed cache (Samuel et al., 2 Feb 2026). If the similarity exceeds a user-defined threshold $\tau$, the key is merged into the matching group; otherwise, it becomes the representative of a new group.
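The correspondence step can be sketched with exact (brute-force) cosine search standing in for the ANN index; the function name, shapes, and default threshold below are illustrative, not the paper's implementation:

```python
import numpy as np

def match_keys(new_keys, cached_keys, tau=0.9):
    """For each new key, find the most similar cached key by cosine
    similarity and report whether it clears the merge threshold tau."""
    a = new_keys / np.linalg.norm(new_keys, axis=1, keepdims=True)
    b = cached_keys / np.linalg.norm(cached_keys, axis=1, keepdims=True)
    sims = a @ b.T                          # (n_new, n_cached) cosine matrix
    best = sims.argmax(axis=1)              # nearest cached key per new key
    best_sim = sims[np.arange(len(a)), best]
    return best, best_sim, best_sim >= tau  # merge candidates above tau
```

Keys whose best similarity clears the threshold would be merged into the matched group; the rest open new groups. In a real pipeline this brute-force search is what the LSH or quantized-ANN index replaces, keeping per-key cost sublinear in the cache size.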
3. Formal Mathematical Foundation
Let $Q$ (queries), $K$ (keys), and $V$ (values) denote the attention inputs, each row a $d$-dimensional vector. Standard attention operates as:

$$\operatorname{Attn}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V.$$

In standard autoregressive generation, the cache grows as:

$$K_{1:t} = [K_1; K_2; \dots; K_t], \qquad V_{1:t} = [V_1; V_2; \dots; V_t],$$

leading to $O(t)$ per-frame attention cost. TempCache replaces the cache with $G$ groups, where each group $g$ is a set of (approximately) redundant old keys. With representative key $\hat{k}_g$ and averaged value $\bar{v}_g = \frac{1}{n_g}\sum_{i \in g} v_i$, attention can be approximated as:

$$\operatorname{Attn}(Q, K_{1:t}, V_{1:t}) \approx \operatorname{softmax}\!\left(\frac{Q\hat{K}^\top}{\sqrt{d}} + \mathbf{1}\,b^\top\right)\bar{V},$$

where $\hat{K} = [\hat{k}_1; \dots; \hat{k}_G]$, $\bar{V} = [\bar{v}_1; \dots; \bar{v}_G]$, and $b_g = \log n_g$ (with $n_g = |g|$) serves as a multiplicity bias that makes the approximation exact when each group's keys and values are identical (Samuel et al., 2 Feb 2026). In practice, for approximate redundancy, $b$ can often be omitted with minimal impact.
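The exactness role of the bias is easy to check numerically: when each group's keys are exact duplicates with identical values, attention over group representatives with a log-multiplicity bias ($\log n_g$, the group size) reproduces dense attention. A minimal sketch (helper names are ours, not the paper's):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dense_attention(q, K, V, d):
    """Standard softmax attention for a single query over the full cache."""
    return softmax(q @ K.T / np.sqrt(d)) @ V

def grouped_attention(q, K_hat, V_bar, n_g, d):
    """Attention over group representatives; the log(n_g) bias restores
    each group's total attention mass."""
    return softmax(q @ K_hat.T / np.sqrt(d) + np.log(n_g)) @ V_bar
```

With merely approximate redundancy the equality becomes an approximation, which is where the reported option to drop the bias with minimal impact comes in.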
4. Algorithmic Implementation
At inference, TempCache tracks groups of similar keys using a dynamic index and processes streaming frames as follows:
- Key Similarity Search: For each new key $k$, perform ANN search against existing group representatives to find the most similar representative $\hat{k}_g$, and calculate the similarity $\operatorname{sim}(k, \hat{k}_g)$.
- Merge or Create Group: If $\operatorname{sim}(k, \hat{k}_g) \ge \tau$, merge $k$ and its value into group $g$, updating the representative and averaged value. Otherwise, start a new group.
- Cache Compression: At each attention call, replace the full cache with the representatives and value means of all current groups, maintaining the total number of groups $G$ within a predetermined bound (empirically 300–500 per layer).
- Efficient Index Updating: Update the ANN structure after each insertion, ensuring search operations remain low-latency.
Pseudocode and further implementation details are provided in (Samuel et al., 2 Feb 2026). The method is characterized by streaming, online operation and is fully compatible with standard transformer blocks.
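The merge-or-create loop can be sketched as follows, with brute-force cosine search standing in for the ANN index; the class name is illustrative, and eviction/refresh policies for bounding the group count are omitted:

```python
import numpy as np

class GroupedKVCache:
    """Streaming merge-or-create grouping with a 'last-key' representative
    policy (a sketch of the cache maintenance loop, not the reference code)."""

    def __init__(self, tau=0.9):
        self.tau = tau
        self.reps, self.means, self.counts = [], [], []

    def insert(self, k, v):
        """Merge (k, v) into the most similar group, or open a new group."""
        if self.reps:
            R = np.stack(self.reps)
            sims = (R @ k) / (np.linalg.norm(R, axis=1) * np.linalg.norm(k))
            g = int(sims.argmax())
            if sims[g] >= self.tau:
                n = self.counts[g]
                self.means[g] = (n * self.means[g] + v) / (n + 1)  # running value mean
                self.counts[g] = n + 1
                self.reps[g] = k  # 'last-key' policy: newest key represents the group
                return g
        # No sufficiently similar group: open a new one. (A full implementation
        # would also bound the number of groups, e.g. 300-500 per layer.)
        self.reps.append(np.array(k, dtype=float))
        self.means.append(np.array(v, dtype=float))
        self.counts.append(1)
        return len(self.reps) - 1
```

At each attention call, `reps` and `means` (plus `counts` for the multiplicity bias) would be stacked and handed to the attention kernel in place of the full cache.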
5. Complexity and Hyperparameter Considerations
Complexity Analysis:
| Scenario | Cache Size | Per-frame Attention Cost | Cumulative Compute over $T$ frames |
|---|---|---|---|
| Without TempCache | $O(t)$ (grows with $t$) | $O(t)$ | $O(T^2)$ |
| With TempCache | $O(G)$ (bounded by $G$) | $O(G)$ (constant in $t$) | $O(GT)$ (linear in $T$) |
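The scaling gap can be illustrated with a toy count in which per-frame attention cost is proportional to the context size seen at that frame (the values of $G$ and $T$ below are illustrative):

```python
def cumulative_cost(T, per_frame_ctx):
    """Sum a per-frame attention cost proportional to context size."""
    return sum(per_frame_ctx(t) for t in range(1, T + 1))

G = 400   # bounded group count (within the empirical 300-500 per layer)
T = 1000  # rollout length

dense = cumulative_cost(T, lambda t: t)         # context grows with t -> O(T^2)
temp = cumulative_cost(T, lambda t: min(t, G))  # context bounded by G -> O(G*T)
```

Here `dense` is 500500 units versus 320200 for `temp`, and the gap widens quadratically as $T$ grows while the compressed variant stays linear.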
Key hyperparameters include the similarity threshold $\tau$ and the configuration of the ANN search (number of hash tables for LSH, codebook size for quantization). The tradeoff balances attention recall (fidelity to the original computation) against the aggressiveness of compression and the search overhead. Among representative-selection policies, the "last-key" policy (assigning the newest key as group representative) provided the best attention recall under rolling refresh conditions (Samuel et al., 2 Feb 2026).
6. Empirical Results and Performance Impact
Experiments on long-horizon streaming video diffusion and world modeling tasks demonstrate:
- KV density reduction: TempCache retains only 16% (quantized) or 33% (in world models) of original keys.
- Attention recall: 91–92% of dense attention mass is captured post-compression.
- Speedup: Standalone TempCache achieves a 6.8–6.9x wall-clock speedup vs. dense attention, and up to 10.8x when combined with additional sparse attention modules.
- Memory scaling: Peak GPU memory usage remains virtually flat with TempCache, whereas dense methods increase linearly with sequence length.
- Quality preservation: PSNR/SSIM/LPIPS metrics within 0.01 of baseline, and perceptual scores are statistically indistinguishable (Samuel et al., 2 Feb 2026).
7. Relation to Other Cache Reuse Techniques
TempCache differs fundamentally from cache-reuse acceleration schemes designed for diffusion sampling in the image/video domain, such as local-similarity cache-reuse methods for denoising-based diffusion transformers (Chu et al., 22 Aug 2025). Whereas those methods select steps for cache reuse based on pairwise output similarity—tending to cluster reuse near the end of the sampling process—TempCache in the autoregressive context leverages temporal correspondence across frames, compresses the attention context, and addresses memory/latency characteristics unique to streaming or world-model scenarios. This approach eliminates the progressive slowdown and memory blow-up that are intrinsic to classic autoregressive KV accumulation.
In summary, TempCache provides a mathematically grounded, efficient, and readily deployable solution to the challenges of scaling autoregressive video diffusion and world model inference, achieving consistent throughput and stable resource usage across arbitrarily long sequences with negligible degradation in output quality (Samuel et al., 2 Feb 2026).