
TempCache: Temporal KV Cache Compression

Updated 3 February 2026
  • TempCache is a temporal key-value cache compression method that merges similar keys across video frames to bound memory and compute costs.
  • It leverages approximate nearest neighbor search to identify semantic redundancies, enabling efficient integration with standard Transformer attention mechanisms.
  • Empirical results demonstrate up to 10.8x speedup and stable GPU memory usage while preserving high visual quality metrics.

TempCache is a temporal key-value (KV) cache compression module introduced for efficient inference in autoregressive video diffusion models and world models. Its core function is to identify and merge semantically redundant attention keys across generated video frames, thereby bounding memory and compute costs as sequence length grows. TempCache is a training-free, kernel-agnostic technique and can be integrated as a preprocessing step for self- and cross-attention in existing Transformers, yielding substantial improvements in speed and resource utilization while maintaining visual quality (Samuel et al., 2 Feb 2026).

1. Motivation and Problem Addressed

In autoregressive video diffusion and sequential world modeling, the typical transformer-driven approach maintains a continually growing cache of keys and values from all previously generated frames. As generation proceeds, both per-frame latency (due to increasing attention context) and memory requirements (to store the entire cache) scale roughly linearly with the rollout length $T$. This dynamic leads to bottlenecks in long-horizon or real-time applications, limiting feasible context windows and harming consistency over extended sequences. TempCache directly targets this scaling bottleneck by compressing the cache in a temporally-aware manner, thus stabilizing GPU memory and inference speed regardless of $T$ (Samuel et al., 2 Feb 2026).

2. Temporal Correspondence Mechanism

TempCache exploits the empirical observation that, in video, the representation (key vectors) of the same semantic entity (e.g., a moving object or region) often exhibits strong similarity across nearby frames. This redundancy motivates the search for "correspondences"—matching newly generated keys in the current frame to near-duplicates in the historical cache. To recover these correspondences efficiently at inference time, TempCache uses lightweight Approximate Nearest Neighbor (ANN) search techniques (such as LSH or quantized vector search) to locate, for each incoming key $k^{(t)}_i$, its most similar existing key $\hat k_j$ in the compressed cache (Samuel et al., 2 Feb 2026). If the similarity exceeds a user-defined threshold $\tau$, the key is merged into an existing group; otherwise, it becomes the representative of a new group.
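The matching step can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: brute-force cosine similarity replaces the ANN index (LSH or quantized search), and the threshold value is hypothetical.

```python
import numpy as np

def match_or_new(k_new, reps, tau=0.9):
    """Match a new key against group representatives.

    Brute-force cosine similarity stands in for the ANN search
    (LSH / quantized vectors); `tau` is the similarity threshold.
    Returns (group_index, similarity), with group_index None when no
    representative is similar enough (i.e., start a new group).
    """
    if len(reps) == 0:
        return None, -1.0
    R = np.stack(reps)  # (g, d) matrix of group representatives
    sims = R @ k_new / (np.linalg.norm(R, axis=1) * np.linalg.norm(k_new) + 1e-9)
    j = int(np.argmax(sims))
    return (j, float(sims[j])) if sims[j] >= tau else (None, float(sims[j]))

# A near-duplicate key merges into group 0; a dissimilar one starts a new group.
reps = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
idx, _ = match_or_new(np.array([0.99, 0.05]), reps, tau=0.9)
assert idx == 0
idx2, _ = match_or_new(np.array([-1.0, 0.1]), reps, tau=0.9)
assert idx2 is None
```

In a deployed system the linear scan would be replaced by an actual ANN index so that each lookup stays sublinear in the number of groups.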

3. Formal Mathematical Foundation

Let $Q \in \mathbb{R}^{N_q \times d}$ (queries), $K \in \mathbb{R}^{N_k \times d}$ (keys), and $V \in \mathbb{R}^{N_k \times d_v}$ (values). Standard attention operates as:

$$O = \mathrm{Softmax}\bigl(QK^\top/\sqrt d\bigr)\,V$$

In standard autoregressive generation, the cache grows as:

$$\hat K^{(t)} = \begin{bmatrix} K^{(1)} \\ K^{(2)} \\ \vdots \\ K^{(t)} \end{bmatrix}, \qquad \hat V^{(t)} = \begin{bmatrix} V^{(1)} \\ V^{(2)} \\ \vdots \\ V^{(t)} \end{bmatrix}$$

leading to $|\hat K^{(t)}| \approx tN$. TempCache replaces the cache with $g$ groups, where each group $G_\ell$ is a set of (approximately) redundant old keys. With representative key $k'_\ell$ and averaged value $\tilde v_\ell$, attention can be approximated as:

$$O = \mathrm{Softmax}\Bigl(Q(K')^\top/\sqrt d + \Delta\Bigr)\,\tilde V,$$

where $K' = [k'_1; \dots; k'_g]$, $\tilde V = [\tilde v_1; \dots; \tilde v_g]$, and $\Delta_{i,\ell} = \log m_\ell$ (with $m_\ell = |G_\ell|$) serves as a bias restoring exactness (Samuel et al., 2 Feb 2026). In practice, for approximate redundancy, $\Delta$ can often be omitted with minimal impact.
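The role of the bias term can be checked numerically: when the merged keys within a group are exact duplicates, grouped attention with $\Delta_{i,\ell} = \log m_\ell$ reproduces dense attention exactly. A minimal NumPy sketch (shapes and values are illustrative, not from the paper):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d, dv = 4, 3
Q = rng.normal(size=(2, d))

# Dense cache: group 0 holds 3 identical keys, group 1 holds 2.
k0, k1 = rng.normal(size=d), rng.normal(size=d)
K = np.stack([k0, k0, k0, k1, k1])
V = rng.normal(size=(5, dv))
O_dense = softmax(Q @ K.T / np.sqrt(d)) @ V

# Compressed cache: one representative key and mean value per group,
# plus the bias Delta_{i,l} = log|G_l|, which restores exactness.
K_rep = np.stack([k0, k1])
V_mean = np.stack([V[:3].mean(axis=0), V[3:].mean(axis=0)])
Delta = np.log(np.array([3.0, 2.0]))
O_comp = softmax(Q @ K_rep.T / np.sqrt(d) + Delta) @ V_mean

assert np.allclose(O_dense, O_comp)  # exact when group members are duplicates
```

For merely near-duplicate keys the equality becomes an approximation, which is what the attention-recall metric in Section 6 quantifies.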

4. Algorithmic Implementation

At inference, TempCache tracks groups of similar keys using a dynamic index and processes streaming frames as follows:

  1. Key Similarity Search: For each new key $k_i$, perform ANN search against existing group representatives to find the most similar $k'_\ell$, and compute the similarity $s = \langle k_i, k'_\ell \rangle$.
  2. Merge or Create Group: If $s \ge \tau$, merge $k_i$ and its value $v_i$ into group $G_\ell$, updating the representative and averaged value. Otherwise, start a new group.
  3. Cache Compression: At each attention call, replace the full cache with the representatives and value means of all current groups, keeping the total number of groups $g$ within a predetermined bound (empirically $\sim$300–500 per layer).
  4. Efficient Index Updating: Update the ANN structure after each insertion so that search operations remain low-latency.
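The four steps above can be combined into a streaming loop. The sketch below is illustrative and not the authors' implementation: it uses brute-force similarity in place of an ANN index, the "last-key" representative policy mentioned in Section 5, a running mean for group values, and (as a simplification) a forced merge into the best available group once the group budget is reached.

```python
import numpy as np

class TempCacheSketch:
    """Illustrative streaming KV-cache compressor (not the authors' code)."""

    def __init__(self, tau=0.9, max_groups=500):
        self.tau, self.max_groups = tau, max_groups
        self.reps, self.vmeans, self.counts = [], [], []

    def _best(self, k):
        """Return (index, cosine similarity) of the most similar representative."""
        if not self.reps:
            return None, -1.0
        R = np.stack(self.reps)
        sims = R @ k / (np.linalg.norm(R, axis=1) * np.linalg.norm(k) + 1e-9)
        j = int(np.argmax(sims))
        return j, float(sims[j])

    def add_frame(self, K, V):
        """Steps 1-2 and 4: match each new key, then merge or create a group."""
        for k, v in zip(K, V):
            j, s = self._best(k)
            if j is not None and (s >= self.tau or len(self.reps) >= self.max_groups):
                n = self.counts[j]
                self.vmeans[j] = (n * self.vmeans[j] + v) / (n + 1)  # running mean value
                self.counts[j] = n + 1
                self.reps[j] = k  # "last-key" representative policy
            else:
                self.reps.append(k)
                self.vmeans.append(v.copy())
                self.counts.append(1)

    def compressed(self):
        """Step 3: representatives, value means, and log-count bias for attention."""
        return (np.stack(self.reps), np.stack(self.vmeans),
                np.log(np.asarray(self.counts, dtype=float)))

# Ten frames of near-duplicate keys stay within a bounded number of groups.
rng = np.random.default_rng(1)
cache = TempCacheSketch(tau=0.95, max_groups=8)
base = rng.normal(size=(4, 16))
for _ in range(10):
    cache.add_frame(base + 0.01 * rng.normal(size=base.shape),
                    rng.normal(size=(4, 8)))
assert len(cache.reps) <= 8 and sum(cache.counts) == 40
```

The compressed representatives, value means, and log-count bias returned by `compressed()` would then be fed into the grouped attention formula of Section 3 in place of the full cache.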

Pseudocode and further implementation details are provided in (Samuel et al., 2 Feb 2026). The method is characterized by streaming, online operation and is fully compatible with standard transformer blocks.

5. Complexity and Hyperparameter Considerations

Complexity Analysis:

| Scenario | Cache Size | Per-frame Attention Cost | Cumulative Compute over $T$ |
|---|---|---|---|
| Without TempCache | $O(TN)$ | $O(tN^2d)$ (grows with $t$) | $O(N^2dT^2)$ |
| With TempCache | $O(N)$ (bounded by $g$) | $O(N^2d)$ (constant in $t$) | $O(N^2dT)$ (linear in $T$) |
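The scaling gap in the table can be illustrated with a toy cost count (arbitrary units; the sizes below are hypothetical, not from the paper):

```python
# Per-frame attention cost scales with (cache size) x (new tokens) x d.
# Dense cache holds t*N keys at step t; TempCache keeps it bounded near N.
N, d, T = 320, 64, 200

dense_total = sum(t * N * N * d for t in range(1, T + 1))  # quadratic in T
temp_total = sum(1 * N * N * d for _ in range(1, T + 1))   # linear in T

assert dense_total / temp_total == (T + 1) / 2  # ratio grows with rollout length
```

At $T = 200$ the dense rollout already spends about 100$\times$ more cumulative attention compute, and the gap widens linearly as the rollout continues.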

Key hyperparameters include the similarity threshold $\tau$ and the configuration of the ANN search (number of hash tables for LSH, codebook size for quantization). Tradeoffs involve balancing attention recall (fidelity to the original computation) against aggressiveness of compression and computational overhead. The "last-key" policy (assigning the newest key as group representative) provided optimal attention recall under rolling refresh conditions (Samuel et al., 2 Feb 2026).

6. Empirical Results and Performance Impact

Experiments on long-horizon streaming video diffusion and world modeling tasks demonstrate:

  • KV density reduction: TempCache retains only $\sim$16% (quantized) or $\sim$33% (in world models) of the original keys.
  • Attention recall: $\sim$91–92% of the dense attention mass is captured post-compression.
  • Speedup: Standalone TempCache achieves a 6.8–6.9$\times$ wall-clock speedup vs. dense attention, rising to 10.8$\times$ when combined with additional sparse attention modules.
  • Memory scaling: Peak GPU memory usage remains virtually flat with TempCache, whereas dense methods grow linearly with sequence length.
  • Quality preservation: PSNR/SSIM/LPIPS metrics are within $<$0.01 of baseline, and perceptual scores are statistically indistinguishable (Samuel et al., 2 Feb 2026).

7. Relation to Other Cache Reuse Techniques

TempCache differs fundamentally from cache-reuse acceleration schemes designed for diffusion sampling in image/video domains, such as TempCache-like local similarity methods in denoising-based diffusion transformers (Chu et al., 22 Aug 2025). Whereas some methods select steps for cache reuse based on pairwise output similarity—tending to cluster reuse near the end of the sampling process—TempCache in the autoregressive context leverages temporal correspondence across frames, compresses the attention context, and addresses memory/latency characteristics unique to streaming or world-model scenarios. This approach eliminates the progressive slowdown and memory blow-up that are intrinsic to classic autoregressive KV accumulation.

In summary, TempCache provides a mathematically grounded, efficient, and readily deployable solution to the challenges of scaling autoregressive video diffusion and world model inference, achieving consistent throughput and stable resource usage across arbitrarily long sequences with negligible degradation in output quality (Samuel et al., 2 Feb 2026).
