
TempCache: Temporal KV Cache Compression

Updated 3 February 2026
  • TempCache is a temporal key-value cache compression method that merges similar keys across video frames to bound memory and compute costs.
  • It leverages approximate nearest neighbor search to identify semantic redundancies, enabling efficient integration with standard Transformer attention mechanisms.
  • Empirical results demonstrate up to 10.8x speedup and stable GPU memory usage while preserving high visual quality metrics.

TempCache is a temporal key-value (KV) cache compression module introduced for efficient inference in autoregressive video diffusion models and world models. Its core function is to identify and merge semantically redundant attention keys across generated video frames, thereby bounding memory and compute costs as sequence length grows. TempCache is a training-free, kernel-agnostic technique and can be integrated as a preprocessing step for self- and cross-attention in existing Transformers, yielding substantial improvements in speed and resource utilization while maintaining visual quality (Samuel et al., 2 Feb 2026).

1. Motivation and Problem Addressed

In autoregressive video diffusion and sequential world modeling, the typical transformer-driven approach maintains a continually growing cache of keys and values from all previously generated frames. As generation proceeds, both per-frame latency (due to increasing attention context) and memory requirements (to store the entire cache) scale roughly linearly with the rollout length $T$. This dynamic leads to bottlenecks in long-horizon or real-time applications, limiting feasible context windows and harming consistency over extended sequences. TempCache directly targets this scaling bottleneck by compressing the cache in a temporally-aware manner, thus stabilizing GPU memory and inference speed regardless of $T$ (Samuel et al., 2 Feb 2026).

2. Temporal Correspondence Mechanism

TempCache exploits the empirical observation that, in video, the representation (key vectors) of the same semantic entity (e.g., a moving object or region) often exhibits strong similarity across nearby frames. This redundancy motivates the search for "correspondences"—matching newly generated keys in the current frame to near-duplicates in the historical cache. To recover these correspondences efficiently at inference time, TempCache uses lightweight Approximate Nearest Neighbor (ANN) search techniques (such as LSH or quantized vector search) to locate, for each incoming key $k^{(t)}_i$, its most similar existing key $\hat k_j$ in the compressed cache (Samuel et al., 2 Feb 2026). If the similarity exceeds a user-defined threshold $\tau$, the key is merged into an existing group; otherwise, it becomes the representative of a new group.
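The matching step can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: brute-force cosine similarity replaces the ANN index (LSH or quantized search), and the threshold value is hypothetical.

```python
import numpy as np

def match_or_new(k_new, reps, tau=0.9):
    """Match a new key against group representatives.

    Brute-force cosine similarity stands in for the ANN search
    (LSH / quantized vectors); `tau` is the similarity threshold.
    Returns (group_index, similarity), with group_index None when no
    representative is similar enough (i.e., start a new group).
    """
    if len(reps) == 0:
        return None, -1.0
    R = np.stack(reps)  # (g, d) matrix of group representatives
    sims = R @ k_new / (np.linalg.norm(R, axis=1) * np.linalg.norm(k_new) + 1e-9)
    j = int(np.argmax(sims))
    return (j, float(sims[j])) if sims[j] >= tau else (None, float(sims[j]))

# A near-duplicate key merges into group 0; a dissimilar one starts a new group.
reps = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
idx, _ = match_or_new(np.array([0.99, 0.05]), reps, tau=0.9)
assert idx == 0
idx2, _ = match_or_new(np.array([-1.0, 0.1]), reps, tau=0.9)
assert idx2 is None
```

In a deployed system the linear scan would be replaced by an actual ANN index so that each lookup stays sublinear in the number of groups.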

3. Formal Mathematical Foundation

Let $Q \in \mathbb{R}^{N_q \times d}$ (queries), $K \in \mathbb{R}^{N_k \times d}$ (keys), and $V \in \mathbb{R}^{N_k \times d_v}$ (values). Standard attention operates as:

$$O = \mathrm{Softmax}\bigl(QK^\top/\sqrt d\bigr)\,V$$

In standard autoregressive generation, the cache grows as:

$$\hat K^{(t)} = \begin{bmatrix} K^{(1)} \\ K^{(2)} \\ \vdots \\ K^{(t)} \end{bmatrix}, \qquad \hat V^{(t)} = \begin{bmatrix} V^{(1)} \\ V^{(2)} \\ \vdots \\ V^{(t)} \end{bmatrix}$$

leading to $|\hat K^{(t)}| \approx tN$. TempCache replaces the cache with $g$ groups, where each group $G_\ell$ is a set of (approximately) redundant old keys. With representative key $k'_\ell$ and averaged value $\tilde v_\ell$, attention can be approximated as:

$$O = \mathrm{Softmax}\Bigl(Q(K')^\top/\sqrt d + \Delta\Bigr)\,\tilde V,$$

where $K' = [k'_1; \dots; k'_g]$, $\tilde V = [\tilde v_1; \dots; \tilde v_g]$, and $\Delta_{i,\ell} = \log m_\ell$ (with $m_\ell = |G_\ell|$) serves as a bias restoring exactness (Samuel et al., 2 Feb 2026). In practice, for approximate redundancy, $\Delta$ can often be omitted with minimal impact.
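The role of the bias term can be checked numerically: when the merged keys within a group are exact duplicates, grouped attention with $\Delta_{i,\ell} = \log m_\ell$ reproduces dense attention exactly. A minimal NumPy sketch (shapes and values are illustrative, not from the paper):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d, dv = 4, 3
Q = rng.normal(size=(2, d))

# Dense cache: group 0 holds 3 identical keys, group 1 holds 2.
k0, k1 = rng.normal(size=d), rng.normal(size=d)
K = np.stack([k0, k0, k0, k1, k1])
V = rng.normal(size=(5, dv))
O_dense = softmax(Q @ K.T / np.sqrt(d)) @ V

# Compressed cache: one representative key and mean value per group,
# plus the bias Delta_{i,l} = log|G_l|, which restores exactness.
K_rep = np.stack([k0, k1])
V_mean = np.stack([V[:3].mean(axis=0), V[3:].mean(axis=0)])
Delta = np.log(np.array([3.0, 2.0]))
O_comp = softmax(Q @ K_rep.T / np.sqrt(d) + Delta) @ V_mean

assert np.allclose(O_dense, O_comp)  # exact when group members are duplicates
```

For merely near-duplicate keys the equality becomes an approximation, which is what the attention-recall metric in Section 6 quantifies.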

4. Algorithmic Implementation

At inference, TempCache tracks groups of similar keys using a dynamic index and processes streaming frames as follows:

  1. Key Similarity Search: For each new key $k_i$, perform ANN search against existing group representatives to find the most similar $k'_\ell$, and compute the similarity $s = \langle k_i, k'_\ell \rangle$.
  2. Merge or Create Group: If $s \ge \tau$, merge $k_i$ and its value $v_i$ into group $G_\ell$, updating the representative and averaged value. Otherwise, start a new group.
  3. Cache Compression: At each attention call, replace the full cache with the representatives and value means of all current groups, keeping the total number of groups $g$ within a predetermined bound (empirically $\sim$300–500 per layer).
  4. Efficient Index Updating: Update the ANN structure after each insertion so that search operations remain low-latency.
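The four steps above can be combined into a streaming loop. The sketch below is illustrative and not the authors' implementation: it uses brute-force similarity in place of an ANN index, the "last-key" representative policy mentioned in Section 5, a running mean for group values, and (as a simplification) a forced merge into the best available group once the group budget is reached.

```python
import numpy as np

class TempCacheSketch:
    """Illustrative streaming KV-cache compressor (not the authors' code)."""

    def __init__(self, tau=0.9, max_groups=500):
        self.tau, self.max_groups = tau, max_groups
        self.reps, self.vmeans, self.counts = [], [], []

    def _best(self, k):
        """Return (index, cosine similarity) of the most similar representative."""
        if not self.reps:
            return None, -1.0
        R = np.stack(self.reps)
        sims = R @ k / (np.linalg.norm(R, axis=1) * np.linalg.norm(k) + 1e-9)
        j = int(np.argmax(sims))
        return j, float(sims[j])

    def add_frame(self, K, V):
        """Steps 1-2 and 4: match each new key, then merge or create a group."""
        for k, v in zip(K, V):
            j, s = self._best(k)
            if j is not None and (s >= self.tau or len(self.reps) >= self.max_groups):
                n = self.counts[j]
                self.vmeans[j] = (n * self.vmeans[j] + v) / (n + 1)  # running mean value
                self.counts[j] = n + 1
                self.reps[j] = k  # "last-key" representative policy
            else:
                self.reps.append(k)
                self.vmeans.append(v.copy())
                self.counts.append(1)

    def compressed(self):
        """Step 3: representatives, value means, and log-count bias for attention."""
        return (np.stack(self.reps), np.stack(self.vmeans),
                np.log(np.asarray(self.counts, dtype=float)))

# Ten frames of near-duplicate keys stay within a bounded number of groups.
rng = np.random.default_rng(1)
cache = TempCacheSketch(tau=0.95, max_groups=8)
base = rng.normal(size=(4, 16))
for _ in range(10):
    cache.add_frame(base + 0.01 * rng.normal(size=base.shape),
                    rng.normal(size=(4, 8)))
assert len(cache.reps) <= 8 and sum(cache.counts) == 40
```

The compressed representatives, value means, and log-count bias returned by `compressed()` would then be fed into the grouped attention formula of Section 3 in place of the full cache.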

Pseudocode and further implementation details are provided in (Samuel et al., 2 Feb 2026). The method is characterized by streaming, online operation and is fully compatible with standard transformer blocks.

5. Complexity and Hyperparameter Considerations

Complexity Analysis:

| Scenario | Cache Size | Per-frame Attention Cost | Cumulative Compute over $T$ |
|---|---|---|---|
| Without TempCache | $O(TN)$ | $O(tN^2d)$ (grows with $t$) | $O(N^2dT^2)$ |
| With TempCache | $O(N)$ (bounded by $g$) | $O(N^2d)$ (constant in $t$) | $O(N^2dT)$ (linear in $T$) |
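The scaling gap in the table can be illustrated with a toy cost count (arbitrary units; the sizes below are hypothetical, not from the paper):

```python
# Per-frame attention cost scales with (cache size) x (new tokens) x d.
# Dense cache holds t*N keys at step t; TempCache keeps it bounded near N.
N, d, T = 320, 64, 200

dense_total = sum(t * N * N * d for t in range(1, T + 1))  # quadratic in T
temp_total = sum(1 * N * N * d for _ in range(1, T + 1))   # linear in T

assert dense_total / temp_total == (T + 1) / 2  # ratio grows with rollout length
```

At $T = 200$ the dense rollout already spends about 100$\times$ more cumulative attention compute, and the gap widens linearly as the rollout continues.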

Key hyperparameters include the similarity threshold $\tau$ and the configuration of the ANN search (number of hash tables for LSH, codebook size for quantization). Tradeoffs involve balancing attention recall (fidelity to the original computation) against aggressiveness of compression and computational overhead. The "last-key" policy (assigning the newest key as group representative) provided optimal attention recall under rolling refresh conditions (Samuel et al., 2 Feb 2026).

6. Empirical Results and Performance Impact

Experiments on long-horizon streaming video diffusion and world modeling tasks demonstrate:

  • KV density reduction: TempCache retains only $\sim$16% (quantized) or $\sim$33% (in world models) of the original keys.
  • Attention recall: $\sim$91–92% of the dense attention mass is captured post-compression.
  • Speedup: Standalone TempCache achieves a 6.8–6.9$\times$ wall-clock speedup vs. dense attention, rising to 10.8$\times$ when combined with additional sparse attention modules.
  • Memory scaling: Peak GPU memory usage remains virtually flat with TempCache, whereas dense methods grow linearly with sequence length.
  • Quality preservation: PSNR/SSIM/LPIPS metrics are within $<$0.01 of baseline, and perceptual scores are statistically indistinguishable (Samuel et al., 2 Feb 2026).

7. Relation to Other Cache Reuse Techniques

TempCache differs fundamentally from cache-reuse acceleration schemes designed for diffusion sampling in image/video domains, such as TempCache-like local similarity methods in denoising-based diffusion transformers (Chu et al., 22 Aug 2025). Whereas some methods select steps for cache reuse based on pairwise output similarity—tending to cluster reuse near the end of the sampling process—TempCache in the autoregressive context leverages temporal correspondence across frames, compresses the attention context, and addresses memory/latency characteristics unique to streaming or world-model scenarios. This approach eliminates the progressive slowdown and memory blow-up that are intrinsic to classic autoregressive KV accumulation.

In summary, TempCache provides a mathematically grounded, efficient, and readily deployable solution to the challenges of scaling autoregressive video diffusion and world model inference, achieving consistent throughput and stable resource usage across arbitrarily long sequences with negligible degradation in output quality (Samuel et al., 2 Feb 2026).
