
CacheFormer: Efficient Long-Context Transformer

Updated 7 January 2026
  • CacheFormer is a Transformer-based architecture that efficiently handles long input contexts by integrating dynamic cache retrieval with multiple attention modules.
  • It achieves linear time and memory complexity by combining short-window, compressed long, dynamic cache, and overlapping segment attentions to mitigate computational bottlenecks and fragmentation.
  • Empirical evaluations show up to a 10% reduction in perplexity on benchmark language modeling tasks, highlighting its practical benefits for long-context processing.

CacheFormer is a Transformer-based architecture designed for efficient handling of long input contexts. It avoids the quadratic scaling of standard attention while improving predictive quality, with a perplexity reduction of ∼8–10% over comparable baselines on benchmark language modeling tasks. CacheFormer introduces a high-attention segment caching strategy that dynamically retrieves selected uncompressed segments, augmenting compressed global interactions with precise context where needed. The architecture combines four distinct attention modules (short sliding-window, long compressed segmented, dynamically retrieved top-k uncompressed, and overlapping compressed segmented attention) to achieve linear time and memory complexity, mitigating both computational bottlenecks and quality degradation on long input sequences (Singh et al., 18 Apr 2025).

1. Architectural Design and Segmentation

CacheFormer operates on a token sequence X ∈ ℝ^{n×d} and employs a dual segmentation scheme:

  • Short-window segments of size w serve local attention via a sliding-window mechanism.
  • Long segments of size s facilitate global context aggregation.
  • To counter segment-boundary fragmentation, overlapping long segments are defined with a stride of s/2.
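As a concrete illustration of this dual segmentation, the following sketch (toy sizes, not the authors' code) shows how a sequence splits into per-position local spans and disjoint long segments:

```python
# Illustrative sketch of CacheFormer's dual segmentation, with toy sizes.
n, w, s = 32, 4, 8  # sequence length, short-window size, long-segment size

# Short-window attention: each position t attends to a local span of
# roughly 2w tokens around t (clipped at the sequence boundaries).
def local_span(t, n, w):
    return (max(0, t - w), min(n, t + w))

# Long segments: disjoint blocks of size s, used for global (compressed)
# attention over the whole sequence.
long_segments = [(i, min(i + s, n)) for i in range(0, n, s)]

print(local_span(0, n, w))    # boundary position: (0, 4)
print(local_span(16, n, w))   # interior position: (12, 20)
print(long_segments)          # [(0, 8), (8, 16), (16, 24), (24, 32)]
```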

For each Transformer layer, four parallel attention modules are constructed (sharing the same query projection Q = XW^Q), each utilizing different key/value preparations:

  1. Short sliding-window attention (A_s): Attends to a local context of size 2w per token.
  2. Compressed segmented long-range attention (A_ℓ): Attends globally using a dynamically compressed representation of total length r ≪ n.
  3. Dynamic cache attention (A_c): Retrieves and attends to the top-k most relevant segments (plus u−1 neighbors), uncompressed, as determined by high segment-level attention scores.
  4. Overlapping compressed segmented attention (A_o): Performs global compressed attention over overlapping segments to reduce boundary fragmentation.

The outputs are aggregated by summing the two compressed attentions (A_ℓ + A_o), then concatenating this sum with the short-range (A_s) and cache (A_c) attentions along the key/value dimension. The resulting per-head attention is:

A_enhanced = [A_s, (A_ℓ + A_o), A_c] ∈ ℝ^{n×(2w + r + kus)}

This is followed by standard projection and feedforward processing as in vanilla Transformers.
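The aggregation step can be sketched numerically. The snippet below is a hedged illustration with random tensors standing in for the real attention outputs; only the shapes follow the description above:

```python
# Shape-level sketch of the per-head aggregation: sum the two compressed
# attentions, then concatenate with short-window and cache attentions
# along the key/value dimension. Toy hyperparameters, random tensors.
import numpy as np

n, w, r, k, u, s = 32, 4, 16, 2, 1, 8

A_s = np.random.rand(n, 2 * w)        # short sliding-window attention
A_l = np.random.rand(n, r)            # compressed long-segment attention
A_o = np.random.rand(n, r)            # overlapping compressed attention
A_c = np.random.rand(n, k * u * s)    # dynamic cache attention (uncompressed)

A_enhanced = np.concatenate([A_s, A_l + A_o, A_c], axis=1)
print(A_enhanced.shape)  # (n, 2w + r + kus) = (32, 40)
```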

2. Detailed Composition of Attention Mechanisms

Each attention variant is formulated for efficient and targeted context retrieval:

  • Short Sliding-Window Attention (A_s): For position t, keys and values for 2w adjacent tokens (with boundary zero-padding) are used. Computational cost is O(nw), providing uncompressed, local context affinity.
  • Compressed Segmented Long Attention (A_ℓ): The sequence is partitioned into m = n/s disjoint segments, each projected down (via a dynamic projection matrix P) to produce a total compressed length r. This enables global context interaction with O(nr) complexity.
  • Dynamic Cache Attention (A_c): Driven by the pre-softmax logits in A_ℓ, the algorithm identifies, per query region, the k most highly attended compressed segments (using the ℓ2 or RMS norm as a scalar score, averaged over non-overlapping windows of p tokens). Each top-k segment is supplemented with u−1 neighbors, and the full uncompressed keys and values from these segments are retrieved for precise attention computation, restoring otherwise lost detail.
  • Overlapping Compressed Segmented Attention (A_o): Overlapping segments of size s and stride s/2 (zero-padded as needed) are compressed similarly, addressing context that spans segment boundaries and reducing fragmentation effects.
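The segment compression in A_ℓ can be sketched as follows. This is a minimal illustration under assumed shapes (a single projection shared across segments for brevity), not the authors' implementation:

```python
# Sketch of compressing m = n/s disjoint segments to a total compressed
# length r via a projection P applied within each segment. Toy sizes.
import numpy as np

n, s, d = 32, 8, 8
m = n // s                # number of long segments
r = 16                    # total compressed length
c = r // m                # compressed tokens produced per segment

X = np.random.rand(n, d)
P = np.random.rand(c, s)  # per-segment projection (shared here for brevity)

segments = X.reshape(m, s, d)                             # (m, s, d)
compressed = np.einsum('cs,msd->mcd', P, segments).reshape(r, d)
print(compressed.shape)   # (16, 8): global keys/values of length r
```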

A pseudocode sketch for the cache-retrieval algorithm is given in the paper. The key steps are block reshaping, top-k index selection per averaging window, neighbor retrieval, stacking of uncompressed keys/values, and computation of the final attention.
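Those steps can be sketched in Python. This is a hedged reconstruction with toy sizes: the mean is used as a stand-in for the ℓ2/RMS scalar score, and the neighbor-expansion direction is an assumption, since the text does not specify it:

```python
# Hedged sketch of cache retrieval: window-averaged segment scores,
# top-k selection, neighbor expansion, and gathering uncompressed keys.
import numpy as np

n, s, p, k, u, d = 32, 8, 4, 2, 1, 8
m = n // s                      # number of long segments

# Pre-softmax segment-level logits from the compressed long attention:
# one score per (query position, segment); random here for illustration.
logits = np.random.rand(n, m)

# 1. Average scores over non-overlapping windows of p query tokens
#    (the paper uses an L2/RMS-norm scalar score; mean is a stand-in).
window_scores = logits.reshape(n // p, p, m).mean(axis=1)   # (n/p, m)

# 2. Per window, select the k most highly attended segments.
topk = np.argsort(window_scores, axis=1)[:, -k:]            # (n/p, k)

# 3. Expand each selected segment with u-1 neighbors, clipped to range
#    (expansion direction assumed; the paper's choice may differ).
def with_neighbors(idx, u, m):
    return sorted({min(m - 1, max(0, idx + j)) for j in range(u)})

# 4. Gather full, uncompressed keys for the chosen segments per window.
K = np.random.rand(m, s, d)     # uncompressed per-segment keys
for win, seg_ids in enumerate(topk):
    chosen = sorted({i for idx in seg_ids for i in with_neighbors(idx, u, m)})
    K_cache = np.concatenate([K[i] for i in chosen], axis=0)
    # final attention over K_cache would follow; shape is (<= k*u*s, d)
```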

3. Complexity and Computational Analysis

The overall computational complexity per layer in CacheFormer is determined by:

  • Short-window attention: O(nw)
  • Compressed long attention: O(nr)
  • Overlapping compressed attention: O(nr)
  • Cache attention: O(nkus)

Total complexity is:

O(nw + nr + nkus)

Given w, r, kus ≪ n, this scaling is essentially linear in n, substantially improving over the O(n²) cost of full attention. While the model incurs a larger constant factor compared to Longformer, Linformer, and related sparse/linear architectures (owing to the aggregation of four different attentions), it preserves high quality by fetching exact tokens for the most relevant segments as determined dynamically (Singh et al., 18 Apr 2025).
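A back-of-the-envelope comparison makes this concrete, using the hyperparameters reported in the evaluation below (n = 1024, w = 128, s = 16, r = 256, k = 7, u = 1) and counting score computations per layer with constants ignored:

```python
# Rough per-layer attention cost: full O(n^2) vs O(nw + nr + nkus),
# with the two compressed attentions each contributing an n*r term.
n, w, s, r, k, u = 1024, 128, 16, 256, 7, 1

full_attention   = n * n                              # 1048576
cacheformer_cost = n * w + 2 * n * r + n * k * u * s  # 770048

print(full_attention / cacheformer_cost)  # ~1.36x fewer at n = 1024
# Because w, r, and kus are fixed, the advantage grows with n:
# at n = 8192 with the same settings, the ratio is ~8.5x.
```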

4. Overlapping Segment Strategy and Fragmentation Mitigation

CacheFormer employs overlapping long segments with a stride of s/2 to address the problem of context fragmentation at segment boundaries. Where the non-overlapping segments cover indices [0..s−1], [s..2s−1], …, the overlapping segments cover [0..s−1], [s/2..3s/2−1], [s..2s−1], …, with padding as needed. These overlapped segments are compressed using the same projection as the disjoint segments. This arrangement ensures that information spanning segment edges is fully captured in at least one compressed representation, thus improving the continuity of learned dependencies.
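The overlapping index pattern can be generated directly (toy sequence length for illustration):

```python
# Overlapping segmentation: segments of size s at stride s/2, so every
# boundary between two disjoint segments falls inside some segment.
n, s = 32, 8
stride = s // 2

overlapping = [(start, start + s) for start in range(0, n - stride, stride)]
print(overlapping)
# [(0, 8), (4, 12), (8, 16), (12, 20), (16, 24), (20, 28), (24, 32)]
# e.g. the disjoint boundary between positions 7 and 8 lies inside (4, 12).
```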

5. Empirical Evaluation and Performance

CacheFormer was evaluated on WikiText-103 (perplexity) and enwik8 (bits per character, BPC) using models with parameters identical to the baseline Long-Short Transformer (12 layers, 12 heads, d = 768, ~122M parameters):

WikiText-103 (sequence length 1024, w = 128, s = 16, r = 256)

| Model Variant | Perplexity |
| --- | --- |
| Baseline Long-Short Transformer | 23.74 |
| + Overlap only | 23.47 |
| + Cache only (k = 7, u = 1) | 21.67 |
| + Cache + Overlap (k = 7, u = 1) | 21.32 |

The best overall configuration (perplexity 21.32) used k = 7, u = 1, a ∼10% improvement over the baseline. Ablation on cache size showed: (k = 3, u = 1) → 23.31; (k = 5, u = 1) → 22.75; (k = 7, u = 1) → 21.32; (k = 5, u = 3) → 21.26. Although (k = 5, u = 3) reaches a slightly lower perplexity, it retrieves a larger cache (ku = 15 segments versus 7), so k = 7, u = 1 was judged the optimal quality/cost trade-off.

enwik8 (BPC, 23M and 34.9M parameter models)

| Model | Baseline | + Cache (k = 7, u = 1) |
| --- | --- | --- |
| 23M | 1.192 | 1.188 |
| 34.9M | 1.173 | 1.167 |

BPC improvements were modest: prediction quality improved consistently, but the relative gains were smaller than those observed on word-level perplexity.

6. Insights, Limitations, and Future Work

CacheFormer is most effective in scenarios where long-range dependencies in input sequences are distributed heterogeneously—benefiting from the ability to dynamically cache and attend to the most critical segments. Empirically, dynamic cache attention yields the majority of the observed quality gains, while overlapping segment attention provides a consistent, smaller benefit.

A limitation is the added computational and implementation overhead from the dynamic cache retrieval; a recommended practical training regimen is to pretrain without the cache and fine-tune with it. Proposed directions include exploring hierarchical multi-level caching (to scale to even longer sequences), leveraging caching to further reduce overall model size (notably in large-scale LMs), and adaptively tuning k and u per layer or token position.

7. Summary and Context in the Literature

CacheFormer extends the Transformer architecture by introducing a multi-component attention mechanism oriented toward efficient and high-quality long-context modeling. Its design draws from established principles in caching and virtual memory, dynamically retrieving exact input tokens once high attention to their compressed representations is detected. In contrast to other sparse and compressed-attention models (Linformer, Longformer, Performer, SSMs), CacheFormer hybridizes local, global-compressed, dynamic cache, and overlap-based mechanisms, demonstrating linear complexity and improved benchmark perplexity within identical parameter and training budgets (Singh et al., 18 Apr 2025).
