
Compress and Attend Transformer (CAT)

Updated 16 April 2026
  • CAT models achieve efficiency by compressing the representations used for memory and attention, mitigating the quadratic scaling of dense self-attention.
  • They employ techniques such as chunk compression, convolutional augmentation, and low-rank factorization to trade off compute cost against quality and speed.
  • Empirical results show up to roughly 3× speedup and 2–9× memory savings while maintaining high recall and competitive performance on language and vision tasks.

Compress and Attend Transformer (CAT) refers to a broad family of architectures that achieve computational efficiency and controllable trade-offs in Transformers by compressing the representation or memory before the attention/recall step, or by augmenting standard attention mechanisms with local filters and compression bottlenecks. The CAT paradigm appears in a variety of forms—including chunk compression, memory contraction, convolutional augmentation, and head-level low-rank factorization—with the consistent aim of scaling Transformers to longer contexts or higher throughput with minimal accuracy loss and maximal architectural simplicity.

1. Foundational Concepts and Motivations

The primary motivation behind CAT architectures is to mitigate the quadratic scaling bottleneck of dense self-attention with respect to input sequence length, especially in settings with long contexts. Standard approaches—sparse attention, sliding windows, or fixed-size recurrent states—tend to introduce either recall failures or complex coupling with dense attention layers. CAT frameworks instead (a) compress the context into summary representations, (b) attend over this compressed memory, and (c) use chunk size or compression granularity as an explicit knob at test time for compute-quality trade-off (Prakash et al., 7 Nov 2025, Karami et al., 29 Dec 2025).
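
As a concrete illustration of steps (a) and (b), the following is a minimal sketch of the generic compress-then-attend pattern. The mean-pooling compressor and all names are illustrative stand-ins, not the mechanism of any specific paper discussed here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_then_attend(context, query, chunk_size):
    """(a) compress the context into per-chunk summaries,
    (b) attend over the compressed memory instead of all tokens."""
    L, d = context.shape
    n_chunks = L // chunk_size
    # (a) stand-in compressor: mean-pool each chunk into a single summary vector
    chunks = context[: n_chunks * chunk_size].reshape(n_chunks, chunk_size, d)
    summaries = chunks.mean(axis=1)                   # (n_chunks, d)
    # (b) attend over n_chunks summaries instead of L tokens
    scores = query @ summaries.T / np.sqrt(d)         # (1, n_chunks)
    return softmax(scores) @ summaries                # (1, d)

context = np.random.randn(1024, 64)                   # long context
query = np.random.randn(1, 64)                        # current query state
out = compress_then_attend(context, query, chunk_size=32)  # 32 summaries vs. 1024 tokens
```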

Variations also include augmenting the Transformer's query/key/value projections with convolutional filters for local context aggregation, or applying direct low-rank factorization and blockwise contraction-and-broadcast steps to the attention process (Li et al., 2024, Wen et al., 21 Sep 2025, Xiao et al., 2023).

2. Architectural Variants

Compress and Attend Transformer architectures span several implementation strategies:

2.1 Chunk Compression and Causal Attending

CAT (Prakash et al., 7 Nov 2025) divides the sequence into fixed-length chunks of $C$ tokens, compresses each chunk into a single vector via a parallel non-causal encoder, and generates each chunk autoregressively by attending to the stream of compressed chunk vectors and the intra-chunk tokens. When decoding token $j$ of chunk $i$, the model sees:

  • $x_{i,1..j-1}$ (the preceding tokens within chunk $i$)
  • $\{f_\theta(c_1), \ldots, f_\theta(c_{i-1})\}$ (the compressed representations of all preceding chunks)

Performance and memory scale as $O(L^2/C)$ (training) and $O((L/C + C)d)$ (inference), with the trade-off controlled explicitly by the chunk size $C$. Training with a mixture of chunk sizes and a learned indicator token enables adaptive switching across memory/compute budgets at test time. Results show little to no degradation in perplexity and recall down to modest $C$, with 1.4–3× speedup and 2–9× memory reduction versus a dense Transformer (Prakash et al., 7 Nov 2025).
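
The role of $C$ as a trade-off knob can be made concrete with a back-of-the-envelope accounting of the attention state a decoder must keep around. This is a sketch based only on the $O((L/C + C)d)$ expression above; the bookkeeping and constants are illustrative assumptions.

```python
def cat_state_entries(L, C, d):
    """Vectors a CAT-style decoder attends to at the end of a length-L sequence:
    L/C compressed chunk summaries plus up to C intra-chunk tokens, each of width d."""
    return (L // C + C) * d

def dense_state_entries(L, d):
    """A dense decoder keeps all L token representations."""
    return L * d

L, d = 8192, 1024
for C in (16, 64, 256):
    ratio = dense_state_entries(L, d) / cat_state_entries(L, C, d)
    print(f"C={C:4d}: attention state ~{ratio:.1f}x smaller than dense")
```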

2.2 Dynamic Memory Compression

Trellis (Karami et al., 29 Dec 2025) replaces the unbounded KV cache with small fixed-size memories that are updated recursively using online gradient descent on per-step reconstruction losses. The two-pass compression incorporates the key and value at each step through a forget-gated update of the form

$M_t = \alpha_t \odot M_{t-1} - \eta_t \nabla_{M}\, \ell_t(M_{t-1}; k_t, v_t),$

where $\ell_t$ is a squared-error reconstruction loss with $\ell_2$ regularization. At inference, only the fixed-size memories are updated online; all model parameters remain frozen. This yields linear time and $O(1)$ memory complexity overall. Trellis outperforms both quadratic and recent linear/recurrent baselines in language modeling and recall-centric tasks for long contexts (Karami et al., 29 Dec 2025).
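
A minimal sketch of an online, forget-gated memory update of the kind described above follows. The linear read-out $v \approx Mk$, the fixed gate and step size, and the exact loss form are assumptions for illustration rather than the Trellis parameterization.

```python
import numpy as np

def memory_step(M, k, v, alpha=0.95, eta=0.1, lam=1e-3):
    """One online gradient step on the per-step loss
    ||M k - v||^2 + lam * ||M||_F^2, preceded by a forget gate alpha."""
    M = alpha * M                              # forget-gated decay of the old memory
    err = M @ k - v                            # reconstruction error for the new (k, v) pair
    grad = 2 * np.outer(err, k) + 2 * lam * M  # gradient of the per-step loss w.r.t. M
    return M - eta * grad                      # memory stays fixed-size regardless of context length

d_k, d_v = 64, 64
M = np.zeros((d_v, d_k))                       # fixed-size memory (O(1) in sequence length)
for _ in range(1000):                          # stream of incoming (key, value) pairs
    k, v = np.random.randn(d_k), np.random.randn(d_v)
    M = memory_step(M, k, v)
recall = M @ k                                 # approximate value read back for the last key
```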

2.3 Convolutional Compression and Block Attention

The Convolution-Augmented Transformer (also abbreviated CAT; referred to here as Conv-CAT for disambiguation) incorporates 1D convolutional filters into the Q/K/V projections, thereby encoding local context before attention:

$Q = (\omega_Q * X)\,W_Q,\qquad K = (\omega_K * X)\,W_K,\qquad V = (\omega_V * X)\,W_V,$

where $*$ denotes a 1D (causal) convolution along the sequence dimension and $\omega_Q, \omega_K, \omega_V$ are learned filters.

Attention then proceeds as $\mathrm{softmax}(QK^\top/\sqrt{d})\,V$. To scale further, a block compress-and-attend variant downsamples the keys into a small set of landmark keys via convolution, computes hard attention over the landmarks to find the relevant block, and then performs full attention locally over the tokens of that block. This achieves sub-quadratic complexity with theoretical guarantees of exact recall (associative recall, copying) and perfect length generalization when block size and embedding dimensionality are chosen appropriately (Li et al., 2024).
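
The following is a minimal sketch of convolution-augmented projections, assuming a depthwise causal 1D filter applied before the Q/K/V linear maps; the filter length, initialization, and single-head setup are illustrative choices, not the paper's exact configuration.

```python
import numpy as np

def causal_conv1d(X, w):
    """Depthwise causal 1D convolution over the sequence axis.
    X: (L, d) token embeddings; w: (k, d) per-channel filter taps."""
    k, d = w.shape
    Xp = np.vstack([np.zeros((k - 1, d)), X])          # left-pad so position t sees only <= t
    return np.stack([(Xp[t:t + k] * w).sum(axis=0) for t in range(X.shape[0])])

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

L, d, k = 128, 32, 4
X = np.random.randn(L, d)
Wq, Wk, Wv = (np.random.randn(d, d) / np.sqrt(d) for _ in range(3))
wq, wk, wv = (np.random.randn(k, d) / k for _ in range(3))

# local context is aggregated by the convolution before the attention projections
Q = causal_conv1d(X, wq) @ Wq
K = causal_conv1d(X, wk) @ Wk
V = causal_conv1d(X, wv) @ Wv

mask = np.tril(np.ones((L, L)))                        # causal attention mask
scores = np.where(mask > 0, Q @ K.T / np.sqrt(d), -1e9)
out = softmax(scores) @ V
```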

2.4 Low-Rank Product Factorization

COMCAT (Xiao et al., 2023) compresses pre-trained ViTs by factorizing not the individual projection matrices but the head-level products $W_Q W_K^\top$ and $W_V W_O$:

$W_Q^{(h)} W_K^{(h)\top} \approx A_h B_h^\top, \qquad W_V^{(h)} W_O^{(h)} \approx C_h D_h^\top,$

with a separate rank $r_h$ chosen for each head $h$.

Auto-ML (Gumbel-Softmax) is used for per-layer, per-head rank selection under budget constraints, followed by low-rank fine-tuning. This approach recovers or even improves accuracy with 40–60% parameter/FLOPs reduction.
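
A minimal sketch of the core idea follows: factorizing a head-level product rather than the individual matrices. The fixed rank and the plain truncated-SVD factorization stand in for the paper's Gumbel-Softmax rank search and subsequent fine-tuning.

```python
import numpy as np

def factorize_product(Wq, Wk, rank):
    """Low-rank factorization of the head-level product Wq @ Wk.T.
    Returns A, B with A @ B.T approximating the product."""
    P = Wq @ Wk.T                                  # (d_model, d_model) product in the attention logits
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    A = U[:, :rank] * s[:rank]                     # (d_model, rank)
    B = Vt[:rank].T                                # (d_model, rank)
    return A, B

d_model, d_head, rank = 384, 64, 16
Wq = np.random.randn(d_model, d_head) / np.sqrt(d_model)
Wk = np.random.randn(d_model, d_head) / np.sqrt(d_model)
A, B = factorize_product(Wq, Wk, rank)

X = np.random.randn(10, d_model)
logits_full = (X @ Wq) @ (X @ Wk).T                # original attention logits for one head
logits_lr = (X @ A) @ (X @ B).T                    # logits via the rank-16 factorization
print(np.abs(logits_full - logits_lr).max())       # approximation error of the compressed head
```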

2.5 Contract-and-Broadcast Compression

CBSA (Wen et al., 21 Sep 2025) unrolls a contract-and-broadcast optimization into an interpretable, efficient attention layer:

  • Contract: Compress the input tokens into a much smaller set of representatives per subspace/head using cross-attention, and let the representatives self-attend.
  • Broadcast: Re-inject the contracted information to all tokens using the same coefficients (a minimal sketch of this pattern is given below).

CBSA generalizes existing schemes (full attention, Linformer, Performer, agent attention) as special cases, depending on the choice of contraction, projection, and representative extraction.
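
The sketch below illustrates the contract-and-broadcast pattern, omitting projections, normalization, and multi-head structure; the representative count and the residual update are illustrative assumptions, not the exact CBSA formulation.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def contract_broadcast(X, R, d):
    """X: (N, d) tokens; R: (m, d) learned representatives, m << N."""
    # Contract: each representative cross-attends to all tokens
    C = softmax(R @ X.T / np.sqrt(d))              # (m, N) contraction coefficients
    Z = C @ X                                      # (m, d) compressed representatives
    # Representatives interact among themselves (cheap: m x m attention)
    Z = softmax(Z @ Z.T / np.sqrt(d)) @ Z
    # Broadcast: re-inject the contracted information with the same coefficients
    return X + C.T @ Z                             # (N, d) updated tokens, O(N*m) cost

N, m, d = 1024, 16, 64
X = np.random.randn(N, d)
R = np.random.randn(m, d)                          # would be learned in practice
Y = contract_broadcast(X, R, d)
```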

3. Mathematical Foundations and Theoretical Guarantees

CAT variants justify their compression and attention steps under mild or empirically validated assumptions:

  • Identical-input assumption (CAN): Adjacent sublayers receive near-identical inputs (cosine similarity close to 1), justifying layer collapse (Li et al., 2021).
  • Linearity of projections: Sequential projections can be merged.
  • Compression-quality trade-off: By exposing the model to a range of chunk or memory sizes during training (e.g., by sampling the chunk size per batch; see the sketch after this list), a single CAT model supports online tuning of recall vs. efficiency without retraining (Prakash et al., 7 Nov 2025).
  • Provable recall & generalization: Conv-CAT can solve associative recall and copying with a single layer and exhibits length generalization provably superior to standard attention with positional encoding (Li et al., 2024).
  • Optimal memory allocation: Trellis demonstrates that forget-gate parameterization and curriculum on chunk/context size prevent information bottlenecks and instabilities in long-range modeling (Karami et al., 29 Dec 2025).
  • Interpretability and efficiency: The contract-and-broadcast formulation of CBSA provides a direct connection between coding-theoretic compression and linear-scaling attention mechanisms (Wen et al., 21 Sep 2025).
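
The compression-quality trade-off above can be illustrated with a toy training loop that samples a different chunk size at each step, so a single set of weights is exposed to every budget it may be asked to serve at test time. The size set and the stand-in loss below are illustrative assumptions, not the paper's objective.

```python
import numpy as np

rng = np.random.default_rng(0)
CHUNK_SIZES = [8, 16, 32, 64]                      # illustrative budget levels, not the paper's values

def chunked_loss(tokens, C):
    """Toy stand-in for a CAT training objective at chunk size C: the error of
    reconstructing each token from its chunk's mean summary (a real model would
    decode autoregressively against learned compressed chunk vectors)."""
    L, d = tokens.shape
    chunks = tokens[: (L // C) * C].reshape(-1, C, d)
    summaries = chunks.mean(axis=1, keepdims=True)
    return float(((chunks - summaries) ** 2).mean())

tokens = rng.standard_normal((256, 32))
for step in range(4):
    C = int(rng.choice(CHUNK_SIZES))               # sample a different compression budget per step
    print(f"step {step}: C={C:2d}, toy loss={chunked_loss(tokens, C):.3f}")
```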

4. Empirical Results and Benchmarking

Compress and Attend Transformer models consistently outperform both dense and other "efficient" architectures in key benchmarks:

| Model | Inference Speedup | Memory Savings | Typical Benchmark Results | Notable Features |
|---|---|---|---|---|
| CAT | 1.4–3.2× | 2–9× | LAMBADA, Wikitext, SWDE (beats dense/sparse/linear/SSM baselines) | Adaptive trade-off at test time |
| Trellis | — | O(1) memory | Outperforms Transformer++ on long-context, RULER, and common-sense tasks | Online memory compression |
| COMCAT | 1.5–2.6× | up to 61% | DeiT-base/ImageNet (improves top-1), diffusion FID | Head-level low-rank factorization |
| Conv-CAT | — | — | Perfect recall/copy; better LM perplexity/LAMBADA accuracy than LLAMA++, Mamba | 1D convolution + attention |
| CBSA | 30–40% FLOPs | — | 71.4% top-1 (CBT-Small) vs 72.4% (ViT-S); emergent segmentation | Principled contract/broadcast |

Performance gains on tasks favoring in-context recall or long-range reasoning appear especially strong for CAT-class methods, with recall/accuracy approaching or exceeding dense attention at far lower compute/memory cost (Prakash et al., 7 Nov 2025, Karami et al., 29 Dec 2025, Xiao et al., 2023, Li et al., 2024, Wen et al., 21 Sep 2025).

5. Applications Across Modalities and Domains

  • Language modeling: Reductions in O(L²) cost enable competitive perplexity and recall over multi-thousand-token contexts with practical acceleration and memory reduction (Prakash et al., 7 Nov 2025, Karami et al., 29 Dec 2025, Li et al., 2024).
  • Vision (classification, detection, diffusion): COMCAT and CBSA demonstrate that memory-efficient ViT architectures retain or improve accuracy (ImageNet, ADE20K) and enable lower-cost diffusion personalization with negligible storage/training overhead (Xiao et al., 2023, Wen et al., 21 Sep 2025).
  • One-shot learning: CAT modules for cross-attention (bi-directional) coupled with feature compression achieve superior AP and FPS over prior one-shot detection baselines on COCO, VOC, and FSOD (Lin et al., 2021).
  • Machine translation: Collapsing Transformer decoder sublayers into a compressed block yields a 1.3–1.4× speedup over strong baselines with ≤0.5 BLEU loss (14 WMT tasks) (Li et al., 2021).

6. Comparative Perspective and Connections

CAT models unify and systematize a spectrum of techniques:

  • Blockwise/chunkwise attention (CAT) generalizes windowed/sparse/mega-block attention, exposing a trade-off knob not available in fixed architectures.
  • Dynamic-compression models (Trellis, CBSA) integrate learned, data-dependent memory allocation and interpretable contracting.
  • Low-rank and convolutionally-augmented variants (COMCAT, Conv-CAT) recover classical linear/kernel or SSM layers as compositionally simple special cases.
  • All CAT approaches decouple the trade-off between computational efficiency and recall accuracy, enabling a single model to adapt across diverse application budgets and requirements.

7. Future Directions and Open Questions

Open challenges include adapting CAT-style architectures to extreme context lengths (hundreds of thousands of tokens or more), fine-grained adaptation at the sub-chunk or token level, integration with retrieval-augmented or external memory modules, and rigorous analysis of the trade-off between catastrophic forgetting and long-span recall under varied data distributions. A plausible implication is that continued advances in chunk-wise compression, memory allocation, and forgetting prevention may yield even more scalable and flexible efficient-attention models suitable for deployment in both real-time and foundation-model contexts.

Key implementations and codebases (e.g., for CAT, Trellis, COMCAT, CBSA) are available in the original papers for reproducibility and further investigation (Prakash et al., 7 Nov 2025, Karami et al., 29 Dec 2025, Xiao et al., 2023, Wen et al., 21 Sep 2025).
