Compress and Attend Transformer (CAT)
- The paper introduces the CAT paradigm, which decouples sequence modeling into chunk-wise compression and causal decoding to achieve efficient attention.
- The approach leverages convolution-augmented layers and contract-and-broadcast attention to maintain high recall and robust generalization with reduced resource usage.
- Empirical results show that CAT variants match dense transformers and outperform sparse and linear baselines on tasks like language modeling and long-context QA, with significant speed and memory gains.
The Compress and Attend Transformer (CAT) paradigm encompasses a family of architectural innovations in Transformer models that integrate explicit sequence compression and/or architectural layer fusion to address the prohibitive compute and memory costs of standard attention, while retaining or even enhancing recall, generalization, and efficiency properties. Variants include the two-stage Compress & Attend Transformer (“CAT”) for adaptive quality/memory control in language modeling, convolution-augmented CAT layers for provable recall and generalization, interpretable contract/broadcast attention mechanisms, and fused compressed decoder architectures for efficient sequence transduction. This article reviews the formal definitions, algorithmic frameworks, computational trade-offs, theoretical guarantees, and empirical results for these Compress and Attend Transformer models, with reference to major papers.
1. Formal Architectures and Conceptual Foundations
1.1 Compress & Attend Transformer (Adaptive Chunks)
The principal CAT instantiation decouples the sequence modeling pipeline into (a) a parallel chunk-wise compressor and (b) a causal decoder operating on compressed chunk representations. Given a sequence $x_{1:T}$:
- The sequence is split into non-overlapping contiguous chunks $c_1, \dots, c_{T/L}$, each of size $L$.
- Each chunk $c_i$ is mapped to a single compressed embedding $z_i = \mathcal{C}(c_i)$ by a bi-directional transformer compressor $\mathcal{C}$.
- A causal transformer decoder $\mathcal{D}$ autoregressively generates each token $x_t$ within chunk $c_i$ conditioned on (1) localized intra-chunk context and (2) all preceding compressed chunk embeddings $z_{<i}$:

$$p(x_t \mid x_{<t}) = \mathcal{D}\big([\,Z_{<i};\ x_{c_i,<t}\,]\big),$$

where $x_{c_i,<t}$ are the token embeddings within the chunk so far, $Z_{<i}$ is the matrix of all previous compressed codes, and $[\,\cdot\,;\,\cdot\,]$ denotes concatenation of raw and compressed contexts (Prakash et al., 7 Nov 2025).
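A minimal PyTorch sketch of this two-stage pipeline follows. It assumes a mean-pooled bidirectional compressor and a single-step decode over the concatenated context; the class, method, and dimension names (`ChunkCATSketch`, `compress`, `decode_step`) are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class ChunkCATSketch(nn.Module):
    """Sketch of the two-stage CAT pipeline: a bidirectional encoder compresses
    each size-L chunk to one code; a causal decoder attends over the current
    chunk's prefix plus all previous chunk codes."""
    def __init__(self, d_model=256, n_heads=4, chunk_size=8):
        super().__init__()
        self.L = chunk_size
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.compressor = nn.TransformerEncoder(enc_layer, num_layers=2)   # bi-directional
        self.decoder_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def compress(self, x):                                  # x: (B, T, d), T divisible by L
        B, T, d = x.shape
        chunks = x.view(B * T // self.L, self.L, d)         # group consecutive L-token chunks
        pooled = self.compressor(chunks).mean(dim=1)        # one compressed code per chunk
        return pooled.view(B, T // self.L, d)               # z: (B, T/L, d)

    def decode_step(self, intra_chunk_tokens, prev_codes):
        # Keys/values = prior chunk codes concatenated with current-chunk prefix tokens.
        kv = torch.cat([prev_codes, intra_chunk_tokens], dim=1)
        q = intra_chunk_tokens[:, -1:, :]                   # query for the next-token position
        out, _ = self.decoder_attn(q, kv, kv)
        return out                                          # would feed the LM head

# Usage: compress a toy sequence, then take one decode step inside chunk 4.
model = ChunkCATSketch()
x = torch.randn(2, 32, 256)
z = model.compress(x)                                       # (2, 4, 256)
h = model.decode_step(x[:, 24:28, :], z[:, :3, :])          # chunk-4 prefix + codes of chunks 1-3
```

In a full model the decoder would stack such attention blocks and apply the custom causal mask described in Section 2.1.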
1.2 Convolution-Augmented CAT Layers
An orthogonal formulation injects 1-D convolutional filters of width $w$ into the attention pipeline, modifying the queries, keys, and values:

$$q_t = W_Q\,\tilde{x}_t, \qquad k_t = W_K\,\tilde{x}_t, \qquad v_t = W_V\,\tilde{x}_t,$$

with $\tilde{x}_t = \sum_{j=0}^{w-1} h_j\, x_{t-j}$. The transformed tokens are fed into a conventional self-attention mechanism, providing local-context “compression” in the embeddings before global attention (Li et al., 8 Jul 2024).
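As a rough illustration of this recipe, the sketch below applies a causal depthwise 1-D convolution of width $w$ to the token embeddings before a standard self-attention call; the module and parameter names are assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvAugmentedAttention(nn.Module):
    """Sketch: a short causal depthwise 1-D convolution (width w) summarizes
    local context before conventional self-attention runs on the result."""
    def __init__(self, d_model=256, n_heads=4, filter_width=4):
        super().__init__()
        self.w = filter_width
        # Depthwise conv: one length-w filter per embedding channel.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=filter_width, groups=d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):                                   # x: (B, T, d)
        # Left-pad so position t only summarizes x_{t-w+1..t} (causal locality).
        xc = F.pad(x.transpose(1, 2), (self.w - 1, 0))
        x_tilde = self.conv(xc).transpose(1, 2)             # (B, T, d) convolved summaries
        out, _ = self.attn(x_tilde, x_tilde, x_tilde)       # global attention on summaries
        return out

y = ConvAugmentedAttention()(torch.randn(2, 16, 256))       # (2, 16, 256)
```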
1.3 Contract-and-Broadcast Attention
Contract-and-Broadcast Self-Attention (CBSA) exploits a maximal coding rate reduction objective to compress all tokens into low-dimensional structures: for each head, it contracts the sequence onto a small set of representative tokens extracted by softmax pooling, self-attends among these representatives, and then broadcasts the result back to the full sequence via the contraction coefficients, yielding a global approximate attention update (Wen et al., 21 Sep 2025).
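The following sketch shows one plausible reading of this contract-and-broadcast computation: learned softmax pooling contracts $T$ tokens onto $m$ representatives, self-attention runs only among the representatives, and the pooling weights broadcast the update back. It is a simplified interpretation, not the CBSA reference code.

```python
import torch
import torch.nn as nn

class ContractBroadcastAttention(nn.Module):
    """Sketch of contract-and-broadcast attention with m representatives."""
    def __init__(self, d_model=256, n_heads=4, n_repr=8):
        super().__init__()
        self.pool = nn.Linear(d_model, n_repr)              # token -> representative scores
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):                                   # x: (B, T, d)
        a = torch.softmax(self.pool(x), dim=1)              # (B, T, m): contraction weights
        reps = torch.einsum('btm,btd->bmd', a, x)           # contract: m representative tokens
        reps, _ = self.attn(reps, reps, reps)               # attend among representatives only
        update = torch.einsum('btm,bmd->btd', a, reps)      # broadcast back to all T tokens
        return x + update                                   # residual global update, O(T*m) cost

y = ContractBroadcastAttention()(torch.randn(2, 64, 256))   # (2, 64, 256)
```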
1.4 Compressed Decoders by Sublayer Fusion
The Compressed Attention Network approach merges the conventional residual self-attention, encoder-decoder cross-attention, and feedforward layers of the Transformer decoder into a single compressed block via matrix fusion, under assumptions on input similarity, yielding a single residual path and reducing per-layer latency (Li et al., 2021).
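A schematic sketch of such a fused block is given below, under the simplest possible fusion assumption: one attention call whose keys/values concatenate decoder states and encoder memory, followed by the feed-forward sublayer under a single residual path. Causal masking and the paper's exact fusion algebra are omitted, and the names are illustrative.

```python
import torch
import torch.nn as nn

class CompressedDecoderBlock(nn.Module):
    """Sketch in the spirit of CAN: self-attention and encoder-decoder
    cross-attention share one attention call, then a feed-forward sublayer,
    all under a single residual connection."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm = nn.LayerNorm(d_model)

    def forward(self, dec, enc_mem):                        # dec: (B, Td, d), enc_mem: (B, Te, d)
        kv = torch.cat([dec, enc_mem], dim=1)               # fused self + cross context
        h, _ = self.attn(dec, kv, kv)                       # (causal masking omitted for brevity)
        return self.norm(dec + self.ffn(h))                 # single residual path

out = CompressedDecoderBlock()(torch.randn(2, 10, 256), torch.randn(2, 20, 256))
```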
2. Mathematical Analysis and Theoretical Properties
2.1 Formal Definitions
For the two-stage chunked CAT (Prakash et al., 7 Nov 2025):
- The decoder’s conditional distribution factorizes per chunk as:

$$p(x_{1:T}) = \prod_{i=1}^{T/L} \prod_{t \in c_i} p\big(x_t \mid x_{c_i,<t},\, z_{<i}\big).$$

- Decoder attention keys/values are a concatenation of current-chunk tokens and the compressed representations of all prior chunks, using a custom causal mask to enforce correct autoregressive dependencies; a sketch of such a mask follows below.
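A minimal construction of this kind of mask, under the assumed key/value layout $[z_1, \dots, z_{i-1}, x_1, \dots, x_L]$ (the layout is an illustrative assumption, not necessarily the paper's exact implementation):

```python
import torch

def cat_decoder_mask(chunk_size: int, n_prev_chunks: int) -> torch.Tensor:
    """Attention mask for one chunk of a chunked-CAT decoder step.
    Each intra-chunk query may see all previous chunk codes plus intra-chunk
    tokens up to and including its own position (True = blocked, matching the
    bool-mask convention of nn.MultiheadAttention)."""
    L, m = chunk_size, n_prev_chunks
    mask = torch.zeros(L, m + L, dtype=torch.bool)           # start fully visible
    causal = torch.triu(torch.ones(L, L), diagonal=1).bool() # strictly-future positions
    mask[:, m:] = causal                                      # block future intra-chunk tokens
    return mask                                               # prior chunk codes always visible

print(cat_decoder_mask(chunk_size=4, n_prev_chunks=2).int())
```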
For convolutional CAT (Li et al., 8 Jul 2024):
- The convolution acts as a context-wise summary:

$$\tilde{x}_t = \sum_{j=0}^{w-1} h_j\, x_{t-j},$$

with the filter width $w$ parameterizing the degree of locality.
Landmark and block-based versions subsample or cluster convolved summaries to further reduce effective sequence length.
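A toy sketch of the landmark-style reduction, keeping one convolved summary per block of size $B$ (an illustrative simplification of the subsampling described above):

```python
import torch

def landmark_subsample(x_tilde: torch.Tensor, block_size: int) -> torch.Tensor:
    """Keep the last convolved summary of each size-B block as the key/value set,
    shrinking the effective attention length from T to roughly T/B."""
    B = block_size
    return x_tilde[:, B - 1::B, :]                            # one landmark per block

landmarks = landmark_subsample(torch.randn(2, 64, 256), block_size=8)   # (2, 8, 256)
```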
2.2 Computational Complexity
| Model Variant | Training Compute | Inference Compute | KV Memory |
|---|---|---|---|
| Standard Transformer | $O(T^2)$ | $O(T)$ per token | $O(T)$ |
| CAT (chunks of size $L$) | $O(T(L + T/L))$ | $O(L + T/L)$ per token | $O(L + T/L)$ |
| Landmark CAT | sub-quadratic in $T$ | $O(B + T/B)$ per query | $O(T/B)$, depends on $B$ |
| CBSA ($m$ representatives) | $O(Tm)$ | $O(Tm)$ per layer | $O(m)$ per layer |
| CAN (fused decoder) | comparable to baseline | reduced per-layer decoder latency (single fused attention over decoder and encoder tokens) | comparable to baseline |
The chunk size $L$, landmark block size $B$, and number of representatives $m$ allow explicit control of the accuracy–efficiency trade-off.
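The trade-off can be made concrete with simple arithmetic; the helper below tallies the per-query attention span and KV-cache entries implied by the chunked-CAT accounting in the table (an approximation for intuition, not a benchmark).

```python
def cat_costs(T: int, L: int) -> dict:
    """Back-of-the-envelope costs for chunked CAT: each query sees at most
    L intra-chunk tokens plus T/L compressed chunk codes."""
    span = L + T // L                       # keys/values visible to one query
    return {
        "dense_span": T,                    # standard attention: every past token
        "cat_span": span,
        "dense_kv_entries": T,
        "cat_kv_entries": T // L + L,
    }

for L in (4, 8, 16, 32):
    print(L, cat_costs(T=4096, L=L))
```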
3. Theoretical Guarantees and Proof Underpinnings
CAT with convolution-augmented attention provably solves universal data association and copying benchmarks:
- A single-layer CAT with appropriate filters solves $n$-gram associative recall (AR), given embeddings of sufficient dimension and generic convolution filters [(Li et al., 8 Jul 2024), Thm. 1].
- The model supports length generalization: a CAT that solves AR up to the training length with bounded error generalizes to longer sequences with controlled error using the same parameters [(Li et al., 8 Jul 2024), Thm. 2].
- Selective Copy tasks are solvable in one layer by infinite-support SSM filters or short filters plus positional encoding [(Li et al., 8 Jul 2024), Thm. 3].
In CBSA, the contraction-broadcast objective generates a gradient-based attention operator that subsumes full softmax, linear, kernelized, and channel attention as special cases, providing a white-box mechanism. For an adequate number of representatives $m$, CBSA achieves comparable expressivity and accuracy at a cost that scales with $Tm$ rather than $T^2$ (Wen et al., 21 Sep 2025).
4. Empirical Performance and Comparative Analysis
4.1 Language Modeling and Recall
Compressed chunked CAT models match dense-transformer perplexity on WikiText-103 and FineWeb, outperforming sparse and linear mixer baselines. On in-context recall (SWDE, FDA, 2K context), CAT with a small chunk size achieves 47% versus 32% for the dense and 31% for the hybrid baseline, far surpassing sparse and linear mixers (9–13%). A single adaptive CAT trained across multiple chunk sizes spans all of these operating points and beats all baselines at the same compute/memory trade-off (Prakash et al., 7 Nov 2025).
On long-context QA (LongBench, up to 4K), CAT variants outscore all efficient and dense baselines. In synthetic “needle-in-haystack” state tracking, CAT preserves >90% accuracy where others degrade.
4.2 Throughput, Memory, and Scalability
On 4K-token generation, CAT-4 yields a 1.4× speedup and a 2.2× reduction in KV cache, while CAT-32 achieves a 3.2× speedup and a 9.5× memory reduction, with only modest drops in in-context recall. Scaling to 1B parameters (and above) preserves Chinchilla-scaling laws as in dense models (Prakash et al., 7 Nov 2025).
CBSA, evaluated on ImageNet-1K and ADE20k, is highly competitive: CBT-Small achieves 71.4% top-1 on ImageNet at 30% of ViT-Small’s parameters and 40% of its FLOPs; in semantic segmentation it outperforms Segmenter at one-fifth of the decoder compute (Wen et al., 21 Sep 2025).
Compressed Attention Network (CAN) decoders on WMT14 En–De, En–Fr, and other translation benchmarks achieve BLEU equivalent to balanced baselines with a 1.42×–2.8× decoding speedup (Li et al., 2021).
5. Deployment, Adaptivity, and Practical Recommendations
The memory footprint of CAT at large context can be reduced by 4×–9.5× relative to dense attention by increasing the chunk size $L$ (e.g., from 670 GB to 160 GB for a 14B-parameter model at maximum context length; see (Prakash et al., 7 Nov 2025)). The model can be made adaptive at test time using a chunk-size indicator token without retraining, obviating the need for multiple specialized checkpoints.
Guidelines:
- For maximal recall (QA, code, state tracking): prefer a small chunk size $L$ (4–8).
- For latency and batch throughput: prefer a large chunk size $L$ (16–32).
- The chunk size $L$ and the number of representatives $m$ directly interpolate between recall and resource utilization; a toy selection heuristic is sketched below.
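Purely as an illustration of these guidelines, a trivial chooser (thresholds and task labels are hypothetical):

```python
def pick_chunk_size(task: str) -> int:
    """Toy heuristic encoding the guidelines above: small chunks favor
    recall-heavy workloads, large chunks favor throughput."""
    recall_heavy = {"qa", "code", "state_tracking"}
    return 4 if task in recall_heavy else 32

print(pick_chunk_size("qa"), pick_chunk_size("batch_generation"))   # 4 32
```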
6. Architectural Comparisons and Limitations
In direct comparison:
| Method | Free Access to Past | Flexible Memory | Parallel Training | Efficient Inference | Adaptive at Test |
|---|---|---|---|---|---|
| Dense Transformer | ✓ | ✗ | ✓ | ✗ | ✗ |
| Sparse/Sliding-Window | ✗ | ✗ | ✓ | ✓ | ✗ |
| Linear attention/SSMs | ✓ | ✗ | ✓ | ✓ | ✗ |
| Recurrent compression | ✓ | ✓ | ✗ | ✓ | ✗ |
| Block/MegaByte | ✓ | ✗ | ✓ | ✓ | ✗ |
| CAT (chunks) | ✓ | ✓ | ✓ | ✓ | ✓ |
Sparse masking and linear/SSM methods either restrict recall and contextual access or require hybridization; recurrent compressors are slow and hard to train; block transformers bottleneck context through a single global token. Compress-and-attend approaches offer both adaptive quality control and competitive or superior recall and throughput.
A small quality drop may occur under heavy compression or aggressive fusion, but it is minor (a small BLEU gap for CAN, closed with distillation) (Li et al., 2021), or correctable with light fine-tuning (large-chunk CAT) (Prakash et al., 7 Nov 2025).
7. Future Directions and Significance
The compress-and-attend principle generalizes several efficient Transformer variants, offering a single unified mechanism for controlling compute/memory/quality tradeoffs without architectural hybrids or hand-crafted masks. The connection to provable associative mechanisms, white-box attention design, and scaling to vision, language, and sequence domains positions this approach as a foundational direction for the next generation of efficient foundation models (Li et al., 8 Jul 2024, Wen et al., 21 Sep 2025, Prakash et al., 7 Nov 2025, Li et al., 2021).
References: (Li et al., 8 Jul 2024, Prakash et al., 7 Nov 2025, Wen et al., 21 Sep 2025, Li et al., 2021)