
Attention and Compression is all you need for Controllably Efficient Language Models (2511.05313v1)

Published 7 Nov 2025 in cs.LG

Abstract: The quadratic cost of attention in transformers motivated the development of efficient approaches: namely sparse and sliding window attention, convolutions and linear attention. Although these approaches result in impressive reductions in compute and memory, they often trade-off with quality, specifically in-context recall performance. Moreover, apriori fixing this quality-compute tradeoff means being suboptimal from the get-go: some downstream applications require more memory for in-context recall, while others require lower latency and memory. Further, these approaches rely on heuristic choices that artificially restrict attention, or require handcrafted and complex recurrent state update rules, or they must be carefully composed with attention at specific layers to form a hybrid architecture that complicates the design process, especially at scale. To address above issues, we propose Compress & Attend Transformer (CAT), a conceptually simple architecture employing two simple ingredients only: dense attention and compression. CAT decodes chunks of tokens by attending to compressed chunks of the sequence so far. Compression results in decoding from a reduced sequence length that yields compute and memory savings, while choosing a particular chunk size trades-off quality for efficiency. Moreover, CAT can be trained with multiple chunk sizes at once, unlocking control of quality-compute trade-offs directly at test-time without any retraining, all in a single adaptive architecture. In exhaustive evaluations on common language modeling tasks, in-context recall, and long-context understanding, a single adaptive CAT model outperforms existing efficient baselines, including hybrid architectures, across different compute-memory budgets. Further, a single CAT matches dense transformer in language modeling across model scales while being 1.4-3x faster and requiring 2-9x lower total memory usage.

Summary

  • The paper introduces a novel architecture that leverages adaptive chunking and learnable compression to controllably trade memory and compute for quality.
  • It employs training with randomized chunk sizes, allowing a single model to dynamically adjust efficiency-quality trade-offs at test time without retraining.
  • Empirical results show that cat matches dense transformers in language modeling and can exceed them in in-context recall and generation speed, achieving up to 3× higher throughput with 2-9× lower memory usage.

Compress and Attend Transformer (cat): A Controllably Efficient Alternative to Dense Transformers

Introduction and Motivation

The paper presents the Compress and Attend Transformer (cat), an architecture that addresses the quadratic computational and linear memory scaling of self-attention in dense transformers, particularly as applied to long-context LLMs. Unlike prior efficiency-focused architectures—such as sparse attention, sliding window methods, and linear recurrent mixers—which fix the quality-efficiency trade-off and often require heuristic design or complex recurrent update rules, cat leverages two fundamental concepts: dense attention and parallelizable compression. Cat provides a structured yet flexible mechanism to trade memory and compute for quality by chunking the sequence, compressing chunks with a learnable transformer-based compressor, and decoding with a dense transformer decoder that attends to compressed context. This yields adaptive control of efficiency/quality at test time, with a single model capable of spanning the quality-efficiency trade-off spectrum without retraining.

Notably, cat can be trained with multiple chunk sizes in a single run, unlocking post hoc test-time control. Empirical results show that cat achieves strong language modeling and reasoning performance, efficient scaling with model size, and superior in-context recall at memory parity relative to other efficient or hybrid approaches. Intriguingly, in several practical regimes, cat even outperforms dense transformers in recall accuracy while using significantly less memory and offering higher throughput.

Figure 1: cat unlocks test-time control of quality-efficiency trade-offs, where a single adaptive cat model outperforms major efficient architectures on in-context recall across compute-memory budgets.

Architecture and Implementation Details

At the core of cat is a simple but effective abstraction:

  • The token sequence $x = (x_1, x_2, \ldots, x_N)$ is split into $N_C = \lceil N/C \rceil$ chunks of $C$ tokens each.
  • Each chunk $c_i$ is compressed by a dense bidirectional transformer $f_\theta$ into a fixed-size vector $f_\theta(c_i) \in \mathbb{R}^{D_g}$.
  • The decoder $g_\theta$ (a causal dense transformer) autoregressively generates each token in chunk $c_i$ conditioned on the previous tokens in the chunk and the compressed representations of all previous chunks $\{f_\theta(c_j)\}_{j<i}$.

This results in a recurrent, chunkwise context representation, yet all compression and decoding steps can be parallelized in training, sidestepping slow sequential computation present in earlier compressive or recurrent designs. Training employs a next-token prediction loss, with the architecture supporting efficient manipulation of chunk size both during and after training.
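
To make the abstraction concrete, the following is a minimal PyTorch sketch of the chunk-compress-decode flow, written under simplifying assumptions: the module names (ChunkCompressor, CATDecoderSketch) and their internals are illustrative stand-ins for $f_\theta$ and $g_\theta$ rather than the paper's implementation, and the per-chunk loop is kept only for readability, whereas the paper parallelizes these steps via the augmented attention mask described next.

```python
# Minimal sketch of the chunk-compress-decode flow (illustrative only).
# Assumes the sequence length N is a multiple of the chunk size C.
import torch
import torch.nn as nn


class ChunkCompressor(nn.Module):
    """Stand-in for f_theta: a small bidirectional encoder that maps a chunk of
    C token embeddings to a single fixed-size vector via mean pooling."""

    def __init__(self, d_model: int, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, chunks: torch.Tensor) -> torch.Tensor:
        # chunks: (B * N_C, C, D) -> compressed: (B * N_C, D)
        return self.encoder(chunks).mean(dim=1)


class CATDecoderSketch(nn.Module):
    """Stand-in for g_theta: a causal decoder that predicts the tokens of chunk i
    while attending to in-chunk tokens and compressed chunks j < i."""

    def __init__(self, vocab_size: int, d_model: int, chunk_size: int):
        super().__init__()
        self.C = chunk_size
        self.embed = nn.Embedding(vocab_size, d_model)
        self.compressor = ChunkCompressor(d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        B, N = tokens.shape
        C, N_C = self.C, N // self.C
        h = self.embed(tokens)                                     # (B, N, D)
        # Compress every chunk in parallel.
        z = self.compressor(h.reshape(B * N_C, C, -1)).reshape(B, N_C, -1)
        causal = nn.Transformer.generate_square_subsequent_mask(C).to(h.device)
        logits = []
        for i in range(N_C):  # loop kept for readability; CAT parallelizes this
            tgt = h[:, i * C:(i + 1) * C]                          # tokens of chunk i
            # Compressed representations of all previous chunks (dummy if i == 0).
            mem = z[:, :i] if i > 0 else torch.zeros(B, 1, h.size(-1), device=h.device)
            out = self.decoder(tgt, mem, tgt_mask=causal)
            logits.append(self.lm_head(out))
        return torch.cat(logits, dim=1)                            # (B, N, vocab)


# Toy usage: next-token targets would be the input shifted by one position.
model = CATDecoderSketch(vocab_size=1000, d_model=128, chunk_size=16)
out = model(torch.randint(0, 1000, (2, 128)))                      # (2, 128, 1000)
```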

Efficient implementation is achieved via an augmented attention mask that interleaves compressed chunk representations among the tokens and restricts attention as needed, allowing key-value re-use and reducing attention compute from $O(N^2)$ in vanilla transformers to $O(N^2/C)$; this is fully supported in frameworks like PyTorch using custom attention APIs such as FlexAttention. Generation is even more efficient, as past raw tokens can be discarded and only compressed chunk representations are kept in the attention cache, yielding direct memory savings up to the chunking factor $C$.
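
A rough accounting (a sketch of where the scaling comes from, not the paper's derivation): the token at position $n$ belongs to chunk $i = \lceil n/C \rceil$ and attends to at most $i - 1 \approx n/C$ compressed vectors plus at most $C$ in-chunk tokens, so total attention work is on the order of $\sum_{n=1}^{N} (n/C + C) \approx N^2/(2C) + NC$, i.e. $O(N^2/C)$ whenever $N \gg C^2$. Likewise, at generation time the cache holds roughly $N/C$ compressed vectors plus at most $C$ raw-token key-values instead of $N$ key-values, which is the source of the up-to-factor-$C$ memory reduction.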

Figure 2: Sequence length is 128, and the attention mask shows chunk size $C=16$.

Pseudocode for both training and generation is provided in the appendix of the paper, demonstrating that a performant implementation is feasible with standard deep learning libraries, without the need for custom CUDA/Triton kernels.
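
The paper's pseudocode is not reproduced here; below is an independent sketch of how such an augmented mask can be constructed in plain PyTorch, under the illustrative assumption that the $N_C$ compressed chunk slots are laid out after the $N$ raw token positions on the key/value side. The same boolean predicate could in principle serve as a FlexAttention mask_mod for a fused kernel; scaled_dot_product_attention is used here only to keep the example self-contained.

```python
# Illustrative augmented attention mask for a CAT-like layout (not the paper's
# appendix pseudocode). Assumed layout: key/value positions 0..N-1 are raw
# tokens, positions N..N+N_C-1 are compressed chunk slots (slot j summarizes
# chunk j); queries are the N raw token positions.
import torch
import torch.nn.functional as F

N, C = 128, 16                                   # toy sequence length and chunk size
N_C = N // C                                     # number of chunks / compressed slots

q_idx = torch.arange(N).unsqueeze(1)             # (N, 1) query positions
kv_idx = torch.arange(N + N_C).unsqueeze(0)      # (1, N + N_C) key/value positions
q_chunk = q_idx // C
is_token = kv_idx < N

# (1) Causal attention restricted to the query's own chunk of raw tokens.
same_chunk_causal = is_token & (kv_idx // C == q_chunk) & (kv_idx <= q_idx)
# (2) Attention to compressed slots of strictly earlier chunks.
earlier_compressed = ~is_token & ((kv_idx - N) < q_chunk)

mask = same_chunk_causal | earlier_compressed    # (N, N + N_C), True = may attend

# Apply with PyTorch's SDPA; query and key/value lengths differ, so the mask is
# rectangular. Shapes: (batch, heads, seq, head_dim).
B, H, D = 2, 4, 64
q = torch.randn(B, H, N, D)
k = torch.randn(B, H, N + N_C, D)
v = torch.randn(B, H, N + N_C, D)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)   # (B, H, N, D)
```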

Adaptive Chunk Training for Test-Time Control

A critical practical innovation in cat is adaptive training: rather than fixing chunk size, cat can be trained via randomized sampling of chunk sizes (and corresponding indicator tokens signaling chunk size) at each iteration. This results in a single model which can be flexibly deployed at various efficiency-quality points by changing an indicator token and the chunking pattern at test time, with no retraining required. This stands in sharp contrast to prior methods which bind the quality-efficiency trade-off at train time, or require training and maintaining separate models for each operational regime.
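
As a minimal illustration of the idea (not the paper's training recipe), the sketch below samples a chunk size per iteration and passes it to a hypothetical model interface `model(tokens, chunk_size=...)`, which is assumed to prepend the matching indicator token internally.

```python
# Sketch of adaptive (randomized chunk size) training. The model interface
# `model(tokens, chunk_size=...)` is hypothetical and stands in for a CAT
# implementation that handles chunking and the chunk-size indicator token.
import random
import torch
import torch.nn.functional as F

CHUNK_SIZES = [4, 8, 16, 32]      # chunk sizes the single model should support


def train_step(model, optimizer, tokens):
    """One next-token-prediction step with a freshly sampled chunk size."""
    chunk_size = random.choice(CHUNK_SIZES)          # re-sampled every iteration
    logits = model(tokens[:, :-1], chunk_size=chunk_size)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),         # (B * (N-1), vocab)
        tokens[:, 1:].reshape(-1),                   # shifted targets
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), chunk_size


# At test time the same weights are simply run with whichever chunk size fits
# the deployment budget, e.g. chunk_size=32 for maximum efficiency or
# chunk_size=4 for maximum recall (hypothetical usage).
```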

Empirical Evaluation and Results

cat was benchmarked comprehensively against standard and efficient transformer baselines—Dense Transformer, Sparse Transformer, Linear (Mamba2), GatedDeltaNet (GDN), and hybrid architectures—across language modeling, common-sense reasoning, long-context understanding, in-context recall, and synthetic recall/needle-in-haystack tasks.

Key empirical findings:

  • Language Modeling/Reasoning: All chunked cat variants (with chunk sizes $C = 4, 8, 16, 32$) match or outperform baselines on language modeling and reasoning, with cat-32 performing best on short-sequence benchmarks; all cat variants outperform the other efficient baselines on common-sense reasoning (see Table 1 in the paper).
  • In-Context Recall and Long Contexts: cat's memory grows gracefully with sequence length, and it outperforms all efficient and hybrid baselines at similar or lower memory and compute, especially for moderate chunk sizes. At chunk sizes $C = 4, 8$, cat even exceeds the dense transformer in in-context recall while being at least 2× more memory efficient. On long-context tasks, degradation with context length is slower for cat than for all other approaches.
  • Generation Efficiency: cat generation throughput is 1.4-3.2× higher than the dense transformer, with 2.2-9.5× lower total memory usage as chunk size increases, despite a higher total parameter count.

Figure 3: Left: On MQAR, cat outperforms memory-matched baselines at long sequences. Right: cat scaling mirrors dense transformers, but is up to 3× faster and 9× more memory efficient.

  • Scaling Behavior: Cat models scale similarly to dense transformers with respect to model size and training data, showing no adverse effect on loss curves or perplexity.
  • Chunk Size Trade-off: Chunk size acts as a controllable test-time knob; increasing it improves efficiency at the cost of slight recall/perplexity degradation, allowing practitioners to select the trade-off appropriate to their deployment or application context.

Comparison with Related Architectures

cat is architecturally distinct from both block/chunk-based models (e.g., Block Transformer, MegaByte) and memory-efficient recurrent/linear attention mixers:

  • Unlike block models, cat exposes all past chunk representations, not just a single fixed-size "global" compressed state, to the decoder. This removes the fixed-memory bottleneck and preserves recall ability.
  • Unlike prior recurrent compression-based transformers, cat's compression and decoding are parallelizable, enabling scalable pretraining and better hardware utilization.
  • Unlike linear attention and hybrid models, cat's memory is flexible (not fixed) and does not require hand-crafted state update rules, maintenance of raw token caches, or complex layer compositions.
  • cat's efficacy is robust at memory parity, and its efficiency does not depend on avoiding unstructured memory access or on sparsification heuristics.

    Figure 4: Block Transformer architecture. Cat's parallel attention to all previous chunk embeddings eliminates the fixed-memory bottleneck.

Implementation Considerations and Limitations

An efficient implementation of cat's custom attention mask is currently more expensive to compile than native FlashAttention, leading to somewhat increased training time (~2.35× at sequence length 4096), though inference/generation is fully efficient and scalable. Memory efficiency is especially pronounced at high batch sizes and long contexts, which are critical in applications such as code generation, multi-user LLM deployment, and large-scale RL rollouts.

cat requires users to select chunk size at test time, but this can be made adaptive (even data-driven) with further work, potentially unlocking true context- and task-based efficiency routing. Furthermore, cat can be integrated as a composable layer rather than a full architecture, enabling hybrid and modular designs.

A limit remains on how much information can be faithfully encoded in fixed-size chunk representations—as context length and chunk size rise, recall fidelity for extremely long targets may degrade unless the compressor/decoder is sufficiently expressive (which may increase parameters, but not FLOPs per generation step).

Future Directions

cat opens pathways for more adaptive, data-dependent allocation of compute/memory budgets, via techniques such as reinforcement learning-based budgeting or task-aware dynamic chunk sizing. The compressor and decoder modules can, in principle, use any sequence-mixing primitive, not limited to transformers. Integration with Grouped Query or Multi-Query Attention, further hardware-optimized attention kernels, and combination with post-hoc cache sparsification can push the memory and compute savings further. Large-scale pretraining and evaluation on industrial workloads will clarify the full potential and operational advantages, especially for extremely long-context applications.

Conclusion

The Compress and Attend Transformer (cat) provides a conceptually simple yet highly effective architecture for controllably efficient sequence modeling. By chunking and compressing sequences before dense attention-based decoding, cat enables a single model to operate efficiently across a variety of compute/memory regimes, delivering superior recall and matching language modeling quality observed in dense transformers, while drastically lowering resource requirements and increasing generation throughput. Cat thus offers both a new baseline for efficient, long-context LLMs and a flexible toolkit for balancing quality and efficiency at deployment time.
