Lower bound on CAT compressor depth

Determine the minimal transformer depth (a lower bound) at which the CAT compressor still produces chunk representations that preserve downstream performance, and in particular whether a single-layer compressor suffices to compress chunks of tokens accurately.

Background

Empirical ablations show little difference in perplexity between compressor depths of 3 and 6, suggesting shallow compressors may suffice.

The authors explicitly pose whether compression can be achieved with a one-layer compressor and note the possibility of a lower bound on compressor depth, leaving the exact bound and its implications for performance open.
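One way to make "preserves downstream performance" operational is to pick the shallowest compressor whose perplexity stays within a chosen tolerance of the deepest one tested. The sketch below assumes exactly that criterion. The function name, the `rel_tol` parameter, and the numbers in the usage example are illustrative assumptions, not taken from the paper.

```python
from typing import Mapping


def minimal_sufficient_depth(ppl_by_depth: Mapping[int, float],
                             rel_tol: float = 0.01) -> int:
    """Return the shallowest depth whose perplexity is within rel_tol
    (relative) of the perplexity of the deepest compressor measured.

    Assumption: the deepest compressor in the sweep is the reference point.
    """
    if not ppl_by_depth:
        raise ValueError("need at least one (depth, perplexity) measurement")
    reference = ppl_by_depth[max(ppl_by_depth)]
    threshold = reference * (1.0 + rel_tol)
    # The deepest depth always satisfies the threshold, so min() is never empty.
    return min(d for d, ppl in ppl_by_depth.items() if ppl <= threshold)


# Placeholder numbers for illustration only; these are NOT results from the paper.
example_sweep = {1: 30.0, 2: 21.0, 3: 20.1, 6: 20.0}
print(minimal_sufficient_depth(example_sweep, rel_tol=0.01))
```

The answer depends on the tolerance and on which depths were actually trained, so a result from a sweep like this is an empirical estimate rather than a proven lower bound.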

References

However, what is the limit, and can one go to even a 1 layer of compressor is an interesting question to ask. There might be some lower bound on the compressor depth to start compressing chunks of tokens, but we leave this to future work.

— Attention and Compression is all you need for Controllably Efficient Language Models (2511.05313 - Prakash et al., 7 Nov 2025) in Appendix, Section "Ablation on depth of the compressor"
