Efficient attention kernel for CAT training
Develop an efficient self-attention kernel for training the Compress and Attend Transformer (CAT) that supports the custom decoder attention mask in which each token in chunk c_i attends only to preceding tokens within c_i and to the compressed representations of past chunks f_θ(c_{i−1}), …, f_θ(c_1). The objective is to enable scalable training with reduced attention compute and improved wall-clock throughput relative to standard dense-transformer attention kernels.
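A minimal sketch of the required mask pattern, expressed with PyTorch's FlexAttention block-mask API (one existing route to a fused block-sparse attention kernel). The chunk length L, the number of compressed slots per chunk M, and the [compressed slots | raw tokens] key/value layout are illustrative assumptions, not details taken from the paper:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# Assumed layout (hypothetical, for illustration only):
#   - N raw tokens split into chunks of length L (C = N // L chunks)
#   - each past chunk compressed by f_theta into M summary slots
#   - keys/values laid out as [C * M compressed slots][N raw tokens]
#   - queries are the N raw tokens only
L = 128          # chunk length (assumed)
M = 16           # compressed slots per chunk (assumed)
N = 1024         # raw sequence length
C = N // L       # number of chunks
KV_LEN = C * M + N

def cat_mask(b, h, q_idx, kv_idx):
    q_chunk = q_idx // L                 # chunk containing the query token
    is_compressed = kv_idx < C * M       # KV index falls in the compressed region
    kv_chunk_c = kv_idx // M             # chunk summarized by this compressed slot
    raw_idx = kv_idx - C * M             # position of a raw KV token
    kv_chunk_r = raw_idx // L            # chunk containing the raw KV token
    # Attend to compressed reps of strictly earlier chunks,
    # or causally to raw tokens within the query's own chunk.
    attend_compressed = is_compressed & (kv_chunk_c < q_chunk)
    attend_raw = (~is_compressed) & (kv_chunk_r == q_chunk) & (raw_idx <= q_idx)
    return attend_compressed | attend_raw

# The block mask records which (query, KV) tiles are fully or partially active,
# so fully masked-out tiles are skipped by the kernel.
block_mask = create_block_mask(cat_mask, B=None, H=None, Q_LEN=N, KV_LEN=KV_LEN)

B, H, D = 2, 8, 64
q = torch.randn(B, H, N, D, device="cuda", dtype=torch.float16)
k = torch.randn(B, H, KV_LEN, D, device="cuda", dtype=torch.float16)
v = torch.randn(B, H, KV_LEN, D, device="cuda", dtype=torch.float16)

# Wrapping flex_attention in torch.compile fuses the mask into a block-sparse kernel.
out = flex_attention(q, k, v, block_mask=block_mask)
```

Because raw tokens only attend within their own chunk plus a small number of compressed slots, most query/KV tiles are inactive, which is where the compute savings over a dense causal kernel would come from.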
References
Developing an efficient attention kernel for training CAT is left as future work.
— "Attention and Compression is all you need for Controllably Efficient Language Models" (arXiv:2511.05313, Prakash et al., 7 Nov 2025), Appendix, Section "Training throughput analysis" (app:cat_training_throughput)