Low-rank attention via weight decay and overfitting
Determine whether, in transformer language-model pretraining with AdamW, weight decay prevents overfitting to high-dimensional noise in the pretraining data distribution by encouraging the query–key product matrix $W_{QK}$ toward a lower-rank configuration.
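To make the question concrete, the following is a minimal sketch (not the paper's experimental setup) of one way it could be probed: train $W_Q$ and $W_K$ with AdamW's decoupled weight decay on a toy objective that fits a high-dimensional noise target, and monitor the effective rank of the product $W_{QK} = W_Q W_K^\top$. The dimensions, toy loss, and `effective_rank` helper are all illustrative assumptions.

```python
# Minimal sketch: does AdamW weight decay on W_Q and W_K drive the
# product W_QK = W_Q @ W_K.T toward low effective rank while fitting
# a high-dimensional noise target? All hyperparameters are illustrative.
import torch

torch.manual_seed(0)
d_model, d_head = 64, 16
W_Q = torch.nn.Parameter(torch.randn(d_model, d_head) / d_model**0.5)
W_K = torch.nn.Parameter(torch.randn(d_model, d_head) / d_model**0.5)

# Decoupled weight decay is applied directly to W_Q and W_K; the
# conjecture is that this implicitly regularizes the product W_QK.
opt = torch.optim.AdamW([W_Q, W_K], lr=1e-3, weight_decay=0.1)

def effective_rank(M: torch.Tensor) -> float:
    """Exponentiated entropy of the normalized singular values: a smooth
    proxy for rank that discounts many near-zero singular values."""
    s = torch.linalg.svdvals(M)
    p = s / s.sum()
    return float(torch.exp(-(p * torch.log(p.clamp_min(1e-12))).sum()))

for step in range(2000):
    x = torch.randn(256, d_model)          # stand-in token activations
    target = torch.randn(256, 256)         # high-dimensional noise target
    scores = (x @ W_Q) @ (x @ W_K).T       # bilinear form x W_QK x^T
    loss = torch.nn.functional.mse_loss(scores, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 500 == 0:
        with torch.no_grad():
            print(step, effective_rank(W_Q @ W_K.T))
```

Comparing runs with `weight_decay=0.0` against `weight_decay>0.0` would then indicate whether decay both lowers the effective rank of $W_{QK}$ and reduces how well the model fits the noise target, which is the causal link the conjecture posits.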
References
We conjecture that by encouraging $W_{QK}$ toward a lower-rank configuration, weight decay may prevent the model from overfitting to high-dimensional noise in the pretraining distribution.
— Weight Decay Improves Language Model Plasticity
(2602.11137, Han et al., 11 Feb 2026), Section 4.2, "Low-rank structure as a driver of adaptability"