Low-rank attention via weight decay and overfitting

Determine whether, in transformer language-model pretraining with AdamW, weight decay that pushes the query–key product matrix W_{QK} toward a lower-rank configuration prevents the model from overfitting to high-dimensional noise in the pretraining data distribution.

Background

The paper investigates how the weight decay hyperparameter used during pretraining affects LLM plasticity and proposes mechanistic explanations. One observed effect is that larger weight decay values tend to reduce the rank of attention matrices, particularly the query–key product W_{QK}.
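The rank reduction described in this paragraph can be quantified with an entropy-based effective-rank measure over the singular-value spectrum (Roy & Vetterli, 2007). The sketch below is an illustration, not the paper's measurement code: the matrix shapes and the `effective_rank` helper are assumptions chosen for the example, and W_QK is built from random Gaussian factors simply to show that its rank is bounded by the head dimension.

```python
import numpy as np

def effective_rank(m: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank: exp of the Shannon entropy of the
    normalized singular-value distribution (Roy & Vetterli, 2007)."""
    s = np.linalg.svd(m, compute_uv=False)
    p = s / (s.sum() + eps)
    return float(np.exp(-np.sum(p * np.log(p + eps))))

rng = np.random.default_rng(0)
d_model, d_head = 64, 16  # assumed toy dimensions, not from the paper

# W_QK = W_Q @ W_K.T has rank at most d_head by construction.
w_q = rng.normal(size=(d_model, d_head))
w_k = rng.normal(size=(d_model, d_head))
w_qk = w_q @ w_k.T

# A rank-2 product for comparison: its effective rank is at most ~2.
w_qk_low = rng.normal(size=(d_model, 2)) @ rng.normal(size=(2, d_model))

print(effective_rank(w_qk))      # bounded above by d_head
print(effective_rank(w_qk_low))  # much smaller
```

The same measure can be applied to checkpoints trained at different weight-decay settings to track how the spectrum of W_QK concentrates as decay increases.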

Building on this observation and prior results linking low-rank constraints to simpler, more robust hypotheses, the authors articulate a conjecture that connects weight-decay-induced low-rank attention to reduced overfitting. The hypothesis is offered to explain why larger weight decay during pretraining can improve the model's adaptability in subsequent fine-tuning.
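The low-rank-versus-noise intuition behind the conjecture can be made concrete through a standard proxy: the minimum of ||W_Q||_F^2 + ||W_K||_F^2 over factorizations W_QK = W_Q W_K^T equals twice the nuclear norm of W_QK (Srebro et al.), so Frobenius penalties on the factors act like nuclear-norm regularization on the product, whose reconstruction problem is solved by soft-thresholding singular values. The sketch below is an illustration of that proxy, not the paper's experiment; the matrix sizes, signal rank, and noise scale are arbitrary assumptions. It shows a rank-2 signal surviving the threshold while dense high-dimensional noise is zeroed out.

```python
import numpy as np

def svd_soft_threshold(y: np.ndarray, lam: float) -> np.ndarray:
    """Minimizer of 0.5*||W - Y||_F^2 + lam*||W||_*:
    shrink each singular value of Y by lam, clipping at zero."""
    u, s, vt = np.linalg.svd(y, full_matrices=False)
    return u @ np.diag(np.maximum(s - lam, 0.0)) @ vt

rng = np.random.default_rng(0)
# Rank-2 "signal" plus dense "noise" (toy stand-in for noisy pretraining data).
signal = 3.0 * rng.normal(size=(32, 2)) @ rng.normal(size=(2, 32))
noise = 0.1 * rng.normal(size=(32, 32))
y = signal + noise

# Larger penalties keep only the dominant (signal) directions.
for lam in (0.0, 0.5, 2.0):
    w = svd_soft_threshold(y, lam)
    print(f"lam={lam:3.1f}  rank={int(np.linalg.matrix_rank(w, tol=1e-8))}")
```

With no penalty the fit retains all 32 noise directions; at a threshold above the noise's singular values only the rank-2 signal survives, which is the sense in which a low-rank bias "ignores" high-dimensional noise.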

References

We conjecture that by encouraging $W_{QK}$ toward a lower-rank configuration, weight decay may prevent the model from overfitting to high-dimensional noise in the pretraining distribution.

Weight Decay Improves Language Model Plasticity  (2602.11137 - Han et al., 11 Feb 2026) in Section 4.2, Low-rank structure as a driver of adaptability