Effect of Coupled Query–Key Dynamics at Billion-Parameter Scale

Determine whether incorporating coupled query–key dynamics—jointly evolving query and key representations through shared learned dynamics before scaled dot-product attention scoring—improves or degrades performance when applied to transformer language models at billion-parameter scale, beyond the 60M–350M parameter range evaluated in this study.

Background

The paper introduces coupled query–key (QK) dynamics, where queries and keys co-evolve via shared learned dynamics before scoring, and demonstrates consistent perplexity improvements over standard attention at 60M and 150M parameters, with a smaller gain at 350M.
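The mechanism can be sketched in code. The sketch below is a minimal, hypothetical instantiation, not the paper's implementation: it assumes the shared learned dynamics take the form of a few Euler steps of a tanh-linear update with a single weight matrix `W_shared` applied identically to queries and keys before standard scaled dot-product scoring. The paper's actual parameterization of the dynamics may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def coupled_qk_attention(Q, K, V, W_shared, steps=2, dt=0.1):
    """Scaled dot-product attention with coupled query-key dynamics.

    Before scoring, Q and K co-evolve under the SAME learned dynamics
    (here: Euler steps of a tanh-linear ODE, an illustrative choice).
    """
    for _ in range(steps):
        Q = Q + dt * np.tanh(Q @ W_shared)
        K = K + dt * np.tanh(K @ W_shared)
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # standard scaled dot-product scoring
    return softmax(scores, axis=-1) @ V
```

Note that with `W_shared = 0` the dynamics are the identity and the function reduces exactly to standard attention, which makes the baseline comparison in the study well-defined.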

While the method shows corpus-dependent benefits—performing well on domain-coherent corpora like WikiText-103 and PubMed but degrading on heterogeneous web text—the evaluation is limited to models up to 350M parameters. The authors explicitly note that it remains unknown whether these benefits persist or reverse at billion-parameter scale, identifying a key open question about scalability.

References

Our evaluation spans 60M–350M parameters; whether coupled dynamics helps or hurts at billion-parameter scale remains open.

Coupled Query-Key Dynamics for Attention (2604.01683 - Gahtan et al., 2 Apr 2026) in Conclusion, Limitations