Scaling behavior of gradient-boosted attention improvements
Determine whether the performance improvement achieved by gradient-boosted attention over standard attention persists when training transformer models at larger scales, specifically in the 100M–1B parameter regime commonly used to evaluate modern attention variants.
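Since the source paper is only cited here, the sketch below illustrates one plausible reading of "gradient boosting within a single attention layer": several attention stages inside one layer, each adding a shrinkage-scaled correction to a running output, together with the parameter count check a param-fair comparison at the 100M–1B scale would need. The class name BoostedAttention, the n_stages and shrinkage parameters, and the residual update rule are all illustrative assumptions, not the paper's design.

```python
# Hedged sketch of boosting-style attention, NOT the cited paper's
# implementation. Each "stage" adds a damped attention correction to a
# running output, analogous to boosting rounds fitting residuals.
import torch
import torch.nn as nn


class BoostedAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int,
                 n_stages: int = 3, shrinkage: float = 0.3):
        super().__init__()
        # One attention block per boosting stage (assumed design).
        self.stages = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_stages)
        )
        self.shrinkage = shrinkage

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.zeros_like(x)
        for attn in self.stages:
            # The residual so far serves as the query over the original
            # input; the update is shrinkage-scaled before accumulation
            # (an assumed update rule, not taken from the paper).
            residual = x - out
            update, _ = attn(residual, x, x, need_weights=False)
            out = out + self.shrinkage * update
        return out


def count_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())


if __name__ == "__main__":
    d_model, n_heads = 512, 8
    boosted = BoostedAttention(d_model, n_heads, n_stages=3)
    baseline = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
    # A param-fair comparison would equalize these counts, e.g. by
    # widening the baseline or reducing n_stages, before scaling the
    # study to the 100M-1B regime.
    print(count_params(boosted), count_params(baseline))
    x = torch.randn(2, 16, d_model)
    print(boosted(x).shape)  # torch.Size([2, 16, 512])
```

In a full scaling study, total parameter counts of the boosted and baseline models would be matched at each model size before comparing training losses across the 100M–1B range.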
References
While the param-fair comparison controls for capacity, we have not yet established whether the improvement persists at the 100M--1B scale where most modern attention variants are evaluated \citep{ye2025differential}.
— Gradient Boosting within a Single Attention Layer
(arXiv:2604.03190, Sargolzaei, 3 Apr 2026), in the Limitations paragraph of the Discussion section