Scaling behavior of gradient-boosted attention improvements

Determine whether the performance improvement achieved by gradient-boosted attention over standard attention persists when training transformer models at larger scales, specifically in the 100M–1B parameter regime commonly used to evaluate modern attention variants.

Background

The paper introduces gradient-boosted attention, a two-round attention mechanism in which a second pass attends to the residual of the first pass using separate learned projections, with a learned gate controlling the second pass's contribution; the authors show that this within-layer procedure corresponds to gradient boosting under a squared reconstruction objective. On a 10M-token subset of WikiText-103 with small models (7–9M parameters), the method reduces test perplexity relative to standard attention, Twicing Attention, and a parameter-matched wider baseline, with most of the gain captured by two rounds.
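The two-round procedure can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the residual is taken as the input minus the first-round output, the gate is shown as a fixed scalar rather than a learned parameter, and single-head unscaled projections are used for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, Wk, Wv):
    # Standard single-head scaled dot-product attention.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

def gradient_boosted_attention(x, params1, params2, gate):
    # Round 1: ordinary attention over the input.
    y1 = attention(x, *params1)
    # Residual under a squared reconstruction objective:
    # what round 1 failed to reconstruct (an assumed formulation).
    r = x - y1
    # Round 2: attend to the residual with *separate* learned projections.
    y2 = attention(r, *params2)
    # Gated additive correction; in the paper the gate is learned,
    # here it is a plain scalar for illustration.
    return y1 + gate * y2
```

With `gate = 0` the mechanism reduces exactly to the first-round standard attention, which is why the comparison against a standard-attention baseline is well posed; the boosting view treats the second round as one functional-gradient step on the squared residual.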

In the discussion, the authors note that all experiments are at small scale and explicitly state that it is not yet established whether the observed improvement persists at larger model scales (100M–1B parameters), which is the regime where many modern attention variants are typically evaluated. Establishing this scaling behavior is highlighted as a central open question for future work.

References

While the param-fair comparison controls for capacity, we have not yet established whether the improvement persists at the 100M--1B scale where most modern attention variants are evaluated \citep{ye2025differential}.

Gradient Boosting within a Single Attention Layer  (2604.03190 - Sargolzaei, 3 Apr 2026) in Limitations paragraph, Section Discussion