- The paper demonstrates that multiplicative gating enables attention mechanisms to represent non-flat, curved manifolds, overcoming limitations of affine operators.
- It uses Fisher–Rao geometry to formalize a curvature gap and shows that curvature amplifies with depth and correlates positively with performance on nonlinear tasks.
- Empirical results confirm significant accuracy improvements for gated models on tasks requiring nonlinear decision boundaries, while maintaining comparable performance on linear tasks.
Geometric Expressivity in Attention: The Role of Multiplicative Gating
Introduction
This work analyzes the geometric expressivity inherent to attention mechanisms, focusing on the comparative representational geometry generated by standard (ungated) attention and attention augmented with multiplicative gating. The authors formalize a geometric gap, showing that gating introduces an intrinsic curvature in the representational manifold of attention outputs, which is unattainable by affine, ungated operators. They establish this geometric constraint using tools from information geometry—specifically, by endowing the output space with the Fisher–Rao metric induced by Gaussian decoders, thereby isolating curvature attributable solely to the attention mechanism.
Theoretical Foundations
The main technical thread draws on the Fisher–Rao geometry of statistical manifolds. Attention outputs are modeled as mean parameters of Gaussians with fixed covariance, rendering the induced geometry equivalent to the Euclidean structure up to linear transformations. Thus, nonzero curvature of the representational manifold must stem from the attention architecture itself.
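For concreteness, the standard computation behind this claim runs as follows (generic notation, not tied to the paper's symbols): for a Gaussian family with fixed covariance, the Fisher information with respect to the mean is constant, so the induced metric is flat.

```latex
% Fisher-Rao metric of a Gaussian decoder with fixed covariance \Sigma,
% parametrized by its mean \mu (generic notation, not the paper's):
\[
  p(y \mid \mu) = \mathcal{N}(y;\, \mu, \Sigma),
  \qquad
  g_{ij}(\mu)
  = \mathbb{E}_{y \sim p(\cdot \mid \mu)}\!\left[
      \partial_{\mu_i} \log p(y \mid \mu)\;\partial_{\mu_j} \log p(y \mid \mu)
    \right]
  = \bigl(\Sigma^{-1}\bigr)_{ij}.
\]
% Since g(\mu) = \Sigma^{-1} does not depend on \mu, the geometry is Euclidean
% after the linear change of coordinates \mu \mapsto \Sigma^{-1/2}\mu; any
% curvature of the representational manifold must therefore come from the map
% producing \mu, i.e. from the attention mechanism itself.
```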
For standard attention, outputs are affine combinations of the value vectors U_i: for input X, Y(X) = Σ_{i=1}^{n} α_i(X) U_i, where the attention weights satisfy α(X) ∈ Δ^{n−1} (the probability simplex). The image lies within an affine subspace, and a core result is that such affine statistical manifolds are always intrinsically flat. This restriction reveals the geometric underpinning of the low-rank bottleneck widely discussed in the attention literature, now interpreted as a limitation to flat geometry rather than merely an algebraic feature (see also [qiu2025gated]).
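As a minimal illustration of this affine structure (a numpy sketch with my own variable names, not the paper's code), each output row of standard attention is a convex combination of the value vectors and therefore never leaves their affine span:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ungated_attention(Q, K, U):
    """Standard (ungated) attention: each output row is a convex combination
    of the value vectors U_i, hence lies in their affine span (a flat set)."""
    d = Q.shape[-1]
    alpha = softmax(Q @ K.T / np.sqrt(d))   # rows of alpha lie on the simplex Δ^{n-1}
    return alpha @ U                        # affine (indeed convex) combination of value rows

# The attention weights sum to one, confirming membership in the simplex:
rng = np.random.default_rng(0)
Q, K, U = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
alpha = softmax(Q @ K.T / np.sqrt(8))
assert np.allclose(alpha.sum(axis=-1), 1.0)
```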
Figure 1: Geometric intuition for curvature generation in attention. Left: Ungated attention produces affine (flat) representations; right: Multiplicative gating enables non-affine, curved representations.
In contrast, multiplicative gating modulates the output elementwise, breaking the affine structure and generically inducing non-flat statistical manifolds. The authors demonstrate that for almost all smooth gating functions, the resulting mapping exhibits nontrivial second derivatives, giving rise to nonzero curvature. They provide constructive proofs, including explicit realizations where the gated model parametrizes a spherical manifold with constant positive curvature, impossible to attain via ungated attention. This holds not just for artificial constructions but within standard content-aware transformer blocks.
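A minimal sketch of one such multiplicative gate is given below: an elementwise sigmoid gate applied to the attention output. This is an illustrative parameterization with weight names of my choosing, not necessarily the one used in the paper; the point is that the Hadamard product between a content-dependent gate and the (affine) attention output breaks the affine structure.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention(X, Wq, Wk, Wv, Wg):
    """Multiplicatively gated attention: the ungated output (affine in the
    value vectors) is modulated elementwise by a content-dependent gate."""
    Q, K, U = X @ Wq, X @ Wk, X @ Wv
    alpha = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    attn_out = alpha @ U                       # affine combination of value rows
    gate = 1.0 / (1.0 + np.exp(-(X @ Wg)))     # elementwise sigmoid gate g(X) in (0, 1)
    return gate * attn_out                     # multiplicative (elementwise) gating
```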
Curvature as a Measure of Expressivity
The curvature introduced by gating captures geometric expressivity beyond the algebraic rank of transformations. The theoretical analysis further establishes the stability of this curvature gap: generic perturbations to the gating function preserve non-flatness, and in the neighborhood of any "curved" gating configuration, a substantial region of parameter space realizes strictly positive curvature.
Moreover, the geometric effect of gating is shown to amplify under depth. In architectures permitting alignment of contributions from multiple gated layers, the cumulative curvature of the composed representational manifold grows quadratically with the number of layers under suitable structural conditions.
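One heuristic way to see how quadratic growth can arise (a back-of-the-envelope reading, not the paper's proof, and resting on the stated alignment assumption) is sketched below.

```latex
% Heuristic only: suppose each of L gated layers contributes an aligned
% second-order term of magnitude c to the composed map f_L, while the
% first-order terms remain O(1) (an assumption of this sketch).
\[
  \|\mathrm{II}_{f_L}\| \;\approx\; L\,c
  \qquad\Longrightarrow\qquad
  \text{intrinsic curvature of the image manifold}
  \;\sim\; \|\mathrm{II}_{f_L}\|^{2} \;\approx\; L^{2} c^{2},
\]
% since, by the Gauss equation, intrinsic curvature is quadratic in the second
% fundamental form II.  Aligned second-order contributions across layers thus
% yield a curvature that grows quadratically in the depth L.
```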
Figure 2: Gating increases representation curvature; curvature rises monotonically with gate strength.
Empirical Evaluation
Synthetic experiments substantiate the geometric expressivity gap observed in theory. The authors design a sequence classification task with a nonlinear (curved) boundary and compare ungated, pointwise nonlinear (SiLU), and multiplicatively gated attention models.
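The paper's exact task construction is not reproduced here; as a hedged illustration of the kind of setup described, the sketch below builds a toy sequence-classification dataset whose label boundary is spherical (hence curved) in the pooled-token space. Function names and parameters are my own, not the authors'.

```python
import numpy as np

def make_curved_boundary_task(n_seqs=1000, seq_len=16, dim=8, radius=None, seed=0):
    """Toy sequence-classification task with a curved (spherical) decision
    boundary: the label depends on whether the mean token lies inside a sphere.
    One plausible instance of the setup described above, not necessarily the
    paper's exact construction."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_seqs, seq_len, dim))
    pooled = X.mean(axis=1)                                  # sequence-level summary
    r = radius if radius is not None else np.median(np.linalg.norm(pooled, axis=1))
    y = (np.linalg.norm(pooled, axis=1) > r).astype(int)     # curved boundary ||mean(x)|| = r
    return X, y
```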
Empirical proxies for curvature, based on finite-difference approximations of second-order variation in the representations (one such proxy is sketched after the list below), confirm that:
- Gated models exhibit significantly higher curvature in their representation spaces compared to both ungated and pointwise nonlinear models.
- Curvature correlates positively with task accuracy: on tasks requiring nonlinear geometry, models with higher representational curvature achieve higher classification accuracy.
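A minimal sketch of such a finite-difference curvature proxy follows (my own formulation; the paper's estimator may normalize differently): average the norm of the central second difference of the representation map along random directions. The proxy vanishes for affine maps and is positive for curved ones.

```python
import numpy as np

def curvature_proxy(f, x, eps=1e-2, n_dirs=32, rng=None):
    """Finite-difference proxy for second-order variation of a representation
    map f at input x: mean norm of the central second difference along random
    unit directions (one plausible proxy, not necessarily the paper's)."""
    rng = rng or np.random.default_rng(0)
    vals = []
    for _ in range(n_dirs):
        u = rng.normal(size=x.shape)
        u /= np.linalg.norm(u)
        second_diff = f(x + eps * u) + f(x - eps * u) - 2.0 * f(x)
        vals.append(np.linalg.norm(second_diff) / eps**2)
    return float(np.mean(vals))

# Sanity check: the proxy is ~0 for an affine map and positive for a curved one.
A = np.random.default_rng(1).normal(size=(5, 5))
affine = lambda x: A @ x + 1.0
curved = lambda x: np.tanh(A @ x)
x0 = np.ones(5)
print(curvature_proxy(affine, x0))   # ~0 (up to floating-point noise)
print(curvature_proxy(curved, x0))   # > 0
```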
Figure 3: Decision boundaries in latent space. Gated models track nonlinear boundaries more faithfully than ungated ones.
Figure 4: Test accuracy as a function of isotropic attention curvature. Higher curvature is associated with better performance, saturating as curvature increases.
Ablation studies reveal that gains in geometric expressivity and accuracy arise specifically from multiplicative gating and not from generic nonlinearity.
Figure 5: Ablation of attention variants. Only multiplicative gating substantially increases geometric expressivity and accuracy.
Importantly, when the task does not require nonlinear structure—e.g., for a linear decision boundary—gated models do not outperform their flat (ungated) counterparts, despite still realizing more curved representations.
Figure 6: Attention curvature versus gate strength for linear control task. Curvature increases, but does not yield a performance boost.
Figure 7: Accuracy versus attention curvature for linear task. No consistent positive relationship; curvature is not beneficial for linearly separable problems.
Implications and Outlook
This analysis identifies a precise geometric expressivity gap induced by gating within attention mechanisms. The results formalize both the constraints on, and the extensions to, the set of representational geometries attainable by transformer-like models. Practically, multiplicative gating is shown to equip attention layers with the ability to capture nonlinear, curved manifold structure natively within a single block. The phenomenon of curvature amplification through depth suggests that deeper stacks of gated attention can systematically increase geometric expressivity, potentially improving the inductive bias of deep learning models on structured tasks.
On a theoretical level, these findings connect the algebraic analysis of neural architectures to the intrinsic geometry of their representation spaces, suggesting new lenses for understanding capability limits and design tradeoffs in foundation models. The robustness and genericity of gating-induced curvature point toward a broader class of geometric mechanisms through which architectural innovations can enhance model expressivity.
Future work may include:
- Investigating relationships between geometric expressivity and generalization, robustness, or sample efficiency.
- Extending the geometric analysis to other forms of gating and nonlinearities, or to architectures with input-dependent statistical decoders.
- Studying curvature effects in architectures pre-trained on large, real-world datasets, and exploring links between learned geometry and downstream task transfer or compositionality.
Conclusion
This paper rigorously establishes that multiplicative gating enables attention mechanisms to realize intrinsically curved statistical manifolds, strictly expanding the class of geometric structures representable by such models. Gated attention blocks exhibit robust, locally generic curvature, and this property accumulates systematically with depth in aligned regimes. Empirical evidence supports the claim that geometric curvature in representations correlates with improved performance on tasks demanding nonlinear inductive biases, while offering no spurious benefit on linearly separable tasks. These results have both significant theoretical implications for the geometry of neural representations and practical relevance for the design of more expressive deep learning architectures.