
Attention-Induced Curvature

Updated 9 November 2025
  • Attention-induced curvature is a concept where Transformer attention heads learn and adjust latent space geometry by estimating curvature parameters.
  • It employs stereographic projections, Möbius addition, and parallel transport to map embeddings into constant-curvature spaces, optimizing representations for hierarchical graph data.
  • The FPS-T model utilizes a kernelized linear-time mechanism for non-Euclidean attention, achieving superior graph reconstruction and node classification with enhanced parameter efficiency.

Attention-induced curvature describes the phenomenon whereby the geometry of the latent representation space, specifically its curvature, is directly controlled and learned through attention mechanisms within neural architectures. This concept is central to the Fully Product-Stereographic Transformer (FPS-T), which operates all Transformer layers over a product of constant-curvature spaces. In this context, each attention head dynamically selects the appropriate geometry (spherical, Euclidean, or hyperbolic) by learning a curvature parameter $\kappa_h$ during training. This enables the model to adapt its geometric inductive bias to better encode hierarchical or cyclical structures in graph data, leading to more compact and effective representations, as evidenced in tasks such as graph reconstruction and node classification.

1. Geometric Foundations: Product of Constant-Curvature Spaces

FPS-T generalizes the standard Transformer architecture by embedding query, key, and value representations not in the ordinary Euclidean vector space $\mathbb{R}^d$, but in a product of constant-curvature spaces $\mathbb{S}^d_{\kappa_h}$. Each such space is defined by a curvature parameter $\kappa$ and a stereographic chart:

  • The κ-stereographic model: $\mathbb{S}^d_\kappa = \{ x \in \mathbb{R}^d : 1 + \kappa \|x\|^2 > 0 \}$
  • The conformal metric tensor: $g_\kappa(x) = (\lambda_x^\kappa)^2 I$ with $\lambda_x^\kappa = \dfrac{2}{1 + \kappa \|x\|^2}$
  • Möbius addition, the group operation on these manifolds:

$x \oplus_\kappa y = \dfrac{(1 - 2\kappa x^\top y - \kappa\|y\|^2)\,x + (1 + \kappa\|x\|^2)\,y}{1 - 2\kappa x^\top y + \kappa^2\|x\|^2\|y\|^2}$

The full representation space is a product manifold $M = \mathbb{S}^{d}_{\kappa_1} \times \cdots \times \mathbb{S}^{d}_{\kappa_H}$, providing an independent curvature parameter for each of the $H$ attention heads. The geodesic distance in this product manifold is

$d_M(x, y) = \sqrt{\sum_{h=1}^{H} d_{\kappa_h}^2\left(x^{(h)}, y^{(h)}\right)}$

where $d_\kappa(x, y) = 2\tan_\kappa^{-1}\left(\|(-x) \oplus_\kappa y\|\right)$, generalizing the Euclidean distance.
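
As a concrete illustration, the following is a minimal PyTorch sketch of Möbius addition and the κ-stereographic distance defined above. It is a simplified reference implementation: $\kappa$ is taken as a plain Python float so that the $\tan_\kappa^{-1}$ helper can branch on its sign, whereas FPS-T treats $\kappa$ as a learnable tensor and relies on smooth, backpropagation-friendly manifold operations (e.g., via Geoopt).

```python
import torch

def mobius_add(x, y, kappa):
    """Möbius addition x ⊕_κ y on the κ-stereographic model (features along the last dim)."""
    x2 = (x * x).sum(dim=-1, keepdim=True)
    y2 = (y * y).sum(dim=-1, keepdim=True)
    xy = (x * y).sum(dim=-1, keepdim=True)
    num = (1 - 2 * kappa * xy - kappa * y2) * x + (1 + kappa * x2) * y
    den = 1 - 2 * kappa * xy + kappa ** 2 * x2 * y2
    return num / den.clamp_min(1e-15)

def tan_kappa_inv(u, kappa, eps=1e-7):
    """κ-generalized arctangent: arctan for κ > 0, identity for κ = 0, artanh for κ < 0."""
    if kappa > 0:
        return torch.atan(kappa ** 0.5 * u) / kappa ** 0.5
    if kappa < 0:
        k = (-kappa) ** 0.5
        return torch.atanh((k * u).clamp(-1 + eps, 1 - eps)) / k
    return u  # flat (Euclidean) limit

def dist_kappa(x, y, kappa):
    """Geodesic distance d_κ(x, y) = 2 · tan_κ^{-1}(‖(−x) ⊕_κ y‖)."""
    return 2 * tan_kappa_inv(mobius_add(-x, y, kappa).norm(dim=-1), kappa)
```

In the flat limit $\kappa \to 0$, Möbius addition reduces to ordinary vector addition, and the product-manifold distance $d_M$ is obtained by applying `dist_kappa` per head and summing the squared results.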

2. Attention on Tangent Spaces and Non-Euclidean Aggregation

Conventional non-Euclidean graph neural networks have often parameterized attention through explicit pairwise geodesic distances (e.g., $\mathrm{softmax}(-d(x_i, x_j))$), but FPS-T instead generalizes the scaled dot-product attention by explicitly mapping embeddings into tangent spaces:

  • Queries and keys $Q_i$, $K_j$ are computed as elements of the tangent space at the respective values $V_i$, via stereographic linear layers (i.e., $\exp_0 \circ W \circ \log_0$).
  • Parallel transport is used so that all tangent vectors are referenced at a common basepoint $0 \in \mathbb{S}^d_\kappa$:

$\tilde{Q}_i = \mathrm{PT}_{V_i \rightarrow 0}(Q_i), \quad \tilde{K}_j = \mathrm{PT}_{V_j \rightarrow 0}(K_j)$

  • At the origin, the metric is conformally Euclidean, allowing the use of standard inner products:

$\alpha_{ij} = \langle \tilde{Q}_i, \tilde{K}_j \rangle_0$

  • The aggregation is realized via the Einstein midpoint on $\mathbb{S}^d_\kappa$, using Möbius operations and conformal scaling.

This framework preserves the Transformer’s global and flexible attention capabilities while enabling each head to interpret geometric structure in a manner best suited to the observed graph.
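
As a rough illustration of this pipeline (a sketch, not the exact FPS-T parameterization), the snippet below reuses `tan_kappa_inv` from the previous example, treats `W_q` and `W_k` as hypothetical Euclidean weight matrices standing in for the stereographic linear layers, and assumes the closed form $\mathrm{PT}_{x \to 0}(v) = (\lambda_x^\kappa / 2)\, v$ for parallel transport to the origin.

```python
import torch

def lambda_kappa(x, kappa):
    """Conformal factor λ_x^κ = 2 / (1 + κ‖x‖²)."""
    return 2.0 / (1.0 + kappa * (x * x).sum(dim=-1, keepdim=True))

def log0(x, kappa):
    """Logarithm map at the origin: pulls a point back to the tangent space T_0."""
    norm = x.norm(dim=-1, keepdim=True).clamp_min(1e-15)
    return tan_kappa_inv(norm, kappa) * x / norm  # tan_kappa_inv from the previous sketch

def attention_scores(V, W_q, W_k, kappa):
    """Unnormalized tangent-space scores α_ij = <Q̃_i, K̃_j>_0 for values V on S^d_κ."""
    v0 = log0(V, kappa)                        # tangent-space representation of the values
    Q, K = v0 @ W_q, v0 @ W_k                  # queries/keys as tangent vectors (illustrative)
    pt = lambda_kappa(V, kappa) / 2.0          # parallel transport to the origin: scale by λ_V^κ / 2
    Q_t, K_t = pt * Q, pt * K
    return Q_t @ K_t.transpose(-2, -1)         # Euclidean inner product is valid at the origin
```

The resulting scores would then be normalized and the values recombined on the manifold via the Einstein midpoint, which this sketch omits.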

3. Kernelized Linear-Time Non-Euclidean Attention

Standard attention mechanisms have quadratic time and memory complexity due to pairwise calculations. FPS-T circumvents this via a kernelized, linear-attention trick:

  • The tangent-space dot product $\alpha_{ij}$ is approximated via a positive kernel feature map $\phi(x) = \mathrm{ELU}(x) + 1$ so that

$\langle \tilde{Q}_i, \tilde{K}_j \rangle \approx \phi(\tilde{Q}_i)^\top \phi(\tilde{K}_j)$

  • Additional scaling incorporates the conformal factors from the stereographic model.
  • The aggregation operation then factorizes double sums into matrix-vector products, reducing complexity from $O((N+M)^2)$ to $O(N+M)$ per head, where $N$ and $M$ are the numbers of nodes and edges in the graph (i.e., the TokenGT tokens); see the sketch after this list.
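
The following is a minimal sketch of this factorization in PyTorch, assuming plain tangent-space value vectors and omitting the conformal-factor scaling and Einstein-midpoint recombination that FPS-T applies on top.

```python
import torch
import torch.nn.functional as F

def kernel_feature(x):
    """Positive kernel feature map φ(x) = ELU(x) + 1."""
    return F.elu(x) + 1.0

def linear_attention(Q, K, V, eps=1e-6):
    """Softmax-free aggregation in O(T · d²) instead of O(T² · d), with T = N + M tokens.
    Q, K, V have shape (T, d)."""
    Qf, Kf = kernel_feature(Q), kernel_feature(K)       # (T, d)
    KV = Kf.transpose(-2, -1) @ V                       # (d, d): Σ_j φ(K_j) V_jᵀ, computed once
    Z = Kf.sum(dim=-2)                                  # (d,):   Σ_j φ(K_j)
    num = Qf @ KV                                       # (T, d): φ(Q_i)ᵀ Σ_j φ(K_j) V_jᵀ
    den = (Qf @ Z.unsqueeze(-1)).clamp_min(eps)         # (T, 1): φ(Q_i)ᵀ Σ_j φ(K_j)
    return num / den
```

Because `KV` and `Z` are accumulated once over all keys, the per-head cost grows linearly in the number of tokens $N + M$.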

The table below summarizes computational complexity per attention head:

| Attention Type | Complexity | Notes |
|---|---|---|
| Exact | $O((N+M)^2)$ | Full pairwise calculations |
| Kernelized (linear) | $O(N+M)$ | Via kernel feature trick |

This approach allows FPS-T to scale to much larger graphs without sacrificing the geometric adaptivity conferred by attention-induced curvature.

4. End-to-End Learning of Curvature

Each head’s curvature $\kappa_h$ is treated as a learnable parameter; a minimal setup is sketched after this list:

  • Initial value $\kappa_h = 0$ (Euclidean)
  • All geometric operations (stereographic maps, Möbius addition, $\tan_\kappa$, $\lambda^\kappa$, $\exp$, $\log$, and parallel transport) are smooth in $\kappa$ and support backpropagation.
  • Gradients $\partial L / \partial \kappa_h$ are propagated through the attention and feedforward computations and jointly optimized (typical learning rates: $1 \times 10^{-4}$ for curvature, $1 \times 10^{-2}$ for other weights, with the Adam optimizer).
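
A minimal sketch of this setup is shown below; `CurvatureHead` is a hypothetical stand-in for an FPS-T attention head, and the two Adam parameter groups implement the distinct learning rates mentioned above.

```python
import torch

class CurvatureHead(torch.nn.Module):
    """Toy module exposing a learnable per-head curvature κ_h, initialized at 0 (Euclidean)."""
    def __init__(self, dim):
        super().__init__()
        self.kappa = torch.nn.Parameter(torch.zeros(1))  # κ_h = 0 at initialization
        self.proj = torch.nn.Linear(dim, dim)            # placeholder for the head's other weights

model = CurvatureHead(dim=16)

# Separate learning rates: 1e-4 for the curvature parameters, 1e-2 for everything else.
curvature_params = [p for n, p in model.named_parameters() if "kappa" in n]
other_params = [p for n, p in model.named_parameters() if "kappa" not in n]
optimizer = torch.optim.Adam([
    {"params": curvature_params, "lr": 1e-4},
    {"params": other_params, "lr": 1e-2},
])
```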

As training proceeds, $\kappa_h$ learns to select hyperbolic ($\kappa_h < 0$), Euclidean ($\kappa_h = 0$), or spherical ($\kappa_h > 0$) geometry as required by graph structure. For example, on the Web-Edu dataset (sectional curvature $\approx -0.63$), $\kappa$ shifted from $0$ to $-0.5$ over training, supporting accurate embedding of hierarchical relationships.

A plausible implication is that models without curvature learning may be suboptimal when representing complex graph geometries, particularly in cases where oversmoothing and oversquashing are exacerbated by traditional message-passing designs.

5. Empirical Results and Parameter Efficiency

FPS-T demonstrates empirical advantages in both expressive power and efficiency:

  • On graph reconstruction and node classification benchmarks, FPS-T with learned curvature consistently outperforms fixed-Euclidean baselines, with especially notable gains in settings with strong non-Euclidean graph structure.
  • In low-dimensional settings (feature dimension 4 vs. 16), FPS-T matches or exceeds the expressiveness of full-dimensional Euclidean Transformers while using only $\approx 25\%$ as many parameters, confirming that certain data manifolds benefit from intrinsic curvature.
  • Across eight node classification benchmarks (homophily in $[0.11, 0.81]$), FPS-T achieves leading performance on 6/8 datasets, with the greatest improvements on heterophilic graphs, i.e., structures that strongly deviate from plain Euclidean assumptions.
  • On the Web-Edu dataset, model performance (mAP) increased in tandem with curvature adaptation, whereas fixed-curvature models could not match this progression.

These findings support the conclusion that attention-induced curvature materially sharpens attention patterns and enhances predictive accuracy, particularly in challenging graph settings.

6. Implementation, Tokenization, and Practical Considerations

FPS-T is implemented using PyTorch, PyG (PyTorch Geometric), and Geoopt, leveraging their support for manifold-valued tensors and geometric optimization:

  • The graph is tokenized à la TokenGT into $N+M$ tokens (nodes plus edges); positional encoding is via Laplacian eigenvectors, and two token-type embeddings distinguish node versus edge tokens (see the sketch after this list). Edge features encode only type and position.
  • Typical model depth is 1–3 layers, 1–4 attention heads, embedding dimension of 16, and standard regularization (dropout, weight decay) as tuned per dataset.
  • Kernelized attention yields linear complexity with respect to the number of nodes and edges; exact attention is retained for smaller or more tractable graphs.
  • No manual search over curvature initializations is required—a significant practical advantage compared to previous non-Euclidean networks.
  • Present limitations include cubic time complexity in ambient dimension for certain geometric operations (parallel transport, log/exp maps), and numerical instability as $|\kappa|\,\|x\|^2 \to 1$. Future work may seek heterogeneous or input-dependent manifold structures for further adaptation.
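
A simplified sketch of the TokenGT-style tokenization is given below. It assumes node and edge features share a common dimensionality and uses a dense graph Laplacian for the eigenvector positional encodings; a practical implementation would use sparse PyG utilities instead.

```python
import torch

def tokenize_graph(x, edge_index, edge_attr, pe_dim):
    """Build N + M tokens: one per node and one per edge, each carrying features plus
    Laplacian-eigenvector positional encodings; token_type (0 = node, 1 = edge) would be
    mapped to the two learned token-type embeddings."""
    N, M = x.size(0), edge_index.size(1)

    # Dense Laplacian L = D − A (assumes edge_index lists both directions of each edge).
    A = torch.zeros(N, N)
    A[edge_index[0], edge_index[1]] = 1.0
    L = torch.diag(A.sum(dim=1)) - A
    _, eigvecs = torch.linalg.eigh(L)
    pe = eigvecs[:, 1:pe_dim + 1]                        # skip the trivial constant eigenvector

    # Node tokens: [features, PE_v, PE_v]; edge tokens: [features, PE_src, PE_dst].
    node_tokens = torch.cat([x, pe, pe], dim=-1)
    edge_tokens = torch.cat([edge_attr, pe[edge_index[0]], pe[edge_index[1]]], dim=-1)

    tokens = torch.cat([node_tokens, edge_tokens], dim=0)          # (N + M, feat + 2·pe_dim)
    token_type = torch.cat([torch.zeros(N, dtype=torch.long),
                            torch.ones(M, dtype=torch.long)])
    return tokens, token_type
```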

By allowing the geometry of each attention head’s latent space to adapt to the observed data (attention-induced curvature), FPS-T achieves state-of-the-art performance and parameter efficiency in global-attention graph representation learning, without abandoning the strengths of Transformer architectures.
