Curvature-Adaptive Transformer (CAT)

Updated 25 February 2026
  • Curvature-Adaptive Transformer (CAT) is a model that generalizes traditional Transformers to operate over mixed-curvature manifolds, including Euclidean, spherical, and hyperbolic spaces.
  • It employs curvature-adaptive attention mechanisms and per-token geometric routing to dynamically fit complex data structures such as graphs and knowledge graphs.
  • Leveraging kernelization and parameter sharing, CAT achieves notable performance improvements on graph tasks while maintaining only a modest computational and parameter overhead.

Curvature-Adaptive Transformer (CAT) generalizes the Transformer architecture to enable adaptive geometric specialization over non-Euclidean and Euclidean spaces. CAT achieves this either by learning a product of constant-curvature Riemannian manifolds per attention head, or by routing tokens over parallel geometric branches (Euclidean, spherical, hyperbolic) according to their local relational structure. This approach endows the model with the ability to fit the intrinsic geometry of data—particularly graphs and knowledge graphs—where flat, hierarchical, and cyclic structures co-exist. CAT retains the global, long-range receptive field of Transformers, remedies oversquashing/oversmoothing phenomena present in message-passing graph models, and achieves this with only modest computational overhead compared to standard architectures (Cho et al., 2023, Lin et al., 2 Oct 2025).

1. Mixed-Curvature Product Manifolds and Stereographic Parameterization

CAT extends Transformer self-attention to mixed-curvature product manifolds. Let $\{M_{k_i}\}_{i=1}^m$ denote $d$-dimensional Riemannian manifolds, each with constant sectional curvature $k_i \in \mathbb{R}$. The architecture defines the embedding space as the product manifold

$$M = M_{k_1} \times M_{k_2} \times \dots \times M_{k_m}.$$

A point $x \in M$ decomposes as $(x^{(1)}, x^{(2)}, \ldots, x^{(m)})$, with inter-point distances and metrics computed component-wise:

$$d_M^2(x, y) = \sum_{i=1}^m d_{k_i}^2(x^{(i)}, y^{(i)}).$$

The product structure permits each attention head to operate in a distinct constant-curvature geometry (hyperbolic $k < 0$, Euclidean $k = 0$, spherical $k > 0$), with each $k_i$ learned end-to-end.
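The component-wise distance above can be sketched in a few lines of plain Python. The per-factor distance used here is an assumption: it is the usual κ-stereographic form $d_\kappa(x, y) = 2\tan_\kappa^{-1}(\|(-x) \oplus_\kappa y\|)$ with Möbius-style addition $\oplus_\kappa$, which this section does not spell out. The function names `arctan_k`, `mobius_add`, `component_distance`, and `product_distance` are illustrative only.

```python
import math

def arctan_k(r, k):
    """Inverse of the curvature-dependent tangent tan_k."""
    if k > 0:
        return math.atan(math.sqrt(k) * r) / math.sqrt(k)
    if k < 0:
        return math.atanh(math.sqrt(-k) * r) / math.sqrt(-k)
    return r

def mobius_add(x, y, k):
    """Addition in the kappa-stereographic model (assumed standard formula)."""
    xy = sum(a * b for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    coef_x = 1 - 2 * k * xy - k * y2
    coef_y = 1 + k * x2
    denom = 1 - 2 * k * xy + k * k * x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]

def component_distance(x, y, k):
    """Geodesic distance inside one constant-curvature factor.

    Note: for k = 0 this convention gives 2 * ||x - y||, because the
    stereographic metric carries a conformal factor of 2 at the origin.
    """
    diff = mobius_add([-a for a in x], y, k)
    return 2 * arctan_k(math.sqrt(sum(c * c for c in diff)), k)

def product_distance(xs, ys, ks):
    """d_M(x, y): square root of the summed squared component distances."""
    return math.sqrt(sum(component_distance(x, y, k) ** 2
                         for x, y, k in zip(xs, ys, ks)))
```

For example, `product_distance([[0.0, 0.0], [0.0, 0.0]], [[0.5, 0.0], [1.0, 0.0]], [-1.0, 1.0])` combines one hyperbolic and one spherical factor into a single distance.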

CAT implements all Riemannian operations (exp/log maps, parallel transport) using a stereographic chart of curvature $\kappa$:

$$\text{st}_\kappa^d = \{z \in \mathbb{R}^d : 1 + \kappa \|z\|^2 > 0\}.$$

Exponential and logarithmic maps admit closed-form expressions as smooth functions of $\kappa$. With the conformal factor

$$\lambda_x^\kappa = \frac{2}{1 + \kappa\|x\|^2},$$

which equals 2 at the origin, the maps based at the origin are mutual inverses:

$$\exp_0^\kappa(v) = \tan_\kappa(\|v\|) \frac{v}{\|v\|}, \quad \log_0^\kappa(z) = \tan_\kappa^{-1}(\|z\|) \frac{z}{\|z\|},$$

where

$$\tan_\kappa(r) = \begin{cases} (1/\sqrt{\kappa}) \tan(\sqrt{\kappa}\, r) & \kappa > 0 \\ r & \kappa = 0 \\ (1/\sqrt{-\kappa}) \tanh(\sqrt{-\kappa}\, r) & \kappa < 0 \end{cases}$$

yielding canonical geometric behavior in each curvature regime (Cho et al., 2023).
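A minimal sketch of the curvature-dependent tangent and the origin-based maps may help make the three regimes concrete. It assumes that `log0` is the exact inverse of `exp0`; any additional scaling convention should be checked against the cited paper. The names `tan_k`, `arctan_k`, `exp0`, and `log0` are illustrative, not taken from a released implementation.

```python
import math

def tan_k(r, k):
    """Curvature-dependent tangent: tan for k > 0, identity for k = 0, tanh for k < 0."""
    if k > 0:
        return math.tan(math.sqrt(k) * r) / math.sqrt(k)
    if k < 0:
        return math.tanh(math.sqrt(-k) * r) / math.sqrt(-k)
    return r

def arctan_k(r, k):
    """Inverse of tan_k on its principal branch."""
    if k > 0:
        return math.atan(math.sqrt(k) * r) / math.sqrt(k)
    if k < 0:
        return math.atanh(math.sqrt(-k) * r) / math.sqrt(-k)
    return r

def _norm(v):
    return math.sqrt(sum(c * c for c in v))

def exp0(v, k):
    """Map a tangent vector at the origin onto the manifold."""
    n = _norm(v)
    if n == 0.0:
        return list(v)
    scale = tan_k(n, k) / n
    return [scale * c for c in v]

def log0(z, k):
    """Map a manifold point back to the tangent space at the origin."""
    n = _norm(z)
    if n == 0.0:
        return list(z)
    scale = arctan_k(n, k) / n
    return [scale * c for c in z]
```

Because `tan_k` tends to the identity as the curvature tends to zero from either side, a head initialized at zero curvature can drift into the spherical or hyperbolic regime without a discontinuity in its forward pass.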

2. Curved Attention Mechanisms and Head-Specific Curvature Learning

Each attention head within the multi-head self-attention module is parameterized by a learned curvature $\kappa_h$, initialized at zero (Euclidean) but permitted to take any real value as optimization proceeds. Token embeddings are projected to the tangent space at the reference point, followed by the exponential map onto the appropriate manifold. Projections $Q, K, V$ are computed as:

$$Q_i = \log_{V_i}^\kappa(X_i W^Q), \quad K_i = \log_{V_i}^\kappa(X_i W^K), \quad V_i = X_i W^V.$$

Comparisons between $Q_i$ and $K_j$ require parallel transport to the origin tangent space and then a standard dot product.

The resulting non-Euclidean attention kernel is defined as

$$K(Q_i,K_j) = \exp\left(\langle PT_{V_i\rightarrow 0}Q_i,\, PT_{V_j\rightarrow 0}K_j \rangle_0 / \sqrt{d'}\right),$$

where $PT$ denotes parallel transport. All operations are differentiable in $\kappa$, so gradients with respect to curvature flow naturally, and each attention head's curvature adapts to the learning objective and the geometry of the data (Cho et al., 2023).
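The kernel can be sketched as follows, under two assumptions that are not stated in this section: parallel transport from a point $x$ to the origin is taken to be the rescaling $v \mapsto (\lambda_x^\kappa / 2)\, v$ familiar from the κ-stereographic model, and $\langle\cdot,\cdot\rangle_0$ is treated as the plain Euclidean dot product. The queries and keys are assumed to be tangent vectors already produced by the log maps; all function names are hypothetical.

```python
import math

def conformal_factor(x, kappa):
    """lambda_x = 2 / (1 + kappa * ||x||^2)."""
    return 2.0 / (1.0 + kappa * sum(c * c for c in x))

def transport_to_origin(base, vec, kappa):
    """Assumed parallel transport T_base -> T_0: rescale by lambda_base / 2."""
    scale = conformal_factor(base, kappa) / 2.0
    return [scale * c for c in vec]

def curved_kernel(query, key, base_q, base_k, kappa):
    """exp(<PT q, PT k> / sqrt(d')), with d' the head dimension."""
    tq = transport_to_origin(base_q, query, kappa)
    tk = transport_to_origin(base_k, key, kappa)
    dot = sum(a * b for a, b in zip(tq, tk))
    return math.exp(dot / math.sqrt(len(query)))

def attention_weights(queries, keys, bases, kappa):
    """Row-normalized kernel values, one row per query token."""
    rows = []
    for q, bq in zip(queries, bases):
        scores = [curved_kernel(q, kv, bq, bk, kappa)
                  for kv, bk in zip(keys, bases)]
        total = sum(scores)
        rows.append([s / total for s in scores])
    return rows
```

With `kappa = 0.0` the transport is the identity and the kernel reduces to the usual exponentiated scaled dot product, which is consistent with heads being initialized in the Euclidean regime.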

3. Mixture-of-Geometry via Per-Token Routing Networks

An alternative CAT formulation replaces each Transformer block with a "CATBlock" that exposes three parallel attention branches (Euclidean, spherical, and hyperbolic) and assigns each token a convex mixing weight over them. A lightweight, differentiable MLP predicts a three-dimensional logit vector per token, and a softmax turns it into routing probabilities over the three geometry experts:

$$\boldsymbol\alpha = [\alpha^E,\alpha^H,\alpha^S] = \text{softmax}(\text{MLP}(\mathbf X)) \in \mathbb{R}^{B\times N\times 3},$$

with $B$ the batch size and $N$ the number of tokens.

Geometry-specific attention is performed in each branch using manifold-optimized projection, aggregation, and feed-forward submodules (Euclidean: standard; Spherical: exponential/logarithmic mapping to/from the tangent space at the sphere north pole; Hyperbolic: operations in the Poincaré ball with Möbius addition and scalar multiplication). The final output for each token is a convex combination of the three branches' results, with weights set by the learned per-token routing (Lin et al., 2 Oct 2025).
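The routing step for a single token can be illustrated with a short sketch. The logits are passed in directly as a stand-in for the routing MLP's output, and the three branch outputs are taken as given; `softmax` and `mix_branches` are hypothetical helper names, with the expert order (Euclidean, hyperbolic, spherical) following $\boldsymbol\alpha$.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mix_branches(logits, out_e, out_h, out_s):
    """Convex combination of the three branch outputs for one token.

    logits: three routing logits in the order (E, H, S).
    out_e, out_h, out_s: branch output vectors of equal length.
    Returns the mixed vector and the routing weights.
    """
    a_e, a_h, a_s = softmax(logits)
    mixed = [a_e * e + a_h * h + a_s * s
             for e, h, s in zip(out_e, out_h, out_s)]
    return mixed, (a_e, a_h, a_s)
```

Because the weights are non-negative and sum to one, the mixed output always lies in the convex hull of the three branch outputs, and a strongly peaked logit vector makes the block behave like a single fixed-geometry branch for that token.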

4. Computational Efficiency via Kernelization and Parameter Sharing

Naively extending attention to non-Euclidean manifolds, or running three parallel attention branches, would incur $O(n^2)$ pairwise manifold computations and a tripled parameter count. CAT addresses this by:

  • Kernelizing the curved attention, using adapted feature maps (e.g., $\phi(x)=\mathrm{ELU}(x)+1$) so that attention dot products can be factorized as $\langle \phi(\tilde Q), \phi(\tilde K) \rangle$, enabling $O(n)$ compute and memory scaling.
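  • Sharing large feed-forward and embedding layers across all geometry branches, which keeps the parameter overhead to roughly 2 to 5% over a fixed-Euclidean Transformer (1,032 K vs. 979 K on FB15k-237; 2.71 M vs. 2.65 M on WN18RR). The runtime cost is larger: reported inference time is about 1.5× that of the fixed-hyperbolic model and about 5.5× to 6.3× that of the fixed-Euclidean model.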
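The kernelization in the first item can be sketched in plain Python. The example below is generic linear attention with the $\mathrm{ELU}(x)+1$ feature map, applied to query and key vectors that are assumed to already live in the origin tangent space; it is not claimed to be the authors' exact implementation. `quadratic_attention` is included only as a reference to check the factorized version against.

```python
import math

def phi(x):
    """Feature map ELU(x) + 1, applied elementwise; always positive."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention(Q, K, V):
    """O(n) attention: accumulate key/value summaries once, reuse per query."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]   # sum_j phi(k_j) v_j^T
    z = [0.0] * d_k                         # sum_j phi(k_j)
    for k_row, v_row in zip(K, V):
        fk = phi(k_row)
        for a in range(d_k):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v_row[b]
    out = []
    for q_row in Q:
        fq = phi(q_row)
        norm = sum(fq[a] * z[a] for a in range(d_k))
        out.append([sum(fq[a] * S[a][b] for a in range(d_k)) / norm
                    for b in range(d_v)])
    return out

def quadratic_attention(Q, K, V):
    """Reference O(n^2) version using explicit pairwise kernel values."""
    out = []
    for q_row in Q:
        fq = phi(q_row)
        w = [sum(a * b for a, b in zip(fq, phi(k_row))) for k_row in K]
        total = sum(w)
        out.append([sum(wi * v[b] for wi, v in zip(w, V)) / total
                    for b in range(len(V[0]))])
    return out
```

Both functions return the same values; the linear version never forms the $n \times n$ matrix of pairwise scores, which is where the $O(n)$ compute and memory scaling comes from.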

The table below summarizes empirical parameter counts and inference times for knowledge graph completion (Lin et al., 2 Oct 2025):

| Model | FB15k-237 Params | FB15k-237 Time (ms) | WN18RR Params | WN18RR Time (ms) |
|---|---|---|---|---|
| Fixed-Euclidean | 979 K | 0.781 | 2.65 M | 1.012 |
| Fixed-Hyperbolic | 975 K | 3.280 | 2.65 M | 3.592 |
| Fixed-Spherical | 967 K | 1.313 | 2.65 M | 1.533 |
| CAT | 1,032 K | 4.885 | 2.71 M | 5.566 |

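This indicates that CAT amortizes part of the cost of geometric specialization: its parameter count stays within a few percent of any single fixed-geometry model, and its inference time (4.885 ms and 5.566 ms) is below the combined time of the three fixed-geometry models (5.374 ms and 6.137 ms), which makes training at scale plausible.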

5. Empirical Performance and Interpretability

CAT achieves strong empirical results across graph and knowledge graph tasks:

  • In graph reconstruction (mean Average Precision), CAT learns dataset-specific curvatures (e.g., $\kappa^1=[-0.63,-0.28,-0.08,-0.03]$ for Web-Edu, Power grid, Facebook, Bio-Worm) and achieves up to 99.00% mAP compared to 89.56% for Euclidean-only TokenGT, while using a lower embedding dimension (e.g., a 4-dimensional CAT outperforms a 16-dimensional Euclidean baseline) (Cho et al., 2023).
  • For node classification on heterophilic and homophilic datasets, CAT outperforms Euclidean TokenGT by up to 8.3 percentage points macro-F1 and matches or exceeds baseline GCNs and hyperbolic GCNs on all benchmarks (Cho et al., 2023).
  • On knowledge graph link prediction (FB15k-237, WN18RR), CAT yields relative improvements of ≈10% in MRR and Hits@10 over the best fixed-geometry Transformer, with a parameter count close to that of a single baseline model and an inference time about 1.5× that of the fixed-hyperbolic baseline (Lin et al., 2 Oct 2025).

Per-token routing probabilities in the mixture-of-geometry CATBlock are readily interpretable: tokens with high hyperbolic weighting correspond to locally tree-like (hierarchical) structure, high spherical weights indicate cyclic/angular context, and high Euclidean weights fit locally flat substructure. Visualization of routing heatmaps reveals that models specialize tokens in semantically meaningful ways (e.g., favoring Euclidean and hyperbolic geometry in knowledge graphs with predominantly flat/hierarchical relations) (Lin et al., 2 Oct 2025).

6. Training Protocols, Losses, and Layerwise Geometry Specialization

Training is fully differentiable: each head's curvature or each token's routing weights are updated via backpropagation. The routing MLP in mixture-of-geometry CAT is regularized with an entropy term to encourage exploratory mixture distributions early during optimization:

$$\mathcal L = \mathcal L_\text{CE} + \lambda_\text{ent} \left(-\frac{1}{BN} \sum_{b,n}\sum_{g\in\{E,H,S\}} \pi_{b,n,g} \log \pi_{b,n,g}\right),$$

where $\pi_{b,n,g}$ is the routing probability assigned to geometry $g$ for token $n$ of batch element $b$ (the entries of $\boldsymbol\alpha$ in Section 3). The entropy coefficient $\lambda_\text{ent}$ is annealed through training to enable initial exploration (soft distribution over geometry experts) and late specialization (sharp, interpretable routing) (Lin et al., 2 Oct 2025).

Across layers, the ability of curvatures or expert selection to vary facilitates hierarchical mixtures-of-geometry, a property unattainable in prior architectures, which required a manual, global geometry choice.
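The regularized objective can be sketched as below. The bracketed term in the loss is the mean routing entropy, so with the sign convention as written the effect of the regularizer depends on the sign and size of $\lambda_\text{ent}$; the annealing schedule is not specified in this section, so the sketch takes the coefficient as an input. `routing_entropy` and `total_loss` are hypothetical names, and the cross-entropy value is assumed to be computed elsewhere.

```python
import math

def routing_entropy(pi):
    """Mean entropy of the routing distributions.

    pi[b][n] is a length-3 probability vector over (E, H, S)
    for token n of batch element b.
    """
    total, count = 0.0, 0
    for sequence in pi:
        for probs in sequence:
            # Skip zero entries: p * log(p) tends to 0 as p tends to 0.
            total += -sum(p * math.log(p) for p in probs if p > 0.0)
            count += 1
    return total / count

def total_loss(ce_loss, pi, lambda_ent):
    """Cross-entropy plus the entropy term weighted by lambda_ent."""
    return ce_loss + lambda_ent * routing_entropy(pi)
```

The entropy term ranges from 0 for one-hot routing to $\log 3$ for uniform routing over the three experts, which gives a convenient scale for choosing and monitoring the coefficient.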

7. Extensions, Domains, and Future Prospects

Potential extensions of CAT include:

  • Applying CAT to text, vision, temporal, and multimodal domains in which distinct regions may exhibit different geometric structure (e.g., spherical geometry for cyclic syntax or 360° imagery, hyperbolic geometry for hierarchical parses).
  • Introducing dynamic, per-layer geometry adaptation or enriching the geometry expert set beyond {E,H,S}\{E, H, S\}.
  • Pruning or distilling underutilized branches for efficient inference, or leveraging geometry-grounded interpretability to relate learned routing/specialization to data properties (e.g., parse tree depth, relation type) (Lin et al., 2 Oct 2025).

The shift from a single global geometry to tokenwise or headwise curvature positions CAT as a candidate basis for geometry-adaptive foundation models that address the geometric complexity of real-world relational data.

In summary, Curvature-Adaptive Transformers provide a unified, end-to-end differentiable generalization of self-attention architectures to the product of constant-curvature manifolds (per-head) or mixture of geometric experts (per-token), supporting interpretable, data-driven adaptation of geometric inductive bias and yielding performance gains with modest overhead (Cho et al., 2023, Lin et al., 2 Oct 2025).
