Curvature-Adaptive Transformer (CAT)
- Curvature-Adaptive Transformer (CAT) is a model that generalizes traditional Transformers to operate over mixed-curvature manifolds, including Euclidean, spherical, and hyperbolic spaces.
- It employs curvature-adaptive attention mechanisms and per-token geometric routing to dynamically fit complex data structures such as graphs and knowledge graphs.
- Leveraging kernelization and parameter sharing, CAT achieves notable performance improvements on graph tasks while maintaining only a modest computational and parameter overhead.
Curvature-Adaptive Transformer (CAT) generalizes the Transformer architecture to enable adaptive geometric specialization over non-Euclidean and Euclidean spaces. CAT achieves this either by learning a product of constant-curvature Riemannian manifolds per attention head, or by routing tokens over parallel geometric branches (Euclidean, spherical, hyperbolic) according to their local relational structure. This approach endows the model with the ability to fit the intrinsic geometry of data—particularly graphs and knowledge graphs—where flat, hierarchical, and cyclic structures co-exist. CAT retains the global, long-range receptive field of Transformers, remedies oversquashing/oversmoothing phenomena present in message-passing graph models, and achieves this with only modest computational overhead compared to standard architectures (Cho et al., 2023, Lin et al., 2 Oct 2025).
1. Mixed-Curvature Product Manifolds and Stereographic Parameterization
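CAT extends Transformer self-attention to mixed-curvature product manifolds. Let $\mathcal{M}_1, \dots, \mathcal{M}_H$ denote $d$-dimensional Riemannian manifolds, each with constant sectional curvature $\kappa_h$. The architecture defines the embedding space as the product manifold
$$\mathcal{P} = \mathcal{M}_1 \times \mathcal{M}_2 \times \cdots \times \mathcal{M}_H.$$
A point $x \in \mathcal{P}$ decomposes as $x = (x_1, \dots, x_H)$ with $x_h \in \mathcal{M}_h$, with inter-point distances and metrics computed component-wise:
$$d_{\mathcal{P}}(x, y)^2 = \sum_{h=1}^{H} d_{\kappa_h}(x_h, y_h)^2, \qquad g_{\mathcal{P}} = \bigoplus_{h=1}^{H} g_{\kappa_h}.$$
The product structure permits each attention head to operate in a distinct constant-curvature geometry (hyperbolic $\kappa_h < 0$, Euclidean $\kappa_h = 0$, spherical $\kappa_h > 0$), with each $\kappa_h$ learned end-to-end.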
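The component-wise distance can be made concrete with a small sketch. It assumes each component is represented in the κ-stereographic model, where the geodesic distance is $2\tan_\kappa^{-1}(\lVert -x \oplus_\kappa y \rVert)$; the function names (`mobius_add`, `artan_k`, `dist_k`, `product_dist`) are illustrative and not taken from the cited papers.

```python
import math

def mobius_add(x, y, k):
    """Kappa-stereographic (Mobius) addition of two vectors given as lists."""
    xy = sum(a * b for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    coef_x = 1 - 2 * k * xy - k * y2
    coef_y = 1 + k * x2
    denom = 1 - 2 * k * xy + k * k * x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]

def artan_k(r, k):
    """Curvature-dependent inverse tangent: arctan (k > 0), identity (k = 0), artanh (k < 0)."""
    if k > 0:
        return math.atan(math.sqrt(k) * r) / math.sqrt(k)
    if k < 0:
        return math.atanh(math.sqrt(-k) * r) / math.sqrt(-k)
    return r

def dist_k(x, y, k):
    """Geodesic distance inside one constant-curvature component."""
    diff = mobius_add([-a for a in x], y, k)
    return 2 * artan_k(math.sqrt(sum(d * d for d in diff)), k)

def product_dist(xs, ys, ks):
    """Product-manifold distance: square root of the summed squared component distances."""
    return math.sqrt(sum(dist_k(x, y, k) ** 2 for x, y, k in zip(xs, ys, ks)))

# One hyperbolic, one Euclidean, and one spherical component (illustrative values).
curvatures = [-1.0, 0.0, 1.0]
origin = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
point = [[0.5, 0.0], [0.5, 0.0], [1.0, 0.0]]
total = product_dist(origin, point, curvatures)
```

In this chart the flat case carries a conformal factor of 2, so `dist_k` with `k = 0` returns twice the usual Euclidean norm of the difference.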
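CAT implements all Riemannian operations (exp/log maps, parallel transport) using a stereographic chart of curvature $\kappa$:
$$\mathfrak{st}^d_\kappa = \{x \in \mathbb{R}^d : -\kappa \lVert x \rVert^2 < 1\}, \qquad g^\kappa_x = (\lambda^\kappa_x)^2 I, \qquad \lambda^\kappa_x = \frac{2}{1 + \kappa \lVert x \rVert^2}.$$
Exponential and logarithmic maps admit closed-form expressions as smooth functions of $\kappa$:
$$\exp^\kappa_x(v) = x \oplus_\kappa \left( \tan_\kappa\!\left( \frac{\lambda^\kappa_x \lVert v \rVert}{2} \right) \frac{v}{\lVert v \rVert} \right), \qquad \log^\kappa_x(y) = \frac{2}{\lambda^\kappa_x} \tan_\kappa^{-1}\!\left( \lVert -x \oplus_\kappa y \rVert \right) \frac{-x \oplus_\kappa y}{\lVert -x \oplus_\kappa y \rVert},$$
where $\oplus_\kappa$ is Möbius addition in the chart and
$$\tan_\kappa(t) = \begin{cases} \frac{1}{\sqrt{\kappa}} \tan\!\left(\sqrt{\kappa}\, t\right), & \kappa > 0, \\ t, & \kappa = 0, \\ \frac{1}{\sqrt{-\kappa}} \tanh\!\left(\sqrt{-\kappa}\, t\right), & \kappa < 0, \end{cases}$$
yielding canonical geometric behavior in each curvature regime (Cho et al., 2023).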
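As a minimal sketch of these maps, the snippet below restricts them to the origin, where the conformal factor equals 2 and Möbius addition with the zero vector is the identity, so that $\exp_0(v) = \tan_\kappa(\lVert v \rVert)\, v / \lVert v \rVert$ and $\log_0(y) = \tan_\kappa^{-1}(\lVert y \rVert)\, y / \lVert y \rVert$. The helper names are illustrative assumptions, not an API from the cited work.

```python
import math

def tan_k(r, k):
    """Curvature-dependent tangent: tan (k > 0), identity (k = 0), tanh (k < 0)."""
    if k > 0:
        return math.tan(math.sqrt(k) * r) / math.sqrt(k)
    if k < 0:
        return math.tanh(math.sqrt(-k) * r) / math.sqrt(-k)
    return r

def artan_k(r, k):
    """Inverse of tan_k on its principal branch."""
    if k > 0:
        return math.atan(math.sqrt(k) * r) / math.sqrt(k)
    if k < 0:
        return math.atanh(math.sqrt(-k) * r) / math.sqrt(-k)
    return r

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def exp0(v, k):
    """Exponential map at the origin: tangent vector -> point in the chart."""
    n = norm(v)
    if n == 0.0:
        return list(v)
    scale = tan_k(n, k) / n
    return [scale * a for a in v]

def log0(y, k):
    """Logarithmic map at the origin: point in the chart -> tangent vector."""
    n = norm(y)
    if n == 0.0:
        return list(y)
    scale = artan_k(n, k) / n
    return [scale * a for a in y]

# Round trip in each curvature regime (illustrative tangent vector).
v = [0.3, -0.4]
round_trips = {k: log0(exp0(v, k), k) for k in (-1.0, 0.0, 1.0)}
```

For negative curvature, `exp0` always lands strictly inside the ball of radius $1/\sqrt{-\kappa}$ because `tanh` is bounded by 1, which is what keeps the hyperbolic branch well defined for any tangent vector.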
2. Curved Attention Mechanisms and Head-Specific Curvature Learning
Each attention head $h$ within the multi-head self-attention module is parameterized by a learned curvature $\kappa_h$, initialized at zero (Euclidean) but permitted to take any real value as optimization proceeds. Token embeddings are projected to the tangent space at the reference point, followed by the exponential map onto the appropriate manifold. Queries $Q_i$, keys $K_j$, and values $V_j$ are produced by learned per-head projections. Comparisons between $Q_i$ and $K_j$ require parallel transport to the origin tangent space and then a standard dot product.
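The resulting non-Euclidean attention kernel between a query $Q_i$ and a key $K_j$ is defined as
$$k_{\kappa_h}(Q_i, K_j) = \left\langle \mathrm{PT}_{\to \mathbf{0}}(Q_i),\ \mathrm{PT}_{\to \mathbf{0}}(K_j) \right\rangle,$$
where $\mathrm{PT}_{\to \mathbf{0}}$ denotes parallel transport to the tangent space at the origin. All operations are differentiable in $\kappa_h$, so gradients with respect to curvature flow naturally, and each attention head's curvature adapts to the learning objective and the geometry of the data (Cho et al., 2023).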
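The transport-then-dot-product comparison can be sketched as follows. Several assumptions are made here and are not confirmed by the source: queries and keys are treated as tangent vectors attached to per-token base points in the stereographic chart, scores are normalized with a plain softmax without temperature scaling, and parallel transport to the origin is written as the rescaling $v \mapsto v / (1 + \kappa \lVert p \rVert^2)$ that holds in that chart.

```python
import math

def pt_to_origin(v, p, k):
    """Parallel transport of tangent vector v from base point p to the origin.

    In the kappa-stereographic chart this reduces to a rescaling by
    lambda_p / lambda_0 = 1 / (1 + k * ||p||^2).
    """
    scale = 1.0 / (1.0 + k * sum(a * a for a in p))
    return [scale * a for a in v]

def curved_attention_weights(queries, keys, points, k):
    """Row-wise softmax over dot products of origin-transported queries and keys."""
    q0 = [pt_to_origin(q, p, k) for q, p in zip(queries, points)]
    k0 = [pt_to_origin(key, p, k) for key, p in zip(keys, points)]
    weights = []
    for qi in q0:
        scores = [sum(a * b for a, b in zip(qi, kj)) for kj in k0]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Two tokens with illustrative base points inside the unit ball.
points = [[0.1, 0.2], [0.5, -0.3]]
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 1.0], [-1.0, 2.0]]
flat = curved_attention_weights(queries, keys, points, 0.0)
hyperbolic = curved_attention_weights(queries, keys, points, -1.0)
```

With `k = 0.0` the transport is the identity and the weights coincide with ordinary dot-product attention; a nonzero curvature rescales each score by factors that depend on where the tokens sit in the chart, which is how the curvature parameter enters the gradient.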
3. Mixture-of-Geometry via Per-Token Routing Networks
An alternative CAT formulation replaces each Transformer block with a "CATBlock" that exposes three parallel attention branches (Euclidean, spherical, and hyperbolic) whose outputs are combined per token with convex mixing weights. A lightweight, differentiable MLP predicts, for each token $x_i$, a 3D logit vector that a softmax transforms into routing probabilities over the three geometry experts:
$$\pi_i = \operatorname{softmax}\big(\mathrm{MLP}(x_i)\big) \in \Delta^2, \qquad \pi_i = (\pi_i^{E}, \pi_i^{S}, \pi_i^{H}).$$
Geometry-specific attention is performed in each branch using manifold-optimized projection, aggregation, and feed-forward submodules (Euclidean: standard; Spherical: exponential/logarithmic mapping to/from the tangent space at the sphere north pole; Hyperbolic: operations in the Poincaré ball with Möbius addition and scalar multiplication). The final output for each token is a convex combination of the three branches' results, with weights set by the learned per-token routing (Lin et al., 2 Oct 2025).
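The routing and mixing step can be illustrated with a small sketch. The router shape (one hidden ReLU layer, no biases), the weight layout, and all names are hypothetical choices for illustration; the branch outputs are taken as given vectors rather than computed by actual geometry-specific attention.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def router_logits(x, w_hidden, w_out):
    """Hypothetical router: ReLU(x W1) W2 -> 3 logits, weights stored column-wise."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col))) for col in w_hidden]
    return [sum(h * w for h, w in zip(hidden, col)) for col in w_out]

def mix_branches(probs, branch_outputs):
    """Convex combination of the Euclidean, spherical, and hyperbolic outputs for one token."""
    dim = len(branch_outputs[0])
    return [sum(p * out[d] for p, out in zip(probs, branch_outputs)) for d in range(dim)]

# Illustrative token, router weights, and per-branch outputs.
token = [1.0, -2.0]
w_hidden = [[0.5, 0.1], [-0.3, 0.2]]             # two hidden units
w_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]    # three geometry logits
probs = softmax(router_logits(token, w_hidden, w_out))
branch_outputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
mixed = mix_branches(probs, branch_outputs)
```

Because the weights are a softmax output, the mixed vector always lies in the convex hull of the three branch outputs, and the whole path from token to output stays differentiable.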
4. Computational Efficiency via Kernelization and Parameter Sharing
Naively extending attention to non-Euclidean manifolds or three parallel attention branches would result in infeasible computational complexity and tripled parameter count. CAT addresses this by:
- Kernelizing the curved attention, using adapted feature maps $\phi$ so that attention dot products can be factorized as $\phi(Q)\,\big(\phi(K)^\top V\big)$, enabling $O(N)$ compute and memory scaling in the number of tokens $N$.
- Sharing large feed-forward and embedding layers across all geometry branches, resulting in approximately 5% parameter overhead compared to standard Transformers, with inference time that stays below the combined cost of the three fixed-geometry models.
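The factorization behind kernelized attention can be checked numerically. The sketch below is a flat (Euclidean) illustration only: it assumes the feature map $\phi(x) = \mathrm{elu}(x) + 1$, a common choice in linear attention that the source does not confirm for CAT, and it omits the transport step a curved variant would apply first.

```python
import math

def phi(x):
    """Assumed feature map elu(x) + 1, which is positive for every real x."""
    return x + 1.0 if x > 0 else math.exp(x)

def featurize(rows):
    return [[phi(x) for x in row] for row in rows]

def linear_attention(Q, K, V):
    """Compute phi(Q) (phi(K)^T V) with row normalisation, without an N x N matrix."""
    fq, fk = featurize(Q), featurize(K)
    d, dv = len(fk[0]), len(V[0])
    # d x dv summary of keys and values, accumulated in one pass over the tokens
    kv = [[sum(fk[n][t] * V[n][j] for n in range(len(V))) for j in range(dv)] for t in range(d)]
    ksum = [sum(col) for col in zip(*fk)]
    out = []
    for q in fq:
        z = sum(a * b for a, b in zip(q, ksum))
        out.append([sum(q[t] * kv[t][j] for t in range(d)) / z for j in range(dv)])
    return out

def quadratic_attention(Q, K, V):
    """Reference: same kernel, but materialising every pairwise weight."""
    fq, fk = featurize(Q), featurize(K)
    out = []
    for q in fq:
        w = [sum(a * b for a, b in zip(q, k)) for k in fk]
        z = sum(w)
        out.append([sum(w[n] * V[n][j] for n in range(len(V))) / z for j in range(len(V[0]))])
    return out

# Three tokens, feature dimension 2, value dimension 2 (illustrative numbers).
Q = [[0.2, -1.0], [1.5, 0.3], [-0.7, 0.9]]
K = [[0.1, 0.4], [-2.0, 1.0], [0.8, -0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]
fast = linear_attention(Q, K, V)
slow = quadratic_attention(Q, K, V)
```

Both functions return the same values; the linear version only ever stores the `d x dv` summary `kv` and the `d`-vector `ksum`, so its cost grows linearly with the number of tokens.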
The table below summarizes empirical parameter counts and inference times for knowledge graph completion (Lin et al., 2 Oct 2025):
| Model | FB15k-237 Params | FB15k-237 Time (ms) | WN18RR Params | WN18RR Time (ms) |
|---|---|---|---|---|
| Fixed-Euclidean | 979 K | 0.781 | 2.65 M | 1.012 |
| Fixed-Hyperbolic | 975 K | 3.280 | 2.65 M | 3.592 |
| Fixed-Spherical | 967 K | 1.313 | 2.65 M | 1.533 |
| CAT | 1,032 K | 4.885 | 2.71 M | 5.566 |
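On FB15k-237, CAT adds about 5% parameters over the Euclidean baseline (1,032 K vs. 979 K), and its inference time (4.885 ms) stays below the combined time of the three fixed-geometry models (5.374 ms), although it is roughly 1.5× that of the slowest single baseline. This suggests that CAT amortizes the cost of geometric specialization and can be trained at scale.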
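The overhead figures can be derived directly from the table above. The short calculation below uses only the FB15k-237 columns; the variable and function names are illustrative.

```python
# Values copied from the FB15k-237 columns of the table above.
params_k = {
    "Fixed-Euclidean": 979,
    "Fixed-Hyperbolic": 975,
    "Fixed-Spherical": 967,
    "CAT": 1032,
}
time_ms = {
    "Fixed-Euclidean": 0.781,
    "Fixed-Hyperbolic": 3.280,
    "Fixed-Spherical": 1.313,
    "CAT": 4.885,
}

def overhead_pct(new, base):
    """Relative increase of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base

# Parameter overhead of CAT relative to the Euclidean baseline.
param_overhead = overhead_pct(params_k["CAT"], params_k["Fixed-Euclidean"])

# Inference time of CAT versus running all three fixed-geometry models.
sum_fixed_time = sum(t for name, t in time_ms.items() if name != "CAT")
time_vs_sum = time_ms["CAT"] / sum_fixed_time

# Inference time of CAT versus the slowest single baseline.
time_vs_hyperbolic = time_ms["CAT"] / time_ms["Fixed-Hyperbolic"]
```

Under these numbers the parameter overhead is a little over 5%, while inference time sits between the hyperbolic baseline and the sum of all three fixed-geometry models.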
5. Empirical Performance and Interpretability
CAT achieves strong empirical results across graph and knowledge graph tasks:
- In graph reconstruction (mean Average Precision), CAT learns dataset-specific curvatures (on Web-Edu, Power grid, Facebook, and Bio-Worm) and achieves up to 99.00% mAP compared to 89.56% for Euclidean-only TokenGT while using a lower embedding dimension (e.g., a 4-dimensional CAT outperforms a 16-dimensional Euclidean baseline) (Cho et al., 2023).
- For node classification on heterophilic and homophilic datasets, CAT outperforms Euclidean TokenGT by up to 8.3 percentage points macro-F1 and matches or exceeds baseline GCNs and hyperbolic GCNs on all benchmarks (Cho et al., 2023).
- On knowledge graph link prediction (FB15k-237, WN18RR), CAT yields relative improvements of ≈10% in MRR and Hits@10 over the best fixed-geometry Transformer, with roughly 5% more parameters than a single baseline model and about 1.5× the inference time of the hyperbolic baseline (Lin et al., 2 Oct 2025).
Per-token routing probabilities in the mixture-of-geometry CATBlock are readily interpretable: tokens with high hyperbolic weighting correspond to locally tree-like (hierarchical) structure, high spherical weights indicate cyclic/angular context, and high Euclidean weights fit locally flat substructure. Visualization of routing heatmaps reveals that models specialize tokens in semantically meaningful ways (e.g., favoring Euclidean and hyperbolic geometry in knowledge graphs with predominantly flat/hierarchical relations) (Lin et al., 2 Oct 2025).
6. Training Protocols, Losses, and Layerwise Geometry Specialization
Training is fully differentiable: each head's curvature or each token's routing weights are updated via backpropagation. The routing MLP in mixture-of-geometry CAT is regularized with an entropy term to encourage exploratory mixture distributions early during optimization:
$$\mathcal{L} = \mathcal{L}_{\text{task}} - \lambda_{\text{ent}} \cdot \frac{1}{N} \sum_{i=1}^{N} \mathcal{H}(\pi_i), \qquad \mathcal{H}(\pi_i) = -\sum_{g \in \{E, S, H\}} \pi_i^{g} \log \pi_i^{g},$$
where $\pi_i$ is the routing distribution of token $i$ over the three geometry experts. The entropy coefficient $\lambda_{\text{ent}}$ is annealed through training to enable initial exploration (soft distribution over geometry experts) and late specialization (sharp, interpretable routing) (Lin et al., 2 Oct 2025). Across layers, the ability of curvatures or expert selection to vary facilitates hierarchical mixtures-of-geometry, a property unattainable in prior architectures, which required a manual, global geometry choice.
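A minimal sketch of an entropy-regularized objective with an annealed coefficient follows. The linear schedule, the starting value 0.1, and the function names are hypothetical; the source states only that the coefficient is annealed, not how.

```python
import math

def entropy(probs):
    """Shannon entropy of one routing distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_coeff(step, total_steps, start=0.1, end=0.0):
    """Hypothetical linear anneal of the entropy weight from `start` to `end`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac

def regularized_loss(task_loss, routing_probs, step, total_steps):
    """Task loss minus the weighted mean routing entropy.

    Subtracting the entropy rewards soft (exploratory) routing while the
    coefficient is large; as it decays, the task loss dominates and the
    router is free to specialise.
    """
    mean_entropy = sum(entropy(p) for p in routing_probs) / len(routing_probs)
    return task_loss - entropy_coeff(step, total_steps) * mean_entropy

# Two tokens: one routed uniformly, one routed sharply (illustrative values).
routing = [[1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0], [0.98, 0.01, 0.01]]
early = regularized_loss(1.0, routing, step=0, total_steps=100)
late = regularized_loss(1.0, routing, step=100, total_steps=100)
```

The uniform distribution attains the maximum entropy $\log 3$ and a one-hot distribution attains zero, so the bonus shrinks as routing sharpens.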
7. Extensions, Domains, and Future Prospects
Potential extensions of CAT include:
- Applying CAT to text, vision, temporal, and multimodal domains wherein distinct regions may manifest different geometric structure (e.g., cyclic syntax—spherical; hierarchical parses—hyperbolic; 360° imagery).
- Introducing dynamic, per-layer geometry adaptation or enriching the geometry expert set beyond $\{\mathbb{E}, \mathbb{S}, \mathbb{H}\}$ (Euclidean, spherical, hyperbolic).
- Pruning or distilling underutilized branches for efficient inference, or leveraging geometry-grounded interpretability to relate learned routing/specialization to data properties (e.g., parse tree depth, relation type) (Lin et al., 2 Oct 2025).
The shift from "single global geometry" to "tokenwise or headwise curvature" positions CAT as a candidate basis for geometry-adaptive foundation models that address the geometric complexity of real-world relational data.
In summary, Curvature-Adaptive Transformers provide a unified, end-to-end differentiable generalization of self-attention architectures to the product of constant-curvature manifolds (per-head) or mixture of geometric experts (per-token), supporting interpretable, data-driven adaptation of geometric inductive bias and yielding performance gains with modest overhead (Cho et al., 2023, Lin et al., 2 Oct 2025).