Spherical Equivariant Graph Transformers

Updated 19 December 2025
  • Spherical equivariant graph transformers are neural architectures that integrate spherical harmonic representations with transformer attention to strictly enforce rotational (SO(3)) and roto-translational (SE(3)) equivariance on 3D data.
  • They fuse group-theoretic function spaces, such as spherical harmonics and Clebsch–Gordan algebra, with attention-based message passing to enhance sample efficiency and predictive accuracy.
  • These models achieve state-of-the-art performance in molecular property prediction and force field modeling, scaling to higher representation degrees while maintaining low equivariance error.

Spherical equivariant graph transformers are a class of neural network architectures designed to process three-dimensional molecular and geometric data while strictly respecting physical symmetries—specifically rotational (SO(3)) and (in some variants) roto-translational (SE(3)) equivariance. These models encode tensor-valued features on nodes and edges of spatial graphs, propagate information via attention and message-passing primitives that are equivariant under symmetry group actions, and enforce physical consistency in prediction tasks. The central innovation is the fusion of group-theoretic function spaces—built from spherical harmonics, Wigner-D matrices, and Clebsch–Gordan algebra—with transformer attention and nonlinear mixing, resulting in models that are both sample efficient and demonstrably expressive for molecular property prediction, force field modeling, and scientific representation learning (2505.23086, Tang, 15 Dec 2025, Fuchs et al., 2020).

1. Mathematical Foundations: SO(3) Equivariance, Spherical Tensors, and Harmonics

Spherical equivariant graph transformers operate on graph-structured data (nodes with spatial coordinates and features, edges defined by molecular connectivity or geometric proximity) with a design that renders all intermediate computations covariant under rotations, and optionally translations.

Key mathematical constructs:

  • SO(3) group: The space of rotations in three dimensions, represented as $3 \times 3$ orthogonal matrices with determinant $+1$.
  • Irreducible representations ($l = 0, 1, 2, \dots$): Each order $l$ defines a $(2l+1)$-dimensional Wigner-D representation $D^{(l)}(R)$ of a rotation $R$, acting on “type-$l$” spherical tensors.
  • Spherical harmonics $Y^{(l,m)}(\mathbf{p})$: Real- or complex-valued functions forming an orthonormal basis on the sphere $\mathbb{S}^2$, which transform via $Y^{(l)}(R\mathbf{p}) = D^{(l)}(R)\, Y^{(l)}(\mathbf{p})$.
  • Tensor product and Clebsch–Gordan decomposition: The tensor product of two spherical tensors of types $l_1$ and $l_2$ decomposes as $\bigoplus_{l=|l_1-l_2|}^{l_1+l_2} \mathbb{V}_l$, with coefficients given by Clebsch–Gordan numbers (see the dimension check at the end of this section).

Node features are represented as collections of such irreducible tensors:

$$\mathbf{x}_n = [x_n^{(0)}, \mathbf{x}_n^{(1)}, \ldots, \mathbf{x}_n^{(L)}] \in \bigoplus_{l=0}^L \mathbb{V}_l$$

Each $\mathbb{V}_l$ is a “fiber” of dimension $(2l+1)$, and all operations on these features are constructed to satisfy rotation equivariance (Tang, 15 Dec 2025).
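To make the bookkeeping concrete, the following minimal numpy sketch stores one $(C, 2l+1)$ block per degree $l$ and verifies the Clebsch–Gordan dimension identity $(2l_1+1)(2l_2+1) = \sum_{l=|l_1-l_2|}^{l_1+l_2}(2l+1)$. The helper names (`make_features`, `cg_output_dims`) are illustrative, not taken from the cited papers.

```python
import numpy as np

def make_features(num_channels: int, max_degree: int, rng=np.random.default_rng(0)):
    """Store a node's steerable features as one (C, 2l+1) block per degree l."""
    return {l: rng.standard_normal((num_channels, 2 * l + 1)) for l in range(max_degree + 1)}

def cg_output_dims(l1: int, l2: int):
    """Degrees (and dimensions) produced by the Clebsch-Gordan decomposition of l1 (x) l2."""
    return {l: 2 * l + 1 for l in range(abs(l1 - l2), l1 + l2 + 1)}

x_n = make_features(num_channels=8, max_degree=2)
print({l: block.shape for l, block in x_n.items()})
# {0: (8, 1), 1: (8, 3), 2: (8, 5)}

# Dimension check: the tensor product of a type-l1 and a type-l2 tensor has
# (2*l1+1)*(2*l2+1) components, matching the total dimension of the irreducible
# pieces it decomposes into.
for l1, l2 in [(1, 1), (1, 2), (2, 3)]:
    lhs = (2 * l1 + 1) * (2 * l2 + 1)
    rhs = sum(cg_output_dims(l1, l2).values())
    assert lhs == rhs, (l1, l2, lhs, rhs)
    print(f"l1={l1}, l2={l2}: {lhs} == {rhs}")
```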

2. Traditional Spherical Equivariant Graph Architectures

Historical paradigms—Tensor Field Networks (TFN), the SE(3)-Transformer, and related equivariant GNNs—implement equivariant message passing and attention by:

  • Constructing steerable convolution kernels as

$$W^{lk}(\mathbf{x}) = \sum_{J=|k-l|}^{k+l} \phi^{lk}_J(\|\mathbf{x}\|)\, W^{lk}_J(\hat{\mathbf{x}})$$

where $\phi^{lk}_J(r)$ are learnable radial functions (typically MLPs) and $W^{lk}_J$ is an angular basis determined by Clebsch–Gordan coefficients and spherical harmonics. This construction ensures $W^{lk}(R\mathbf{x}) = D_l(R)\, W^{lk}(\mathbf{x})\, D_k(R)^{-1}$.

  • Defining attention weights from invariant dot products between query and key fibers, with messages aggregated using learned, equivariant kernels (Fuchs et al., 2020, Tang, 15 Dec 2025); a minimal invariance check appears at the end of this section.
  • Stacking these components to yield multi-layer, multi-head equivariant transformers, usually truncating $L$ (the maximum irreducible representation degree) to manage complexity.

Such models guarantee SO(3) or SE(3) equivariance by construction and have achieved state-of-the-art accuracy in molecular property prediction, dynamics simulation, and structural modeling (Fuchs et al., 2020, Fuchs et al., 2021, Tang, 15 Dec 2025).
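The invariant-attention construction can be illustrated with a small self-contained check. The sketch below (hypothetical helper names, not code from the cited papers) restricts to type-0 and type-1 fibers, where the Wigner-D matrix of a type-1 fiber in the Cartesian basis is simply the $3 \times 3$ rotation matrix, and verifies that summed query/key inner products are unchanged by a rotation, which is why they are safe inputs to a softmax.

```python
import numpy as np
from scipy.spatial.transform import Rotation

rng = np.random.default_rng(0)
C = 4  # channels per degree

# Query/key fibers: type-0 (scalars) and type-1 (3-vectors) per channel.
q = {0: rng.standard_normal((C, 1)), 1: rng.standard_normal((C, 3))}
k = {0: rng.standard_normal((C, 1)), 1: rng.standard_normal((C, 3))}

def invariant_logit(q, k):
    """Sum of per-degree, per-channel inner products: an SO(3)-invariant scalar."""
    return sum(np.sum(q[l] * k[l]) for l in q)

# Rotate the type-1 fibers; type-0 fibers are unchanged by rotation.
# (For l=1 the Wigner-D matrix in the Cartesian basis is the rotation matrix itself.)
R = Rotation.random(random_state=1).as_matrix()
rotate = lambda f: {0: f[0], 1: f[1] @ R.T}

before = invariant_logit(q, k)
after = invariant_logit(rotate(q), rotate(k))
print(before, after)  # identical up to floating-point error
assert np.isclose(before, after)
```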

3. Equivariant Spherical Transformer (EST): Architecture and Spherical Attention

The Equivariant Spherical Transformer (EST) (2505.23086) advances the paradigm by shifting equivariant processing from purely harmonic (frequency) space to a spatial (orientation) domain:

  • Fourier Transform to Sphere: For each node (or edge) feature, the set of spherical harmonic coefficients is transformed into a spatial function on the sphere $\mathbb{S}^2$, using

$$f_{n,c}(\mathbf{p}) = \sum_{l=0}^L \sum_{m=-l}^l x_{n,c}^{(l,m)}\, Y^{(l,m)}(\mathbf{p})$$

  • Spherical Sampling: The function $f_{n,c}(\mathbf{p})$ is discretized by evaluating it at a uniform set of $S$ points $\{\mathbf{p}_s\}$ on a Fibonacci lattice. By the Nyquist criterion, $S \geq (2L)^2$ suffices for reconstruction (a minimal sampling-and-reconstruction sketch follows this list).
  • Spherical Attention: Classic transformer attention is applied between orientation tokens across the sphere:

$$A_{s_i,s_j} = \frac{\exp(Q_{s_i} K_{s_j}^T)}{\sum_{k=1}^S \exp(Q_{s_i} K_{s_k}^T)}, \qquad \widetilde{F}_{s_i} = \sum_{j=1}^S A_{s_i,s_j} V_{s_j}$$

with $Q, K, V$ as pointwise linear projections, plus optional rotary-style relative orientation embeddings (a minimal single-head sketch appears at the end of this section).

  • Inverse Spherical Fourier Transform: The attended, mixed features F~\widetilde{F} are projected back to update the steerable (harmonic) representation.
  • Message Passing: Between nodes, joint node features and edge spherical harmonics are fused in the spatial domain, attended over orientations, and re-harmonized.
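A minimal numpy/scipy sketch of the sampling step: it evaluates a bandlimited function with random coefficients on a Fibonacci lattice and recovers the coefficients by least squares, standing in for the forward and inverse spherical Fourier transforms. The `fibonacci_sphere` helper and the complex-harmonic convention via `scipy.special.sph_harm` are choices made here for brevity; the EST implementation details may differ.

```python
import numpy as np
from scipy.special import sph_harm

def fibonacci_sphere(S: int):
    """S roughly uniform points on the sphere; returns (polar, azimuth) angles."""
    i = np.arange(S)
    z = 1.0 - 2.0 * (i + 0.5) / S                 # uniform in cos(polar)
    polar = np.arccos(z)
    golden = (1.0 + np.sqrt(5.0)) / 2.0
    azimuth = (2.0 * np.pi * i / golden) % (2.0 * np.pi)
    return polar, azimuth

def sh_basis(L: int, polar, azimuth):
    """Matrix B with B[s, (l, m)] = Y^(l,m)(p_s), complex spherical harmonics."""
    cols = [sph_harm(m, l, azimuth, polar)        # scipy order: (m, l, azimuthal, polar)
            for l in range(L + 1) for m in range(-l, l + 1)]
    return np.stack(cols, axis=1)                 # shape (S, (L+1)^2)

L, S = 3, (2 * L) ** 2                            # S >= (2L)^2 sampling points
rng = np.random.default_rng(0)
coeffs = rng.standard_normal((L + 1) ** 2) + 1j * rng.standard_normal((L + 1) ** 2)

polar, azimuth = fibonacci_sphere(S)
B = sh_basis(L, polar, azimuth)

f_sampled = B @ coeffs                                       # forward transform: harmonic -> spatial
coeffs_rec, *_ = np.linalg.lstsq(B, f_sampled, rcond=None)   # inverse via least squares
print(np.max(np.abs(coeffs_rec - coeffs)))                   # ~1e-13: round trip is essentially exact
```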

Critical to the EST is that all spatial operations and attention steps are designed such that, under any rotation $R$, the output rotates consistently with the input—a property guaranteed by uniform spherical sampling and pointwise projection (2505.23086).
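Below is a minimal, single-head numpy sketch of the spherical attention step on already-sampled real features of shape $(S, C)$. The random projection matrices, the added $1/\sqrt{d}$ scaling, and the omission of relative orientation embeddings are simplifications made here, not the EST implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def spherical_attention(F, Wq, Wk, Wv):
    """Single-head attention between the S orientation tokens of one node.

    F          : (S, C) features sampled at S sphere points
    Wq, Wk, Wv : (C, C) pointwise (per-orientation) linear projections
    """
    Q, K, V = F @ Wq, F @ Wk, F @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]))   # (S, S) attention over orientations
    return A @ V                                 # (S, C) mixed features

rng = np.random.default_rng(0)
S, C = 64, 16
F = rng.standard_normal((S, C))
Wq, Wk, Wv = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))

F_tilde = spherical_attention(F, Wq, Wk, Wv)
print(F_tilde.shape)   # (64, 16)
```

Because every orientation token is processed by the same pointwise projections, permuting the $S$ tokens permutes the output identically, which is what makes this mixing compatible with rotating the (uniform) sampling set up to discretization error.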

4. Theoretical Properties and Expressiveness

The EST architecture provably subsumes any tensor-product-based Clebsch–Gordan convolution with respect to function space:

  • Expressiveness: For any two steerable representations $\mathbf{u} \in \mathbb{V}_{0 \to l_1}$ and $\mathbf{v} \in \mathbb{V}_{0 \to l_2}$, the tensor product $\mathbf{u} \otimes \mathbf{v}$ can be uniformly approximated by EST modules acting on their spherical transforms. This is achieved because the product of two harmonics on $\mathbb{S}^2$ decomposes into harmonics up to degree $l_1+l_2$ (illustrated numerically after this list), and weighted sums of outer products across orientations—implemented by transformer attention—are at least as expressive as explicit CG products, with the added benefit of powerful nonlinear mixing that is not restricted to bilinear tensor algebra (2505.23086).
  • Nonlinearity: Unlike canonical equivariant GNNs, whose nonlinearity is induced via channelwise normalizations or shallow equivariant MLPs, EST enables direct nonlocal nonlinear interactions between different spatial orientations at any layer.
  • Sampling and Equivariance Error: Empirical tests show rotation errors stay below $10^{-3}$ radians using 128–256 Fibonacci points over six layers, a two-order-of-magnitude improvement relative to nonuniform sampling (2505.23086). The main limitation is residual equivariance error from imperfect sampling; exact spherical quadrature or learnable point sets are indicated as future directions.
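The bandwidth claim behind the expressiveness argument can be checked numerically: the pointwise product $Y^{(l_1,m_1)} Y^{(l_2,m_2)}$ fits exactly in a harmonic basis truncated at $l_1+l_2$, but in general not at $l_1+l_2-1$. A short scipy check follows; the particular degrees and orders are illustrative choices, not values from the cited papers.

```python
import numpy as np
from scipy.special import sph_harm

rng = np.random.default_rng(0)
S = 400
azimuth = rng.uniform(0.0, 2.0 * np.pi, S)
polar = np.arccos(rng.uniform(-1.0, 1.0, S))      # uniformly distributed points on the sphere

def sh_basis(L):
    cols = [sph_harm(m, l, azimuth, polar) for l in range(L + 1) for m in range(-l, l + 1)]
    return np.stack(cols, axis=1)

l1, m1, l2, m2 = 2, 1, 1, 0                       # illustrative degrees/orders
product = sph_harm(m1, l1, azimuth, polar) * sph_harm(m2, l2, azimuth, polar)

def fit_residual(L):
    B = sh_basis(L)
    coeffs, *_ = np.linalg.lstsq(B, product, rcond=None)
    return np.linalg.norm(B @ coeffs - product)

print(fit_residual(l1 + l2))       # ~1e-14: the product lies in the span up to degree l1+l2
print(fit_residual(l1 + l2 - 1))   # clearly nonzero: degree-(l1+l2) content is present
```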

5. Implementation, Complexity, and Architectural Innovations

Implementation distinguishes EST from traditional equivariant graph transformers:

  • Parameter Sharing: All $Q, K, V$ projections and per-head hybrid experts are shared across $l$-channels, with only the inverse projection depending on $l$.
  • Mixture of Hybrid Experts: At each node or edge, the type-0 (invariant) features are routed via a gating network combining (1) a steerable equivariant FFN (two depthwise $\mathbb{V}_l \to \mathbb{V}_l$ layers plus gating) and (2) a spherical FFN (per-orientation SiLU + linear) (2505.23086).
  • Computational Scaling: For $C$ channels and $S \approx (2L)^2$ spherical points, EST's spherical attention head operates in $\mathcal{O}(S C^2)$ time versus $\mathcal{O}(L^6 C^2)$ for a full CG convolution at degree $L$. Thus, EST enables access to higher $L$ at drastically lower computational cost (see the cost comparison after this list).
  • Scalability: EST supports practical layer sizes ($L$ up to 4) and graph sizes previously infeasible for TFN or the SE(3)-Transformer, with attention-based mixing implemented via efficient batched matrix multiplies.
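A back-of-the-envelope comparison of the two per-head cost terms cited above, using $S = (2L)^2$, shows how quickly the gap opens as the maximum degree grows. Constant factors are ignored, so the numbers are only indicative.

```python
# Indicative per-head cost terms from the scaling discussion above (constants ignored).
C = 128  # channels

for L in range(1, 7):
    S = (2 * L) ** 2
    est_cost = S * C ** 2          # spherical attention head: O(S * C^2)
    cg_cost = L ** 6 * C ** 2      # full Clebsch-Gordan convolution: O(L^6 * C^2)
    print(f"L={L}: EST ~{est_cost:.2e}  CG ~{cg_cost:.2e}  ratio ~{cg_cost / est_cost:.1f}x")
```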

6. Empirical Results and Application Domains

EST demonstrates state-of-the-art performance on established molecular and materials benchmarks:

  • OC20 S2EF (energy and force regression): An 8-layer EST (45M params) achieves 231 meV energy MAE and 16.1 meV/Å force MAE, improving on EquiformerV2's 232 meV / 16.26 meV/Å (31M params).
  • IS2RE (Initial Structure to Relaxed Energy): 6-layer EST (32M) ranked first on both in-distribution (501 meV) and out-of-distribution (578 meV) tasks among all equivariant GNNs.
  • QM9 molecular property prediction: EST achieves SOTA on 8 of 12 targets (e.g., polarizability MAE 0.042 bohr³ vs. EquiformerV2’s 0.050) (2505.23086).
  • Ablations: EST can perfectly distinguish $n$-fold symmetric structures even when $L < n$, a regime where tensor-product models fail. Empirical equivariance error remains low with uniform sampling. Applications include force-field generation, molecular property regression, protein structure prediction, and generative modeling for chemical and biomolecular data (Tang, 15 Dec 2025, 2505.23086, Fuchs et al., 2020).

7. Open Problems and Future Directions

Limitations of current spherical equivariant transformers, including EST:

  • Sampling-induced equivariance error: Improving or eliminating residual SO(3)/SE(3) equivariance error via exact spherical quadrature or learnable sampling sets is an open challenge (2505.23086).
  • Combinatorial scaling for very high $L$: While EST reduces the scaling from $L^6$ (tensor products) to $L^2$, handling very high-bandwidth interactions in large biomolecular assemblies remains computationally challenging.
  • Integration of translation equivariance: Most EST and SO(3) equivariant models do not natively encode translation equivariance; full SE(3) handling is addressed by architectures such as SE(3)-Transformer (Fuchs et al., 2020), though typically at higher cost.
  • Generality and cross-domain transfer: A plausible implication is that merging spatial domain transformer mixing with frequency-domain group symmetries could expand geometric deep learning beyond physical environments to 3D computer vision and robotics tasks.

The EST framework indicates a new class of operations mixing group-theoretic feature spaces and transformer-style nonlinear global attention, representing a transition from bilinear tensor algebra to provably more expressive nonlinear, equivariant architectures for geometric machine learning (2505.23086).
