Clebsch–Gordan Transformer
- Clebsch–Gordan Transformer is a neural network architecture that integrates global attention with exact SO(3) equivariance and permutation symmetry for geometric deep learning.
- It employs Clebsch–Gordan decomposition in both nonlinearity and attention mechanisms to achieve subquadratic computational complexity and support high-order rotation group representations.
- It demonstrates state-of-the-art performance on tasks such as N-body simulation and molecular property prediction while offering notable improvements in speed and memory efficiency.
The Clebsch–Gordan Transformer (CGT) is a neural network architecture that unifies global attention, exact $\SO(3)$ equivariance, and permutation symmetry for geometric deep learning. By employing the Clebsch–Gordan decomposition in both its nonlinearity and attention mechanism, the CGT achieves subquadratic computational complexity in the number of tokens and supports high-order representations of the rotation group, outperforming previous equivariant transformers in diverse 3D learning tasks (Howell et al., 28 Sep 2025, Kondor et al., 2018).
1. Mathematical Foundation: Clebsch–Gordan Decomposition and $\SO(3)$ Equivariance
The core mathematical operation underlying the CGT is the Clebsch–Gordan (CG) decomposition. For irreducible representations (irreps) $D^{\ell_1}$ and $D^{\ell_2}$ of $\SO(3)$, the tensor (Kronecker) product decomposes as
$$D^{\ell_1} \otimes D^{\ell_2} \;\cong\; \bigoplus_{\ell = |\ell_1 - \ell_2|}^{\ell_1 + \ell_2} D^{\ell},$$
implemented by sparse CG matrices $C^{\ell}_{\ell_1 \ell_2}$ whose entries vanish unless the angular momentum selection rules ($|\ell_1 - \ell_2| \le \ell \le \ell_1 + \ell_2$ and $m = m_1 + m_2$) and parity constraints are satisfied. This structure enables construction of layers that are strictly equivariant under 3D rotations: each feature transforms according to a specific irrep order $\ell$, and all interactions among features are mediated by the algebra of irreps and the CG coefficients (Howell et al., 28 Sep 2025, Kondor et al., 2018).
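As a concrete illustration, the following sketch (a toy, not the paper's implementation; the helper `cg_matrix` is hypothetical) builds the CG matrix $C^{\ell}_{\ell_1\ell_2}$ with SymPy's exact coefficients and reports how sparse each admissible output order is.

```python
# Minimal sketch: build the Clebsch-Gordan matrix C^l_{l1,l2} and inspect its
# sparsity. Uses SymPy's exact CG coefficients; cg_matrix is a hypothetical helper.
import numpy as np
from sympy.physics.quantum.cg import CG

def cg_matrix(l1, l2, l):
    """Return the (2l+1) x (2l1+1)(2l2+1) CG matrix coupling orders l1, l2 -> l."""
    C = np.zeros((2 * l + 1, (2 * l1 + 1) * (2 * l2 + 1)))
    for m in range(-l, l + 1):
        for m1 in range(-l1, l1 + 1):
            for m2 in range(-l2, l2 + 1):
                # <l1 m1 l2 m2 | l m>, zero unless m = m1 + m2
                C[m + l, (m1 + l1) * (2 * l2 + 1) + (m2 + l2)] = float(
                    CG(l1, m1, l2, m2, l, m).doit())
    return C

l1, l2 = 1, 2
# Only |l1 - l2| <= l <= l1 + l2 yields a nonzero block (the selection rule).
for l in range(abs(l1 - l2), l1 + l2 + 1):
    C = cg_matrix(l1, l2, l)
    print(f"l={l}: shape={C.shape}, nonzeros={np.count_nonzero(C)} of {C.size}")
```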
2. Clebsch–Gordan Attention: Global $\SO(3)$-Equivariant Correlation
CGT replaces standard dot-product attention with a "Clebsch–Gordan convolution," a global equivariant correlation structured on spherical harmonic tensor fields. For $N$ tokens, each carrying features $f_i^{(\ell)}$ for orders $\ell = 0, \dots, L$, per-order linear projections produce equivariant query, key, and value fields. The global correlation couples the query and key irreps through the CG tensor product, and the output at each order $\ell$ aggregates the value features weighted by these equivariant correlations. To avoid $O(N^2)$ cost, the token dimension is handled with a Fast Fourier Transform (FFT): the correlation is evaluated as a pointwise product in the Fourier domain, with the batched CG-tensor operation applied there and an inverse FFT restoring the spatial domain, achieving $O(N \log N)$ scaling (Howell et al., 28 Sep 2025).
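To make the token-axis FFT trick concrete, here is a simplified sketch (pure NumPy, omitting the per-order CG coupling and any learned projections; the function name `fft_token_correlation` is hypothetical) that computes a global circular correlation between query and key channels in $O(N \log N)$ rather than forming an $N \times N$ attention matrix.

```python
# Simplified sketch of the FFT trick on the token axis (not the full CG attention):
# a global circular cross-correlation per channel, computed in O(N log N).
import numpy as np

def fft_token_correlation(q, k):
    """q, k: (N, C) arrays of N tokens with C flattened irrep channels.

    Returns an (N, C) array whose row n holds the correlation at token shift n.
    """
    Q = np.fft.rfft(q, axis=0)
    K = np.fft.rfft(k, axis=0)
    # Pointwise product in the Fourier domain <=> correlation in the token domain.
    return np.fft.irfft(np.conj(Q) * K, n=q.shape[0], axis=0)

N, C = 1024, 16                       # e.g. C = sum over l of (2l + 1), flattened
rng = np.random.default_rng(0)
q, k = rng.normal(size=(N, C)), rng.normal(size=(N, C))
corr = fft_token_correlation(q, k)    # O(N log N) per channel
print(corr.shape)                     # (1024, 16)
```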
3. Nonlinearity: Clebsch–Gordan Product as Equivariant Activation
In contrast to pointwise nonlinearities that break Fourier-domain structure, Clebsch–Gordan-based architectures, including CGT and Clebsch–Gordan Nets, employ the quadratic, equivariant tensor product followed by projection via CG coefficients. For fragments $F^{(\ell_1)}$ and $F^{(\ell_2)}$, the coupled output is
$$G^{(\ell)} = C^{\ell}_{\ell_1 \ell_2}\left(F^{(\ell_1)} \otimes F^{(\ell_2)}\right), \qquad |\ell_1 - \ell_2| \le \ell \le \ell_1 + \ell_2,$$
ensuring each output channel transforms as the irreducible representation $D^{\ell}$ and the mapping is exactly $\SO(3)$-equivariant (Kondor et al., 2018).
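A minimal sketch of this coupling (again a toy under the same assumptions, with hypothetical helpers `cg_matrix` and `cg_nonlinearity`) is shown below: two fragments of orders $\ell_1$ and $\ell_2$ are coupled into every admissible output order.

```python
# Toy CG tensor-product nonlinearity: couple fragments of orders l1 and l2
# into all admissible output orders l via the sparse CG matrices.
import numpy as np
from sympy.physics.quantum.cg import CG

def cg_matrix(l1, l2, l):
    """(2l+1) x (2l1+1)(2l2+1) CG projection matrix."""
    C = np.zeros((2 * l + 1, (2 * l1 + 1) * (2 * l2 + 1)))
    for m in range(-l, l + 1):
        for m1 in range(-l1, l1 + 1):
            m2 = m - m1                          # selection rule m = m1 + m2
            if -l2 <= m2 <= l2:
                C[m + l, (m1 + l1) * (2 * l2 + 1) + (m2 + l2)] = float(
                    CG(l1, m1, l2, m2, l, m).doit())
    return C

def cg_nonlinearity(f1, f2, l1, l2):
    """Couple f1 (dim 2l1+1) and f2 (dim 2l2+1) into all output orders l."""
    kron = np.kron(f1, f2)                       # quadratic, equivariant tensor product
    return {l: cg_matrix(l1, l2, l) @ kron       # projection onto irrep l
            for l in range(abs(l1 - l2), l1 + l2 + 1)}

rng = np.random.default_rng(0)
out = cg_nonlinearity(rng.normal(size=3), rng.normal(size=5), l1=1, l2=2)
print({l: v.shape for l, v in out.items()})      # {1: (3,), 2: (5,), 3: (7,)}
```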
4. Efficiency: Exploiting Sparsity and Computational Complexity
A naïve all-pairs tensor product scales steeply with the maximum harmonic degree, but the selection rules and sparsity of the CG matrices reduce this cost to cubic in the maximum order: each CG matrix $C^{\ell}_{\ell_1 \ell_2}$ has nonzero entries only where $m = m_1 + m_2$ (see the counting sketch below). Combined with the FFT on the token axis, the complete attention block achieves:
- $O(N \log N)$ scaling in the number of tokens $N$
- $O(L^3)$ scaling in the maximum harmonic order $L$
- Linear memory per layer
This efficiency enables practical deployment with high-order irreps on modern GPU hardware (Howell et al., 28 Sep 2025).
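The following counting sketch (illustrative only, not the paper's cost accounting) compares the dense size of a CG block against the number of entries that can be nonzero under the $m$-selection rule.

```python
# Illustrative count: dense size of C^l_{l1,l2} versus entries allowed to be
# nonzero by the selection rule m = m1 + m2.
def dense_size(l1, l2, l):
    return (2 * l + 1) * (2 * l1 + 1) * (2 * l2 + 1)

def max_nonzeros(l1, l2, l):
    if not abs(l1 - l2) <= l <= l1 + l2:         # l-selection rule: zero block
        return 0
    return sum(1
               for m in range(-l, l + 1)
               for m1 in range(-l1, l1 + 1)
               if -l2 <= m - m1 <= l2)           # m-selection rule

for L in (2, 4, 8):
    print(L, dense_size(L, L, L), max_nonzeros(L, L, L))
```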
5. Permutation Equivariance and Set Structure
FFT-based global attention inherently breaks the permutation symmetry required for processing sets. CGT addresses this with two strategies:
- Filter Weight-Tying Across Frequencies: Ensures learned filters commute with arbitrary permutations.
- Data Augmentation and Regularization: Incorporates random token permutations and permutation-equivariant losses (e.g., via Deep Sets or spectral graph attention formulations).
Alternatively, replacing the token-axis FFT with a Graph Fourier Transform on the Laplacian yields strict permutation equivariance, as Laplacian eigenvectors transform naturally under permutations (Howell et al., 28 Sep 2025).
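The permutation-equivariance claim for the Laplacian-based Graph Fourier Transform can be checked numerically. The sketch below (generic spectral filtering with a hypothetical `spectral_filter` helper, not the CGT layer itself) verifies that filtering through the graph Laplacian commutes with node permutations.

```python
# Check that spectral filtering y = U g(Lambda) U^T x, with L = D - A = U Lambda U^T,
# commutes with node permutations: filtering the permuted graph equals permuting
# the filtered output.
import numpy as np

def spectral_filter(A, x, g=lambda lam: np.exp(-lam)):
    """Apply the matrix function g of the graph Laplacian of A to node features x."""
    L = np.diag(A.sum(axis=1)) - A
    lam, U = np.linalg.eigh(L)
    return U @ (g(lam)[:, None] * (U.T @ x))

rng = np.random.default_rng(0)
n = 8
A = rng.random((n, n)); A = (A + A.T) / 2       # symmetric edge weights
np.fill_diagonal(A, 0.0)
x = rng.normal(size=(n, 3))                     # per-node features

P = np.eye(n)[rng.permutation(n)]               # permutation matrix
y = spectral_filter(A, x)
y_perm = spectral_filter(P @ A @ P.T, P @ x)
print(np.allclose(y_perm, P @ y))               # True: strict permutation equivariance
```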
6. Empirical Performance and Applications
CGT demonstrates state-of-the-art or highly competitive results in domains requiring both $\SO(3)$ equivariance and global context:
| Task | Metric | CGT Result | Baseline Comparison |
|---|---|---|---|
| N-body Simulation | MSE, MSE | 0.0041/0.0065 | SE(3)-Transformer: 0.0076/0.075 |
| QM9 Molecular Properties | MAE (dipole moment) | 0.21 D | SE(3)-Transformer: 0.51 D |
| ModelNet40 Classification | Accuracy | 89.3% | SEGNN: 90.5%; SE(3)-Transformer: 88.1% |
| Robotic Grasping (2048 pts) | Rot. Error | 0.025 rad | DGCNN: 0.031 rad |
CGT further achieves significant memory and speed improvements versus local or low-order equivariant attention methods; for example, it requires 8 GB of GPU memory versus 12 GB for the SE(3)-Transformer at the same token count, together with higher throughput. CGT also scales to 4096-point clouds, where the SE(3)-Transformer runs out of memory (Howell et al., 28 Sep 2025).
7. Limitations and Future Research Directions
The principal constraints of Clebsch–Gordan Transformers include:
- The cubic scaling in harmonic order, which can become limiting when very high maximum orders are required.
- FFT-based attention can introduce small numerical errors for very long token sequences.
Proposed research extensions involve:
- Multi-scale schemes (e.g., fast multipole or Barnes–Hut style) to approach linear scaling.
- Learned or sparse/low-rank CG convolutions to push the harmonic cost below cubic.
- Integrating continuous-depth or state-space layers for seamless modeling of long-range interactions (Howell et al., 28 Sep 2025).
8. Relation to Clebsch–Gordan Nets and Generalization
Clebsch–Gordan Nets apply the same tensor-product nonlinearity in fully Fourier-space, $\SO(3)$-equivariant spherical CNNs, using precomputed CG matrices to map between irreps and avoiding repeated forward/inverse Fourier transforms after the initial transform. This methodology generalizes to any compact group whose irreducible representations and CG coefficients are known, supporting construction of G-equivariant neural architectures with precise symmetry guarantees (Kondor et al., 2018).
In summary, the Clebsch–Gordan Transformer establishes a scalable, expressive, and symmetry-preserving foundation for deep learning on geometric and physical data, demonstrating efficient attention and nonlinearity mechanisms applicable to a wide range of scientific and computational domains (Howell et al., 28 Sep 2025, Kondor et al., 2018).