
SE(3)-Transformers

Updated 19 December 2025
  • SE(3)-Transformers are geometric deep learning models that encode rigid motion symmetries in 3D data using group-theoretic representations.
  • They integrate equivariant attention, message passing, and tensor field networks to ensure outputs transform predictably under rotations and translations.
  • These models deliver robust performance in molecular property prediction, 3D registration, and robotic policy learning, though they incur high computational costs.

The SE(3)-Transformer is a geometric deep learning architecture designed for domains where data possess intrinsic symmetries under the special Euclidean group SE(3)—the group of 3D rotations and translations. By encoding SE(3)-equivariance at every layer, these models ensure that outputs transform predictably under arbitrary rigid motions of the input. This property is critical for robust, sample-efficient learning in molecular property prediction, protein design, 3D perception, point cloud registration, and robotic policy learning, where the laws of physics or measurement apparatuses impose such symmetry constraints. The SE(3)-Transformer combines equivariant message passing, tensor field networks, and group-theoretic attention mechanisms to achieve end-to-end provable equivariance for data structured as graphs, point clouds, or volumetric grids.

1. Group-Theoretic Foundations and Representations

SE(3) is the group of rigid motions in 3D: $g = (R, t)$ with $R \in SO(3)$ (rotations) and $t \in \mathbb{R}^3$ (translations). Its action on a point $x$ is $g \cdot x = Rx + t$. Equivariance for a neural network $f$ is formalized as $f(g \cdot x) = g \cdot f(x)$ for all $g \in SE(3)$. To encode this property in networks, features are organized by type, the so-called irreducible representations (irreps) of $SO(3)$: type-$l$ features $h^{(l)}$ transform under the Wigner-D matrix $D^{(l)}(R)$ as $h^{(l)} \to D^{(l)}(R)\, h^{(l)}$. This structure is foundational for tensor field networks, steerable CNNs, and SE(3)-equivariant transformers (Fuchs et al., 2020, Siguenza et al., 19 Oct 2025, Kundu et al., 24 May 2024).
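
This transformation law can be checked numerically. The following minimal sketch (an illustration, not code from the cited papers) uses the Cartesian convention $D^{(1)}(R) = R$ and shows that a type-0 feature built from pairwise distances is invariant under a rigid motion, while a type-1 feature built from relative positions rotates with $R$.

```python
import numpy as np
from scipy.spatial.transform import Rotation

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                     # a toy point cloud

def type0_feature(x):
    """Type-0 (scalar) feature per point: mean distance to the other points."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return d.mean(axis=1)

def type1_feature(x):
    """Type-1 (vector) feature per point: displacement from the centroid."""
    return x - x.mean(axis=0)

# Random rigid motion g = (R, t).
R = Rotation.random(random_state=1).as_matrix()
t = rng.normal(size=3)
gx = x @ R.T + t

# Type-0 features transform with D^(0)(R) = 1: they are invariant.
assert np.allclose(type0_feature(gx), type0_feature(x))

# Type-1 features transform with D^(1)(R) = R (Cartesian basis): h -> R h.
assert np.allclose(type1_feature(gx), type1_feature(x) @ R.T)
print("equivariance checks passed")
```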

2. Equivariant Attention and Message Passing Mechanisms

The SE(3)-Transformer generalizes self-attention to explicitly respect SE(3) symmetries. Core to the approach are attention weights that are invariant, with value updates that are equivariant. For node indices $i, j$ and degree $l$, the attention weight is computed as

$$\alpha_{ij} = \mathrm{softmax}_j\Bigl( \sum_{l=0}^{L} \sum_{m=-l}^{l} \langle h_i^{(l)}, h_j^{(l)} \rangle \, Y_l^m(\hat r_{ij}) \, w_{lm} \Bigr)$$

where $Y_l^m(\hat r_{ij})$ is the spherical harmonic evaluated at the unit vector $\hat r_{ij}$ along edge $ij$, and $w_{lm}$ are learnable parameters. The value update is a tensor contraction over channel and angular indices,
$$h_i^{(L)\prime} = \sum_{j \in \mathcal{N}(i)} \sum_{l_1, l_2} \Bigl( h_j^{(l_1)} \otimes \bigl[ W^{(l_1, l_2, L)} Y_{l_2}(\hat r_{ij}) \bigr] \Bigr),$$
where tensor products and contractions are defined by group-theoretic Clebsch–Gordan coefficients. This guarantees that feature updates transform appropriately under SE(3).
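
A simplified numerical sketch of why such attention weights are invariant is given below: the same-degree inner products $\langle h_i^{(l)}, h_j^{(l)} \rangle$ and the edge lengths are unchanged by a global rigid motion, so logits built from them yield identical softmax outputs before and after the transformation. The feature construction and the scalar stand-ins for the learnable weights are assumptions for illustration, not the exact formulation above.

```python
import numpy as np
from scipy.spatial.transform import Rotation
from scipy.special import softmax

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 3))                      # toy point cloud (nodes)
w0, w1 = 0.7, 0.3                                # stand-ins for learnable weights

def features(x):
    """Toy per-node features: a type-0 scalar and a type-1 vector."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    h0 = d.mean(axis=1)                          # invariant scalar
    h1 = x - x.mean(axis=0)                      # vector, rotates with R
    return h0, h1, d

def attention_weights(x):
    """Invariant logits: same-degree inner products plus a radial (distance) term."""
    h0, h1, d = features(x)
    logits = (w0 * h0[:, None] * h0[None, :]          # <h_i^(0), h_j^(0)>
              + w1 * h1 @ h1.T                        # <h_i^(1), h_j^(1)>
              - d)                                    # radial term
    np.fill_diagonal(logits, -np.inf)                 # no self-attention
    return softmax(logits, axis=1)

R = Rotation.random(random_state=2).as_matrix()
t = rng.normal(size=3)
alpha, alpha_g = attention_weights(x), attention_weights(x @ R.T + t)
assert np.allclose(alpha, alpha_g)                    # weights are SE(3)-invariant
print("attention weights unchanged under a rigid motion")
```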

Self-interaction and pairwise convolution kernels are parameterized as radial profiles times angular bases,
$$W^{(k,l)\to J}(r_{ij}) = \sum_{m=-J}^{J} R_{k,l,J}(|r_{ij}|)\, Y_J^m(\hat r_{ij})\, C_m^{(k,l)\to J},$$
with $R_{k,l,J}$ a learnable radial network and $C_m^{(k,l)\to J}$ Clebsch–Gordan coefficients.
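
The simplest non-trivial instance of such a kernel maps a degree-0 (scalar) input to a degree-1 (vector) output, for which the Clebsch–Gordan contraction is trivial. The sketch below, with a fixed Gaussian standing in for the learnable radial network $R_{k,l,J}$ (an assumption for illustration), aggregates neighbor scalars into per-node vectors and checks that the output rotates with the input.

```python
import numpy as np
from scipy.spatial.transform import Rotation

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))        # node coordinates
s = rng.normal(size=8)             # degree-0 (scalar) node features, rotation-invariant

def radial(r):
    """Placeholder for the learnable radial network R_{k,l,J}(|r_ij|)."""
    return np.exp(-r**2)

def degree0_to_degree1(x, s):
    """Kernel W^(0,1)->1 ~ radial(|r_ij|) * Y_1(r_hat_ij); the CG contraction is trivial here.
    For l=1, the real spherical harmonics are proportional to the components of the
    unit vector r_hat_ij, so the angular basis is simply the edge direction."""
    rij = x[None, :, :] - x[:, None, :]                  # r_ij = x_j - x_i
    dist = np.linalg.norm(rij, axis=-1)
    np.fill_diagonal(dist, np.inf)                       # skip i == j
    rhat = rij / dist[..., None]
    # v_i = sum_j radial(|r_ij|) * s_j * r_hat_ij  -> a degree-1 (vector) feature
    return np.einsum('ij,j,ijk->ik', radial(dist), s, rhat)

R = Rotation.random(random_state=3).as_matrix()
v, v_rot = degree0_to_degree1(x, s), degree0_to_degree1(x @ R.T, s)
assert np.allclose(v_rot, v @ R.T)                       # output rotates with the input
print("degree-0 -> degree-1 kernel is rotation-equivariant")
```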

3. Implementation Paradigms and Variants

Modern implementations of SE(3)-Transformers include:

  • Continuous Spherical Harmonic Formulation: Features are stored per node as dictionaries $h_i = \{\, l: \text{Tensor}~[C_l, 2l+1] \,\}$ for types $l = 0, 1, 2, \ldots$. All kernel mixing and nonlinearities are organized in the irreducible basis, and layers use equivariant norms or Clebsch–Gordan decompositions (Fuchs et al., 2020, Siguenza et al., 19 Oct 2025); a minimal sketch of this storage scheme follows the list.
  • Discrete Anchor-Based Realizations: E.g., SE3ET adopts a finite set of anchor directions (an octahedral subgroup of $SO(3)$), storing features as $X(p) \in \mathbb{R}^{A \times C}$ per point and enforcing equivariance by permuting anchor indices under discrete group actions (Lin et al., 23 Jul 2024).
  • Fourier/Wigner Domain Architectures: For volumetric or voxel input, features are functions $f: \mathbb{R}^3 \times SO(3) \to \mathbb{C}^C$ and are manipulated via their Fourier (Wigner-D) decompositions. Nonlinearities and normalization act only on the $SO(3)$-norms, ensuring equivariance (Kundu et al., 24 May 2024).
  • Bi-Equivariant and Multi-Input Extensions: For problems like point cloud registration and assembly, architectures can be made equivariant under independent $SE(3) \times SE(3)$ actions (bi-equivariant transformers). Final predicted transformations are extracted via SVD-based projections from pooled equivariant features (Wang et al., 12 Jul 2024).
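
As a concrete illustration of the first (continuous) formulation, the sketch below stores per-node features as a degree-indexed dictionary and applies a self-interaction layer that mixes channels only. Because the weights act on the channel index while rotations act on the $2l+1$ component index, the two operations commute and the layer is equivariant by construction. The shapes and the Cartesian convention $D^{(1)}(R) = R$ are illustrative assumptions, not an excerpt from any cited codebase.

```python
import numpy as np
from scipy.spatial.transform import Rotation

rng = np.random.default_rng(0)

# Per-node features stored by degree l as arrays of shape [C_l, 2l+1].
h = {
    0: rng.normal(size=(4, 1)),   # 4 scalar channels
    1: rng.normal(size=(3, 3)),   # 3 vector channels
}

# Self-interaction: one weight matrix per degree, mixing channels only.
W = {l: rng.normal(size=(2, h[l].shape[0])) for l in h}   # C_out = 2 per degree

def self_interaction(h, W):
    """h'^(l) = W_l @ h^(l): acts on channels, leaves the m-components alone."""
    return {l: W[l] @ h[l] for l in h}

def rotate(h, R):
    """Apply D^(l)(R) per degree: D^(0) = 1, D^(1) = R in the Cartesian basis."""
    return {0: h[0], 1: h[1] @ R.T}

R = Rotation.random(random_state=4).as_matrix()
out_then_rotate = rotate(self_interaction(h, W), R)
rotate_then_out = self_interaction(rotate(h, R), W)
for l in h:
    assert np.allclose(out_then_rotate[l], rotate_then_out[l])
print("self-interaction commutes with rotations (equivariant)")
```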

Major implementations are available in open-source libraries such as DeepChem Equivariant (Siguenza et al., 19 Oct 2025) and in domain-specific research repositories.

4. Applications: Molecular Learning, Vision, Robotics, and 3D Registration

Applications span several fields where geometric symmetries are critical:

  • Molecular Property Prediction: SE(3)-Transformers achieve competitive or superior results on QM9, with MAEs within ~10% of state-of-the-art baselines. For example, DeepChem Equivariant yields $\alpha = 0.182$ bohr$^3$, $\varepsilon_{\mathrm{HOMO}} = 62$ meV, and $\varepsilon_{\mathrm{LUMO}} = 39$ meV (Siguenza et al., 19 Oct 2025).
  • Volumetric and Point Cloud Perception: In 3D shape classification, SE(3)-Transformers and steerable transformer extensions achieve robust performance under arbitrary rotations and translations, e.g., 86.8% accuracy on ModelNet10 under $SO(3)$ perturbations with no augmentation (Kundu et al., 24 May 2024).
  • Robotic Manipulation Policies: EquAct demonstrates SE(3)-equivariant transformers outperforming non-equivariant and multi-view baselines (SAM2ACT, 3DDA) on multi-task robotic benchmarks under varying 3D scene perturbations. Success rates improve drastically under full $SE(3)$ invariance (53.3% vs. 37% for baselines under the hardest generalization regime) (Zhu et al., 27 May 2025).
  • 3D Registration and Alignment: SE3ET and BITR leverage SE(3)-equivariant attention for low-overlap point cloud registration and assembly, yielding high recall and generalization to unseen environments or transformations. BITR further supports bi-equivariance: the alignment of two inputs is stable under independent rigid motions, swapping, and scaling (Wang et al., 12 Jul 2024, Lin et al., 23 Jul 2024).

5. Architectural and Training Details

A canonical workflow for molecular learning in DeepChem Equivariant (Siguenza et al., 19 Oct 2025) includes:

  1. Featurization: Use EquivariantGraphFeaturizer to compute atomic graphs with precomputed spherical harmonics up to degree $L$ and neighbor cutoffs.
  2. Model Construction: Instantiate SE3Transformer with tunable hyperparameters: number of layers ($N$), channels ($C$), degrees ($L$), attention heads ($H$), cutoff, and pooling strategies.
  3. Training: The standard loss is MAE or MSE with Adam (learning rate 1e-4), with optional learning rate scheduling. Equivariance is verified via randomized coordinate transformations; a generic check of this kind is sketched after this list.
  4. Benchmarks and Ablation: Removing attention or lowering irreps/cutoff significantly degrades performance (MAE up by 10–15%; cutoff reduction incurs ~8% worse error).
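
The randomized-transformation check in step 3 can be written generically. The sketch below uses a stand-in invariant model (a sum of Gaussians of pairwise distances) in place of a trained SE3Transformer, since the exact DeepChem API is not reproduced here; the check itself applies to any model mapping coordinates to an invariant prediction.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def dummy_invariant_model(coords):
    """Stand-in for a trained model predicting an invariant (scalar) property."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return np.exp(-d[np.triu_indices(len(coords), k=1)] ** 2).sum()

def check_invariance(model, coords, n_trials=10, tol=1e-8, seed=0):
    """Verify model(R x + t) == model(x) for random rigid motions (R, t)."""
    rng = np.random.default_rng(seed)
    ref = model(coords)
    for _ in range(n_trials):
        R = Rotation.random(random_state=int(rng.integers(1 << 31))).as_matrix()
        t = rng.normal(size=3)
        if abs(model(coords @ R.T + t) - ref) > tol:
            return False
    return True

coords = np.random.default_rng(1).normal(size=(12, 3))   # toy molecular geometry
print("invariance holds:", check_invariance(dummy_invariant_model, coords))
```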

The architecture accommodates extensions such as new irrep types, custom layer sub-classes, or swapping out kernel backends (e.g., with E3NN).

6. Limitations and Research Challenges

Despite their rigor, SE(3)-Transformers incur significant computational cost:

  • Complexity: Dot-product attention is $O(N^2)$ in the number of input points or voxels; spherical harmonic calculations scale with the angular cutoff $L$.
  • Discretization: Some implementations use finite rotation groups, introducing approximation errors. Higher angular resolution (large $L$) improves fidelity but is computationally demanding (Lin et al., 23 Jul 2024).
  • Expressivity vs. Efficiency: There is a tradeoff between expressivity (high $L$, large channel budget) and memory/runtime; diminishing returns are found beyond $L = 2$ or $3$ for many tasks (Siguenza et al., 19 Oct 2025, Fuchs et al., 2020).
  • Beyond Rigid Symmetry: Real-world data may exhibit additional symmetries (swap, scale)—BITR demonstrates how these can be built in, but this is still an area of open research (Wang et al., 12 Jul 2024).

7. Extensions and Impact

SE(3)-Transformers have catalyzed adoption of group-theoretic deep learning across disciplines, enabling models to generalize well under unseen rigid motion and substantially reducing the need for exhaustive data augmentation. Extensions include:

  • Iterative Refinement: Iterative SE(3)-Transformers refine both features and node positions through multiple passes, capturing long-range interactions more accurately for tasks like protein structure prediction or energy minimization (Fuchs et al., 2021).
  • Equivariant Multi-Modal Fusion: EquAct and similar frameworks directly integrate SE(3)-equivariant geometric pathways with SE(3)-invariant language information, supporting multimodal policy learning (Zhu et al., 27 May 2025).
  • Bi-Equivariant and Homogeneous-Space Models: SE(3)-bi-equivariant transformers align point clouds under independent motions and can incorporate further symmetries (swap/scale), yielding strong generalization on composition and assembly tasks (Wang et al., 12 Jul 2024).
  • SE(3)-Equivariant Neural Rendering: Architectures leveraging SE(3)-equivariant transformers in ray space enable reconstruction and view synthesis invariant to camera/object pose with no test-time augmentation (Xu et al., 2022).

SE(3)-Transformer variants thus form a cornerstone of modern equivariant learning, with ongoing developments targeting improved efficiency, expressive power, integration with higher-order input modalities, and seamless inclusion of additional symmetry priors (Siguenza et al., 19 Oct 2025, Fuchs et al., 2020, Fuchs et al., 2021, Kundu et al., 24 May 2024, Lin et al., 23 Jul 2024, Wang et al., 12 Jul 2024, Xu et al., 2022, Zhu et al., 27 May 2025).
