Equiformer: SE(3)-Equivariant Graph Transformers
- Equiformer is a family of SE(3)/E(3)-equivariant graph Transformers that encode geometric symmetry via SO(3) irreducible representations for accurate 3D atomistic modeling.
- It replaces standard Transformer modules with equivariant linear layers, tensor products, and graph attention to maintain rotation and translation consistency.
- Equiformer V2 introduces eSCN convolutions and refined normalization techniques, achieving state-of-the-art performance on benchmarks like QM9, MD17, and OC20.
Equiformer is a family of SE(3)/E(3)-equivariant graph Transformer architectures specifically designed for 3D atomistic graphs, with initial development targeting applications in quantum chemistry, molecular simulation, and materials science. The defining characteristic of Equiformer models is the incorporation of geometric symmetry via irreducible representations (irreps) of the SO(3) group, enabling strict equivariance to rotation and translation in three-dimensional Euclidean space. This leads to sample-efficient learning and improved generalization properties on tasks where physics dictates such symmetries, such as predicting energies, forces, or adsorption properties in molecules and materials (Liao et al., 2022, Liao et al., 2023).
1. Motivation: Symmetry and Equivariance in Atomistic Machine Learning
Machine learning models for atomistic systems must respect fundamental physical symmetries:
- Translational invariance: observable properties (e.g., total energy) remain unchanged if all atomic coordinates are shifted by a constant vector.
- Rotational and inversion equivariance: vector and tensor quantities (e.g., forces, dipoles) must transform covariantly under rotations and reflections according to the action of the relevant symmetry group (SE(3) or E(3)).
Standard Transformers and GNNs are inherently permutation equivariant but not equivariant to rotations, translations, or reflections, so they fail to encode these symmetries natively. Consequently, such models must "learn from scratch" that, for example, molecular energies are invariant under rotation, leading to sample inefficiency and limited generalization to out-of-distribution geometries. Equiformer addresses this by replacing every major Transformer module with its equivariant counterpart, ensuring the proper geometric inductive bias throughout the network (Liao et al., 2022).
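As a concrete illustration (a minimal NumPy sketch, independent of the Equiformer codebase), an energy defined purely in terms of interatomic distances is automatically invariant under rigid rotations and translations; equivariant architectures guarantee this property by construction rather than leaving it to be learned:

```python
import numpy as np

def toy_energy(positions):
    """Toy invariant 'energy': sum of inverse pairwise distances (hypothetical model)."""
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    iu = np.triu_indices(len(positions), k=1)
    return np.sum(1.0 / dist[iu])

rng = np.random.default_rng(0)
pos = rng.normal(size=(5, 3))                      # 5 atoms in 3D

# Random rotation (QR of a Gaussian matrix gives an orthogonal matrix) and translation.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(Q) < 0:                           # ensure a proper rotation (det = +1)
    Q[:, 0] *= -1
t = rng.normal(size=3)

print(np.isclose(toy_energy(pos), toy_energy(pos @ Q.T + t)))  # True: invariant
```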
2. Equiformer Architecture: Equivariant Transformers via Irreducible Representations
Equiformer generalizes the Transformer block to operate on features organized by SO(3) irreducible representations. Node features consist of stacked channels of various angular momentum degrees ℓ = 0, 1, …, L_max, each of dimension 2ℓ + 1, transforming under the Wigner-D matrices D^(ℓ)(R). Linear layers and normalization are implemented per ℓ-block to preserve equivariance:
- Equivariant Linear: independent linear transformation per degree ℓ, commuting with the action of SO(3).
- Equivariant LayerNorm: channel-wise normalization within each ℓ-block; only scalars (ℓ = 0) get biases.
- Tensor Products: depth-wise Clebsch–Gordan couplings mix information across degrees ℓ, maintaining rotational equivariance but with much lower computational cost than full coupling.
The architecture proceeds through alternating blocks of multi-head equivariant self-attention and equivariant feed-forward networks, mirroring the Transformer framework but with SO(3)-equivariant operations (Liao et al., 2022).
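The per-degree operations above can be sketched in plain PyTorch (a simplified illustration under an assumed {degree: (nodes, channels, 2ℓ+1)} feature layout, not the reference implementation; libraries such as e3nn provide production versions). The essential point is that the weights mix channels only and are shared across the 2ℓ+1 components of each block, so the map commutes with the Wigner-D action on the last axis:

```python
import torch
import torch.nn as nn

class PerDegreeLinear(nn.Module):
    """Equivariant linear layer: one channel-mixing matrix per degree l.
    Features are stored as {l: tensor of shape (nodes, channels_l, 2l+1)} (assumed layout).
    Because weights act on the channel axis only and are shared over the 2l+1
    components, the layer commutes with any Wigner-D rotation applied to the last axis."""

    def __init__(self, channels_in: dict, channels_out: dict):
        super().__init__()
        self.weights = nn.ParameterDict({
            str(l): nn.Parameter(torch.randn(channels_out[l], channels_in[l]) / channels_in[l] ** 0.5)
            for l in channels_in
        })
        # Only scalars (l = 0) may carry a bias without breaking equivariance.
        self.bias0 = nn.Parameter(torch.zeros(channels_out[0], 1)) if 0 in channels_in else None

    def forward(self, feats: dict) -> dict:
        out = {}
        for l, x in feats.items():
            y = torch.einsum("oc,ncm->nom", self.weights[str(l)], x)
            if l == 0 and self.bias0 is not None:
                y = y + self.bias0
            out[l] = y
        return out

# Example: the QM9-style 128x(l=0), 64x(l=1), 32x(l=2) feature layout.
feats = {0: torch.randn(10, 128, 1), 1: torch.randn(10, 64, 3), 2: torch.randn(10, 32, 5)}
layer = PerDegreeLinear({0: 128, 1: 64, 2: 32}, {0: 128, 1: 64, 2: 32})
out = layer(feats)
print({l: tuple(v.shape) for l, v in out.items()})
```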
3. Equivariant Graph Attention Mechanism
Equiformer’s key innovation is Equivariant Graph Attention, a generalization of standard graph attention that ensures both attention weights and messages are equivariant:
- Message Feature Construction: For a pair of nodes (i, j), pre-message features are constructed by linear projections of the node features and geometry-dependent tensor products with spherical harmonics Y^(ℓ)(r̂_ij) of the edge direction, the latter modulated by a radial network over interatomic distances.
- MLP-based Attention Weights: Attention scalars are derived from the invariant (ℓ = 0) part of the message via a learnable MLP, guaranteeing rotation/inversion invariance of the attention coefficients.
- Nonlinear Equivariant Messages: Message values passed along edges use gating and tensor products to enable nonlinear, fully equivariant updates.
- Multi-head and Add–Norm: As in standard Transformers, multiple parallel attention heads are used, followed by equivariant normalization.
These operations allow expressive and physically consistent information propagation across nodes while maintaining geometric constraints (Liao et al., 2022).
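The attention pattern can be illustrated with a deliberately simplified PyTorch sketch that keeps only the invariant (ℓ = 0) pathway; the edge layout, feature names, and radial basis here are assumptions for illustration, and the full equivariant message (tensor products with spherical harmonics, gating) is omitted:

```python
import torch
import torch.nn as nn

class InvariantGraphAttention(nn.Module):
    """Sketch of MLP-based attention weights computed from invariant (l = 0) features.
    Because every input to the MLP is a rotation-invariant scalar, the resulting
    attention coefficients are invariant as well."""

    def __init__(self, dim: int, num_rbf: int = 16):
        super().__init__()
        self.attn_mlp = nn.Sequential(
            nn.Linear(2 * dim + num_rbf, dim), nn.SiLU(), nn.Linear(dim, 1)
        )
        self.centers = nn.Parameter(torch.linspace(0.0, 8.0, num_rbf), requires_grad=False)

    def rbf(self, d):
        # Gaussian radial basis of the interatomic distance (an invariant scalar).
        return torch.exp(-(d[:, None] - self.centers[None, :]) ** 2)

    def forward(self, x0, pos, edge_index):
        src, dst = edge_index                                  # message flows src -> dst
        d = (pos[src] - pos[dst]).norm(dim=-1)
        logits = self.attn_mlp(torch.cat([x0[src], x0[dst], self.rbf(d)], dim=-1)).squeeze(-1)
        # Softmax over the incoming edges of each destination node.
        alpha = torch.zeros_like(logits)
        for node in dst.unique():
            mask = dst == node
            alpha[mask] = torch.softmax(logits[mask], dim=0)
        # Weighted aggregation of (here: scalar) messages.
        out = torch.zeros_like(x0)
        out.index_add_(0, dst, alpha[:, None] * x0[src])
        return out

x0 = torch.randn(6, 32)                 # invariant node features
pos = torch.randn(6, 3)                 # atom positions
edge_index = torch.tensor([[0, 1, 2, 3], [1, 0, 1, 2]])
print(InvariantGraphAttention(32)(x0, pos, edge_index).shape)
```

Restricting the attention MLP to invariant inputs is what keeps the coefficients rotation-invariant; the actual Equiformer message additionally carries higher-degree components built from tensor products with spherical harmonics.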
4. Advancements: Equiformer V2 and Scaling to High-Angular-Resolution
The original Equiformer architecture’s computational cost grows steeply with angular resolution, since SO(3) tensor products scale as O(L_max^6), practically limiting earlier models to L_max = 2 or 3. Equiformer V2 introduces the following architectural improvements to scale up efficiently (Liao et al., 2023):
- eSCN Convolutions: Replace SO(3) tensor products with edge-aligned SO(2) convolutions (eSCN), reducing the per-edge cost from O(L_max^6) to O(L_max^3) and making higher degrees (e.g., L_max = 6) tractable for large datasets.
- Attention Re-normalization: Addition of LayerNorm before the attention activation stabilizes training at high channel count and degree.
- Separable Nonlinearity: Decouples scalar and equivariant nonlinearities, applying SiLU directly to half the scalar channels and S²-activation to higher degrees, improving gradient propagation.
- Separable LayerNorm: Applies channel normalization to scalars and block-wise RMS normalization to non-scalar features, preserving amplitude structure in high-degree blocks.
These changes allow for deeper models, greater angular resolution, and improved data efficiency.
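A minimal sketch of the separable-normalization idea, under the same assumed feature layout as before (EquiformerV2's actual implementation differs in details such as how statistics and learnable scales are grouped):

```python
import torch
import torch.nn as nn

class SeparableLayerNorm(nn.Module):
    """Sketch: standard LayerNorm for scalars (l = 0), RMS-style normalization for l > 0.
    Non-scalar blocks are rescaled by the root-mean-square norm of the block, which keeps
    the direction (and hence the equivariance) of each degree-l vector intact."""

    def __init__(self, channels: dict, eps: float = 1e-6):
        super().__init__()
        self.scalar_norm = nn.LayerNorm(channels[0])
        self.scales = nn.ParameterDict({str(l): nn.Parameter(torch.ones(channels[l]))
                                        for l in channels if l > 0})
        self.eps = eps

    def forward(self, feats: dict) -> dict:
        out = {0: self.scalar_norm(feats[0].squeeze(-1)).unsqueeze(-1)}
        for l, x in feats.items():                      # x: (nodes, channels_l, 2l+1)
            if l == 0:
                continue
            rms = x.pow(2).mean(dim=(1, 2), keepdim=True).sqrt()   # per node, over the block
            out[l] = x / (rms + self.eps) * self.scales[str(l)][None, :, None]
            # No bias for l > 0: adding a constant would break equivariance.
        return out

feats = {0: torch.randn(10, 128, 1), 1: torch.randn(10, 64, 3), 2: torch.randn(10, 32, 5)}
norm = SeparableLayerNorm({0: 128, 1: 64, 2: 32})
print({l: tuple(v.shape) for l, v in norm(feats).items()})
```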
5. Training Protocols and Empirical Performance
Training regimes:
- Depths of 6–18 blocks are typical.
- Channel counts depend on the maximal degree; e.g., for QM9 tasks, L_max = 2, with 128, 64, and 32 channels for ℓ = 0, 1, 2, respectively (see the irreps sketch after this list).
- Local attention is enforced via a radial cutoff; batch sizes and optimizer settings follow standard OCP practice.
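For reference, the QM9 feature layout above can be written in e3nn's irreps notation (the library the original Equiformer builds on); the snippet below, assuming standard e3nn APIs, also checks that a degree-wise linear layer on this layout commutes with rotations:

```python
import torch
from e3nn import o3

# 128 scalar (l=0), 64 vector (l=1), 32 degree-2 (l=2) channels, all even parity.
irreps = o3.Irreps("128x0e + 64x1e + 32x2e")
print(irreps.dim)                      # 128*1 + 64*3 + 32*5 = 480

lin = o3.Linear(irreps, irreps)        # degree-wise (block-diagonal) linear map
x = irreps.randn(10, -1)               # 10 nodes with random features

R = o3.rand_matrix()                   # random 3D rotation
D = irreps.D_from_matrix(R)            # its block-diagonal Wigner-D representation

# Equivariance check: rotating inputs then applying the layer equals
# applying the layer then rotating outputs.
print(torch.allclose(lin(x @ D.T), lin(x) @ D.T, atol=1e-5))
```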
Performance summary:
| Dataset | Metric | Equiformer (V1) | Equiformer V2 |
|---|---|---|---|
| QM9 | Energy MAE | ~30 meV | Improved 9/12 targets |
| MD17 | Force MAE | 2–7 meV/Å | – |
| OC20 S2EF | Energy MAE | 17.1 meV (V2) | 15.0 meV (V2) |
| OC20 S2EF | Force MAE | 15.6 meV/Å (V2) | 14.2 meV/Å (V2) |
| OC20 IS2RE | Energy MAE | ~0.44 eV | – |
| OC22 S2EF | Energy MAE | – | 22.88 meV (V2) |
| OC22 S2EF | Force MAE | – | 30.70 meV/Å (V2) |
Equiformer matches or surpasses state-of-the-art performance on QM9, MD17, OC20, and OC22 tasks. V2 sets new marks for data efficiency and accuracy, with a 2× reduction in the DFT evaluations needed for force/adsorption targets and up to 9% lower force MAE compared to baselines (Liao et al., 2023, Liao et al., 2022).
6. Applications and Limitations in Materials Science
In recent benchmarking on the Open DAC 2023 dataset, the Equiformer V2 ("EqV2-ODAC") variant was trained to predict adsorption energies of CO₂ and H₂O in metal-organic frameworks (MOFs) (Brabson et al., 10 Jun 2025). The model, with 153 million parameters, incorporated per-atom features, edge embeddings based on a radial cutoff (8 Å), and eSCN-based equivariant attention consistent with prior design.
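For context, the 8 Å radial cutoff determines which atom pairs exchange messages. A brute-force sketch of such neighbor-graph construction (a hypothetical helper; real MOF workflows must also handle periodic boundary conditions and use cell lists for efficiency):

```python
import torch

def radius_graph(pos: torch.Tensor, cutoff: float = 8.0):
    """Build directed edges between all atom pairs closer than `cutoff` (in Å).
    Brute-force O(N^2) version for clarity; no periodic boundary conditions."""
    dist = torch.cdist(pos, pos)                          # (N, N) pairwise distances
    src, dst = torch.nonzero(dist < cutoff, as_tuple=True)
    mask = src != dst                                     # drop self-loops
    edge_index = torch.stack([src[mask], dst[mask]])
    edge_dist = dist[src[mask], dst[mask]]
    return edge_index, edge_dist

pos = torch.rand(50, 3) * 20.0                            # toy 50-atom structure in a 20 Å box
edge_index, edge_dist = radius_graph(pos, cutoff=8.0)
print(edge_index.shape, edge_dist.shape)
```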
- Strengths: EqV2-ODAC ranks among the top three methods for total adsorption energy error (MAE ≈ 0.22 eV), matching the performance of CHGNet and MACE-MP-0, and outperforming classical force fields like UFF4MOF especially in the chemisorption regime.
- Limitations: Without explicit force-consistency training, EqV2-ODAC cannot generate physically consistent atomic forces and thus is inapplicable to geometry optimization or molecular dynamics within these systems. Empirical MAEs remain above the ~0.10 eV desired for reliable adsorptive predictions, and correlation with DFT remains modest.
A plausible implication is that while EqV2-ODAC demonstrates the power of large-scale equivariant graph Transformer architectures for adsorption-energy prediction, enforcing force consistency and expanding the diversity of data (especially for strongly interacting and flexible systems) will be required for routine and quantitative materials discovery applications (Brabson et al., 10 Jun 2025).
7. Computational Complexity, Scalability, and Outlook
- Complexity:
  - Original Equiformer: O(L_max^6) per edge for SO(3) tensor products.
  - Equiformer V2 (with eSCN): O(L_max^3) per edge, enabling high-degree models on large datasets (see the back-of-envelope comparison after this list).
- Scaling: All models use local attention (restricted edge neighborhoods, radius-based), ensuring linear growth with system size. Depth-wise tensor products and efficient block structures maintain tractability at scale.
- Parameter counts: Original models typically 3–10M parameters; V2 models (e.g., EqV2-ODAC) up to 153M.
- Runtime: QM9 ~10 min/epoch (A6000); OC20/OC22 large models scale proportionally.
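A back-of-envelope comparison of the per-edge scaling figures quoted above (constant factors ignored; illustrative only):

```python
# Compare per-edge tensor-product cost scaling using the O(L^6) vs O(L^3)
# figures quoted above (constant factors ignored).
for L in (2, 3, 6):
    so3 = L ** 6          # full SO(3) tensor products (original Equiformer)
    so2 = L ** 3          # eSCN SO(2) convolutions (Equiformer V2)
    print(f"L_max={L}: SO(3) ~ {so3:5d}  vs  eSCN ~ {so2:3d}  (ratio {so3 // so2}x)")
```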
Future directions identified in the literature include:
- Integrating explicit force and stress losses to ensure energy-conserving predictions, with forces obtained as F_i = −∂E/∂r_i, unlocking force-consistent ML potentials for molecular simulations (see the autograd sketch after this list).
- Optimizing CUDA kernels for faster tensor product computation and memory efficiency.
- Mixed classical–ML architectures to leverage known chemistry for building blocks with equivariant attention for long-range interactions.
- Exploring sparse/efficient attention schemes for very large atomistic systems and dynamic degree (ℓ) filtering techniques.
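A minimal sketch of how energy-conserving forces can be obtained by automatic differentiation (a generic PyTorch pattern rather than the Equiformer training code; the toy energy function is a stand-in for a learned invariant model):

```python
import torch

def toy_energy(pos: torch.Tensor) -> torch.Tensor:
    """Stand-in for a learned invariant energy model (hypothetical)."""
    dist = torch.cdist(pos, pos)
    iu = torch.triu_indices(pos.shape[0], pos.shape[0], offset=1)
    return (1.0 / dist[iu[0], iu[1]]).sum()

pos = torch.randn(5, 3, requires_grad=True)
energy = toy_energy(pos)

# Energy-conserving forces: F_i = -dE/dr_i, obtained by differentiating the
# predicted energy with respect to atomic positions.
forces = -torch.autograd.grad(energy, pos, create_graph=True)[0]
print(forces.shape)   # (5, 3); create_graph=True lets a force loss be backpropagated
```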
Equiformer demonstrates the feasibility and utility of geometrically equivariant Transformer architectures for 3D atomistic modeling, providing a template for future deep learning approaches in molecular and materials science (Liao et al., 2022, Liao et al., 2023, Brabson et al., 10 Jun 2025).