
E2Former-V2: Scalable Equivariant Transformer

Updated 30 January 2026
  • The paper introduces EAAS, an algebraic sparsification method that reduces dense SO(3) tensor product computations, significantly lowering memory and compute overhead.
  • It implements a fused, on-chip streaming attention kernel that processes neighbor interactions in a single pass, ensuring efficient use of SRAM and linear activation memory.
  • Benchmark results demonstrate up to 20× improved throughput and scalability, enabling large-scale molecular energy and force predictions on commodity GPUs.

E2Former-V2 is a hardware-aware, node-centric equivariant transformer architecture for scalable modeling of large 3D atomistic systems in the context of SO(3)-equivariant graph neural networks (EGNNs). It eliminates memory and compute bottlenecks typical in mainstream equivariant architectures by combining algebraic sparsification of SO(3) tensor products (Equivariant Axis-Aligned Sparsification, EAAS) and a fused, on-the-fly attention kernel that operates with linear activation memory. E2Former-V2 is designed to efficiently train large models on commodity GPU hardware, demonstrated by substantial improvements in throughput and scalability while maintaining state-of-the-art predictive accuracy on tasks such as molecular energy and force prediction (Huang et al., 23 Jan 2026).

1. Architectural Principles

E2Former-V2 adopts a strictly node-centric message update paradigm, contrasting with conventional EGNNs, which execute edge-centric operations requiring $O(N \cdot K)$ activation memory, where $N$ is the number of atoms and $K$ is the average number of neighbors. In E2Former-V2, the update rule for each node $i$ takes the form:

$$\hat{h}_i = \Big( \sum_{j \in \mathcal{N}(i)} \alpha_{ij} \, [h_j \otimes \mathcal{R}(r_j)] \Big) \otimes \mathcal{R}(r_i)$$

where $\alpha_{ij}$ is a geometry-aware scalar attention weight, $h_j$ is the source node feature, $\mathcal{R}(r_j)$ is a position-based encoding, and $\otimes$ denotes the tensor product. This factorization, introduced in E2Former-V1, isolates the edge score $\alpha_{ij}$ from high-dimensional tensor products, thereby reducing activation memory. E2Former-V2 further replaces dense SO(3) tensor products with the algebraically sparse EAAS operator and realizes attention computation as a streaming reduction retained in on-chip SRAM for optimal performance.
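The node-centric factorization can be sketched in plain NumPy. This is a toy illustration, not the paper's implementation: features and position encodings are ordinary vectors here, and the tensor product is simplified to an elementwise product so the structure (per-node buffers only, scalar attention weights outside the product) stays visible.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, d = 6, 3, 4                      # atoms, neighbors per atom, feature dim

h = rng.normal(size=(N, d))            # node features h_j
R = rng.normal(size=(N, d))            # stand-in for position encodings R(r_j)
nbr = rng.integers(0, N, size=(N, K))  # neighbor index tensor
alpha = rng.random(size=(N, K))
alpha /= alpha.sum(axis=1, keepdims=True)  # scalar attention weights

# sum_j alpha_ij * (h_j ∘ R(r_j)), then ∘ R(r_i): the weights multiply the
# already-combined neighbor term, so no per-edge feature tensor is stored
msg = (alpha[..., None] * (h[nbr] * R[nbr])).sum(axis=1)
h_hat = msg * R

print(h_hat.shape)  # (6, 4)
```

The same structure carries over when the elementwise product is replaced by a true SO(3) tensor product: only the scalar `alpha` lives on edges, while all high-dimensional features stay node-sized.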

2. Equivariant Axis-Aligned Sparsification (EAAS)

EAAS reworks the evaluation of the equivariant tensor products fundamental to SO(3)-based models. In standard Clebsch–Gordan (CG) coupling, each output $(h^{(\ell_i)} \otimes \mathcal{R}^{(\ell_f)}(r))^{(\ell_o)}_{m_o}$ requires summation over all magnetic indices with CG coefficients. By applying a frame rotation $R$ aligning $r$ with the $z$-axis (as proved in Lemma 4.1), the solid spherical harmonics reduce to $\mathcal{R}^{(\ell_f)}_m(Rr) \propto \delta_{m,0}$, making the coupling maximally sparse: only terms with $m_f = 0$ contribute. The block-diagonal, permutation-like re-indexing operator $\mathcal{P}$ for each irreducible block is defined as:

$$(\mathcal{P}(\tilde{h}))^{(\ell_o)}_{m_o} := \begin{cases} C^{(\ell_o,m_o)}_{(\ell_i,m_o),(\ell_f,0)}\, \tilde{h}^{(\ell_i)}_{m_o}, & \text{if } L_\Sigma \text{ even} \\ -2\,(-1)^{m_o}\, C^{(\ell_o,m_o)}_{(\ell_i,-m_o),(\ell_f,0)}\, \tilde{h}^{(\ell_i)}_{-m_o}, & \text{if } L_\Sigma \text{ odd} \end{cases}$$

where $L_\Sigma = \ell_i + \ell_f + \ell_o$. The result is an elimination of dense summations: each output order $m_o$ is determined by exactly one input order $m_i$. After rotating back to the original frame,

$$(h^{(\ell_i)} \otimes \mathcal{R}^{(\ell_f)}(r))^{(\ell_o)} = \mathcal{P}(\tilde{h}) \, D^{(\ell_o)}_{R^{-1}}$$

All Wigner-6j recoupling is subsumed into this sparse re-indexing, obviating the $O(L^6)$ and $O(L^3)$ tensor products of previous approaches.
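The sparsity that EAAS exploits can be checked numerically for the simplest nontrivial case, $\ell = 1$, where (up to normalization, and in the Cartesian ordering assumed for this sketch) the solid harmonics of $r$ are just $(x, y, z)$. Rotating $r$ onto the $z$-axis leaves a single nonzero harmonic component, so only one magnetic index survives in the coupling:

```python
import numpy as np

r = np.array([0.3, -1.2, 0.8])
z = np.array([0.0, 0.0, 1.0])

r_hat = r / np.linalg.norm(r)

# Build an orthonormal frame whose third row is r_hat; as a rotation matrix,
# R maps r_hat onto the z-axis.
e1 = np.cross(r_hat, z)
e1 /= np.linalg.norm(e1)
e2 = np.cross(r_hat, e1)
R = np.stack([e1, e2, r_hat])

assert np.allclose(R @ R.T, np.eye(3))  # proper orthonormal frame

harm = R @ r          # l=1 solid harmonics of the rotated vector Rr
print(np.round(harm, 6))  # only the z-aligned (m=0) entry is nonzero
```

For higher $\ell$ the same alignment kills every $m_f \neq 0$ harmonic, which is exactly what collapses the dense CG sum to the one-term re-indexing operator $\mathcal{P}$ above.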

3. Streaming On-the-Fly Equivariant Attention

E2Former-V2 implements attention via a fused Triton kernel that operates directly on neighbor indices. For each node $i$, attention scores and value accumulations are computed in a single streaming pass over its $K$ neighbors, avoiding materialization of any $N \times K$ edge arrays. The core loop (Algorithm 1) proceeds as follows:

  • Indirectly gather $k_j$ and $h'_j$ via the index tensor $I \in \mathbb{Z}^{N \times K}$.
  • For each neighbor $k$:
    • Compute the score $s \leftarrow \tau \langle q_i, k_j \rangle + b(r_{ij})$.
    • Update the running max $\mu'$, normalization $z$, and accumulated value $A$ (with radial weights $\varphi(r_{ij})$).
  • The final output is $m_i = A / z$.
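The streaming pass above can be mimicked in NumPy with the standard online-softmax recurrence. This is a stand-in sketch for the fused Triton kernel, for one node: $\tau$, the bias $b(r_{ij})$, and the radial weights $\varphi(r_{ij})$ are random placeholders here.

```python
import numpy as np

rng = np.random.default_rng(1)
K, d = 8, 16
q = rng.normal(size=d)            # query of node i
k = rng.normal(size=(K, d))       # gathered neighbor keys k_j
v = rng.normal(size=(K, d))       # gathered neighbor values h'_j
tau = 1.0
b = rng.normal(size=K)            # placeholder for bias b(r_ij)
phi = rng.random(size=K)          # placeholder for radial weights phi(r_ij)

mu, z, A = -np.inf, 0.0, np.zeros(d)   # running max, normalizer, accumulator
for j in range(K):                      # single streaming pass over neighbors
    s = tau * (q @ k[j]) + b[j]         # attention score
    mu_new = max(mu, s)
    scale = np.exp(mu - mu_new)         # rescale previously accumulated sums
    z = z * scale + np.exp(s - mu_new)
    A = A * scale + np.exp(s - mu_new) * phi[j] * v[j]
    mu = mu_new

m_i = A / z

# The one-pass result matches the dense softmax computed all at once:
s_all = tau * (k @ q) + b
w = np.exp(s_all - s_all.max())
w /= w.sum()
print(np.allclose(m_i, (w * phi) @ v))  # True
```

Only the scalars $\mu$, $z$ and the $d$-dimensional accumulator $A$ are kept per node, which is why the kernel's working set fits in SRAM regardless of $K$.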

This kernel fuses query-key scoring, softmax normalization, and value aggregation into a single reduction, ensuring all intermediate activations reside in SRAM. The memory footprint is strictly O(N), as only node-centric features are retained; overall arithmetic complexity remains O(N·K·d), with every key and value loaded exactly once per interaction.

4. Quantitative Evaluation and Scalability

Performance Metrics

E2Former-V2 demonstrates significant improvements in both efficiency and predictive accuracy:

| Benchmark | Compared Baseline | E2Former-V2 (Direct/Cons.) | Comments |
|---|---|---|---|
| SPICE Dimers | MACE-Large: 0.54 (MAE) | E2V2-Direct: 0.28 (MAE) | Forces: 2.40 (vs. 6.62) meV/Å |
| OMol25 Val-Comp | eSEN-small: 1.27 (MAE) | E2V2-Cons.: 1.27 (MAE) | Comparable force errors |
| Steps/s at N=1k | Allegro/EquiformerV2 | 140 (E2V2-Direct), ≈10× faster | Sustained speedup |
| Steps/s at N=100k | Prior EGNNs OOM | E2V2-Cons.: 0.29 | Only EGNN to scale |

Latency comparisons reveal a consistent $\gtrsim 6\times$ speedup in both first- and second-order tensor product operations against SO(3)-based baselines. Attention kernel throughput on NVIDIA H20 saturates the compute roofline, achieving higher TFLOPS while using less than 1/5 the memory of conventional implementations.

5. Implementation and Hardware Integration

Custom Triton kernels are central to the architecture’s performance. By fusing both query-key (QK) and value (V) paths, the kernel maximizes reuse of on-chip SRAM, minimizes high-bandwidth memory (HBM) traffic, and eliminates the synchronization overhead characteristic of multi-kernel or PyTorch-level operations. The sparse neighbor index tensor ensures that indirect gathers for neighbor features do not require materializing $N \times K$ arrays. Online normalization via a running maximum $\mu$ and normalization factor $z$ preserves numerical stability and avoids storing full softmax matrices.
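The need for the running maximum is easy to demonstrate: a naive softmax overflows for large logits, while subtracting the maximum (which the fused kernel maintains online) stays finite. This is a generic numerical illustration, not kernel code:

```python
import numpy as np

scores = np.array([800.0, 801.0, 799.5])  # large attention logits

with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(scores) / np.exp(scores).sum()   # inf / inf -> nan

stable = np.exp(scores - scores.max())              # max-shifted softmax
stable /= stable.sum()

print(np.isnan(naive).any(), np.isclose(stable.sum(), 1.0))  # True True
```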

Kernels are specifically tuned to the NVIDIA H20 architecture; compute tiles pack multiple atoms and attention heads into single blocks, optimized for both FP16 and FP32 utilization. The resulting activation memory remains O(N) across all geometry and attention states, enabling batch and system sizes exceeding tenfold those permissible in prior models on commodity GPUs.

6. Operator Fusion Strategy and Comparative Frameworks

E2Former-V2’s operator fusion strategy is illustrated by contrasting three approaches:

  • Traditional EGNNs: Materialized edge-centric scores and values, yielding O(N·K) memory overhead.
  • FlashAttention: Uses pre-built sparse masks; intermediate arrays are still constructed.
  • E2Former-V2 fused kernel: Both QK scoring and value aggregation are combined into a multi-stage Triton kernel, maintaining only node-level buffers in SRAM and performing softmax/value reduction in a single pass.

This fusion approach, together with EAAS, enables a $20\times$ improvement in arithmetic throughput (TFLOPS) and an order-of-magnitude faster inference at moderate atom counts, while permitting reliable scaling to 100k atoms, a feature unique among transformer-based EGNNs at publication.

7. Applications, Availability, and Implications

E2Former-V2 is validated on challenging 3D atomistic modeling tasks, including energy and force prediction for small molecules, solvent-solute systems, and large-scale molecular benchmarks. Its efficient training and inference profile demonstrates the feasibility of deploying large equivariant transformers on widely available GPUs. The source code and implementation details are accessible at https://github.com/IQuestLab/UBio-MolFM/tree/e2formerv2, supporting reproducibility and further experimentation. A plausible implication is the opening of hardware-efficient, scalable equivariant modeling for broader adoption in computational chemistry, materials science, and related disciplines (Huang et al., 23 Jan 2026).
