E2Former-V2: Scalable Equivariant Transformer
- The paper introduces EAAS, an algebraic sparsification method that reduces dense SO(3) tensor product computations, significantly lowering memory and compute overhead.
- It implements a fused, on-chip streaming attention kernel that processes neighbor interactions in a single pass, ensuring efficient use of SRAM and linear activation memory.
- Benchmark results demonstrate up to 20× improved throughput and scalability, enabling large-scale molecular energy and force predictions on commodity GPUs.
E2Former-V2 is a hardware-aware, node-centric equivariant transformer architecture for scalable modeling of large 3D atomistic systems in the context of SO(3)-equivariant graph neural networks (EGNNs). It eliminates memory and compute bottlenecks typical in mainstream equivariant architectures by combining algebraic sparsification of SO(3) tensor products (Equivariant Axis-Aligned Sparsification, EAAS) and a fused, on-the-fly attention kernel that operates with linear activation memory. E2Former-V2 is designed to efficiently train large models on commodity GPU hardware, demonstrated by substantial improvements in throughput and scalability while maintaining state-of-the-art predictive accuracy on tasks such as molecular energy and force prediction (Huang et al., 23 Jan 2026).
1. Architectural Principles
E2Former-V2 adopts a strictly node-centric message update paradigm, contrasting with conventional EGNNs, which execute edge-centric operations requiring O(N·K) activation memory, where N is the number of atoms and K is the average number of neighbors. In E2Former-V2, the update rule for each node $i$ takes the form:

$$\mathbf{h}_i = \sum_{j \in \mathcal{N}(i)} a_{ij}\, \big(\mathbf{x}_j \otimes \mathbf{E}(\vec{r}_{ij})\big),$$

where $a_{ij}$ is a geometry-aware scalar attention weight, $\mathbf{x}_j$ is the source node feature, $\mathbf{E}(\vec{r}_{ij})$ is a position-based encoding, and $\otimes$ denotes the tensor product. This factorization, introduced in E2Former-V1, isolates the edge score from the high-dimensional tensor products, thereby reducing activation memory. E2Former-V2 further replaces dense SO(3) tensor products with the algebraically sparse EAAS operator and realizes attention computation as a streaming reduction retained in on-chip SRAM for optimal performance.
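The factorized update above can be sketched in plain NumPy. The encoding below (a unit-vector stand-in for $\mathbf{E}$), the function name, and all tensor shapes are illustrative assumptions, not the paper's actual operator; the point is that the scalar score stays outside the tensor product, which is formed and summed per neighbor rather than stored edge-wise:

```python
import numpy as np

def node_centric_update(x, pos, nbr_idx, att):
    """Toy sketch of h_i = sum_j a_ij (x_j (x) E(r_ij)).
    x: (N, d) node features; pos: (N, 3) coordinates;
    nbr_idx: (N, K) neighbor indices; att: (N, K) scalar attention weights.
    The high-dimensional product is accumulated immediately, so no
    (N*K)-sized edge tensor is ever materialized."""
    N, d = x.shape
    K = nbr_idx.shape[1]
    h = np.zeros((N, d, 3))
    for i in range(N):
        for k in range(K):
            j = nbr_idx[i, k]
            r_ij = pos[j] - pos[i]
            E = r_ij / (np.linalg.norm(r_ij) + 1e-9)  # stand-in encoding
            h[i] += att[i, k] * np.outer(x[j], E)     # x_j (x) E(r_ij)
    return h
```

Because the attention weight $a_{ij}$ enters only as a scalar multiplier, the output is linear in the scores, which is what allows the edge-score computation to be decoupled from the tensor-product path.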
2. Equivariant Axis-Aligned Sparsification (EAAS)
EAAS innovates the evaluation of equivariant tensor products fundamental to SO(3)-based models. In standard Clebsch–Gordan (CG) coupling, each output component requires summation over all magnetic indices with CG coefficients:

$$(\mathbf{u} \otimes \mathbf{v})^{\ell_3}_{m_3} = \sum_{m_1, m_2} C^{\ell_3 m_3}_{\ell_1 m_1,\, \ell_2 m_2}\; u^{\ell_1}_{m_1}\, v^{\ell_2}_{m_2}.$$

By applying a frame rotation aligning $\vec{r}_{ij}$ with the $z$-axis (as proved in Lemma 4.1), the solid spherical harmonics reduce to $Y_{\ell}^{m}(\hat{z}) \propto \delta_{m,0}$, making the coupling maximally sparse: only terms with $m_2 = 0$, and hence $m_3 = m_1$, contribute. The block-diagonal, permutation-like re-indexing operator for each irreducible block $(\ell_1, \ell_2) \to \ell_3$ is defined as:

$$\big[\Pi^{(\ell_1 \ell_2)\ell_3}\big]_{m_3 m_1} = C^{\ell_3 m_3}_{\ell_1 m_1,\, \ell_2 0}\; \delta_{m_3 m_1},$$

where $|\ell_1 - \ell_2| \le \ell_3 \le \ell_1 + \ell_2$. The result is an elimination of dense summations: each output order $m_3$ is determined by exactly one input order $m_1 = m_3$. After rotating back to the original frame,

$$(\mathbf{u} \otimes \mathbf{v})^{\ell_3} = \mathbf{D}^{\ell_3}(R)^{\top}\, \Pi^{(\ell_1 \ell_2)\ell_3}\, \mathbf{D}^{\ell_1}(R)\, \mathbf{u}^{\ell_1},$$

where $\mathbf{D}^{\ell}(R)$ are the Wigner-D matrices of the aligning rotation $R$ and the constant value of the aligned harmonic is absorbed into $\Pi$. All Wigner-6j recoupling is subsumed into this sparse re-indexing, obviating the O(L^6) and O(L^3) tensor products of previous approaches.
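The selection rule behind this sparsity can be checked numerically with SymPy's Clebsch–Gordan coefficients; the order choice $(\ell_1, \ell_2, \ell_3) = (1, 1, 2)$ is just an illustrative example, not the model's configuration:

```python
from sympy.physics.quantum.cg import CG

l1, l2, l3 = 1, 1, 2
# Dense CG coupling: every (m1, m2) pair may feed every output m3.
dense_terms = [(m3, m1, m2)
               for m3 in range(-l3, l3 + 1)
               for m1 in range(-l1, l1 + 1)
               for m2 in range(-l2, l2 + 1)
               if CG(l1, m1, l2, m2, l3, m3).doit() != 0]
# Axis-aligned frame: the spherical harmonic collapses to m2 = 0,
# so each output m3 is fed by exactly one input m1 = m3.
sparse_terms = [t for t in dense_terms if t[2] == 0]
assert all(m3 == m1 for m3, m1, _ in sparse_terms)
print(len(dense_terms), len(sparse_terms))  # 9 dense terms vs 3 sparse
```

The surviving coupling is diagonal in the magnetic index, which is exactly what makes the re-indexing operator permutation-like rather than a dense matrix.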
3. Streaming On-the-Fly Equivariant Attention
E2Former-V2 implements attention via a Triton fused kernel that operates directly on neighbor indices. For each node $i$, attention scores and value accumulations are computed in a single streaming pass over its neighbors, avoiding materialization of any N×K edge arrays. The core loop (Algorithm 1) proceeds as follows:
- Indirectly gather keys $k_j$ and values $v_j$ via the neighbor index tensor.
- For each neighbor $j$:
  - Compute score $s_{ij} = q_i \cdot k_j$.
  - Update running max $m_i$, normalization $Z_i$, and accumulated value $o_i$ (with radial weights $w_{ij}$).
- Final output is $o_i / Z_i$.
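The per-node loop amounts to an online-softmax reduction. A minimal NumPy sketch for a single node follows; the function name and scalar radial weights `w` are assumptions for illustration, not the kernel's interface:

```python
import numpy as np

def streaming_attention(q, K, V, w):
    """One-pass (online-softmax) attention for a single node.
    q: (d,) query; K: (K, d) neighbor keys; V: (K, dv) neighbor values;
    w: (K,) radial weights. Running max m, normalizer Z, and accumulator o
    are updated per neighbor, so no score vector is ever stored."""
    m, Z = -np.inf, 0.0
    o = np.zeros(V.shape[1])
    for k_j, v_j, w_j in zip(K, V, w):
        s = q @ k_j                          # attention score
        m_new = max(m, s)
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        p = np.exp(s - m_new)
        Z = Z * scale + p                    # rescale, then accumulate
        o = o * scale + p * w_j * v_j        # radial weight on the value
        m = m_new
    return o / Z
```

Rescaling the partial sums by `exp(m - m_new)` whenever the running maximum changes is what lets the softmax be normalized exactly without a second pass over the neighbors.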
This kernel fuses query-key scoring, softmax normalization, and value aggregation into a single reduction, ensuring all intermediate activations reside in SRAM. The memory footprint is strictly O(N), as only node-centric features are retained; overall arithmetic complexity remains O(N·K·d), with every key and value loaded exactly once per interaction.
4. Quantitative Evaluation and Scalability
Performance Metrics
E2Former-V2 demonstrates significant improvements in both efficiency and predictive accuracy:
| Benchmark | Baseline | E2Former-V2 (Direct/Cons.) | Notes |
|---|---|---|---|
| SPICE Dimers | MACE-Large: 0.54 (MAE) | E2V2-Direct: 0.28 (MAE) | Forces: 2.40 (vs. 6.62) meV/Å |
| OMol25 Val-Comp | eSEN-small: 1.27 (MAE) | E2V2-Cons.: 1.27 (MAE) | Comparable force errors |
| Steps/s at N=1k | Allegro/EquiformerV2 | 140 (E2V2-Direct), ≈10× faster | Sustained speedup |
| Steps/s at N=100k | Prior EGNNs OOM | E2V2-Cons.: 0.29 | Only EGNN to scale |
Latency comparisons reveal a consistent speedup in both first- and second-order tensor product operations against SO(3)-based baselines. Attention kernel throughput on NVIDIA H20 saturates the compute roofline, achieving up to 20× higher TFLOPS and using less than 1/5 the memory of conventional implementations.
5. Implementation and Hardware Integration
Custom Triton kernels are central to the architecture’s performance. By fusing both query-key (QK) and value (V) paths, the kernel maximizes reuse of on-chip SRAM, minimizes high-bandwidth memory (HBM) traffic, and eliminates the synchronization overhead characteristic of multi-kernel or PyTorch-level operations. The sparse neighbor index tensor ensures that indirect gathers for neighbor features do not require materializing N×K arrays. Online normalization via a running maximum and normalization factor preserves numerical stability and avoids storing full softmax matrices.
Kernels are specifically tuned to the NVIDIA H20 architecture; compute tiles pack multiple atoms and attention heads into single blocks, optimized for both FP16 and FP32 utilization. The resulting activation memory remains O(N) across all geometry and attention states, enabling batch and system sizes exceeding tenfold those permissible in prior models on commodity GPUs.
6. Operator Fusion Strategy and Comparative Frameworks
E2Former-V2’s operator fusion strategy is illustrated by contrasting three approaches:
- Traditional EGNNs: Materialized edge-centric scores and values, yielding O(N·K) memory overhead.
- FlashAttention: Uses pre-built sparse masks; intermediate arrays are still constructed.
- E2Former-V2 fused kernel: Both QK scoring and value aggregation are combined into a multi-stage Triton kernel, maintaining only node-level buffers in SRAM and performing softmax/value reduction in a single pass.
This fusion approach, together with EAAS, enables a 20× improvement in arithmetic throughput (TFLOPS) and an order-of-magnitude faster inference at moderate atom counts, while permitting reliable scaling to 100k atoms, a feature unique among transformer-based EGNNs at publication.
7. Applications, Availability, and Implications
E2Former-V2 is validated on challenging 3D atomistic modeling tasks, including energy and force prediction for small molecules, solvent-solute systems, and large-scale molecular benchmarks. Its efficient training and inference profile demonstrates the feasibility of deploying large equivariant transformers on widely available GPUs. The source code and implementation details are accessible at https://github.com/IQuestLab/UBio-MolFM/tree/e2formerv2, supporting reproducibility and further experimentation. A plausible implication is the opening of hardware-efficient, scalable equivariant modeling for broader adoption in computational chemistry, materials science, and related disciplines (Huang et al., 23 Jan 2026).