EigenFold: Protein Structure Diffusion Modeling

Updated 28 February 2026

The paper introduces a diffusion process on backbone normal-mode coordinates to efficiently model protein tertiary structure ensembles.
It employs a cascading-resolution strategy that recovers global structure first, reducing inference steps from O(10^3–10^4) to O(10^2).
The framework integrates an SE(3)-equivariant GNN with denoising-score matching for robust uncertainty quantification and ensemble validation.

EigenFold is a generative modeling framework for protein structure prediction based on diffusion processes in the space of backbone conformations. Unlike single-structure deterministic predictors, EigenFold is designed to explicitly sample from a distribution over possible tertiary structures conditioned on a protein sequence, thereby enabling the modeling of conformational ensembles and structural uncertainty. The core innovation is a diffusion process defined on the normal-mode coordinates of a graph representation of the protein backbone, yielding a cascading-resolution generative strategy that is computationally efficient and theoretically grounded (Jing et al., 2023).

1. Mathematical Foundations

At its foundation, EigenFold treats the protein backbone as a mechanical system of overdamped harmonic oscillators. For a protein of $m$ residues, a graph $G=(\mathcal V, \mathcal E)$ is constructed where vertices correspond to $\mathrm{C}_\alpha$ atoms and edges connect adjacent residues. The concatenated coordinates are denoted $x \in \mathbb R^{3m}$. The system energy is given by

$E(x) = \tfrac12 x^T H x, \quad H_{ij} = \begin{cases} \alpha & \text{if } (i, j) \in \mathcal E, \\ 0 & \text{otherwise} \end{cases}$

with $\alpha = \tfrac{3}{(3.8\,\mathrm{\AA})^2}$ enforcing equilibrium bond-lengths.

The forward diffusion is governed by overdamped Langevin dynamics:

$dx_t = -\tfrac12 H x_t\,dt + dW_t$

where $W_t$ is standard Brownian noise. The stationary distribution is

$p_\infty(x) \propto \exp\left(-\tfrac12 x^T H x\right)$
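
A short worked step shows why this value of $\alpha$ corresponds to a 3.8 Å bond length. It assumes the energy takes the chain form $E(x) = \tfrac{\alpha}{2}\sum_i \|x_{i+1}-x_i\|^2$, which is an assumption about how $H$ is assembled from the edges. Under $p_\infty$ each bond vector $b_i = x_{i+1}-x_i$ is then an isotropic Gaussian with per-axis variance $1/\alpha$, so

$\mathbb E\,\|b_i\|^2 = \tfrac{3}{\alpha} = (3.8\,\mathrm{\AA})^2$

and the root-mean-square distance between consecutive $\mathrm{C}_\alpha$ atoms under the stationary law is 3.8 Å.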

Eigendecomposition $H = P \Lambda P^T$ provides “normal-mode” coordinates $y = P^T x$, in which the SDE decouples:

$dy_{t,i} = -\tfrac12 \lambda_i y_{t,i}\,dt + dW_{t,i}$

Here, $\lambda_i$ are the eigenvalues of $H$ (the diagonal entries of $\Lambda$) and the columns of $P$ are the corresponding eigenvectors, so that each mode diffuses independently.
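
As a minimal sketch of this change of basis, the NumPy snippet below builds $H$ for a short chain and projects toy coordinates onto its eigenvectors. It assumes that $H$ acts as $\alpha$ times the graph Laplacian of the chain, and it stores coordinates as an $(m, 3)$ array so that one $m \times m$ matrix serves all three axes. The function names are illustrative and are not taken from the EigenFold code.

```python
import numpy as np

ALPHA = 3.0 / 3.8**2  # spring constant from the text, in 1/Angstrom^2

def chain_hessian(m, alpha=ALPHA):
    """Assumed form of H: alpha times the graph Laplacian of a linear chain."""
    H = np.zeros((m, m))
    for i in range(m - 1):
        H[i, i] += alpha
        H[i + 1, i + 1] += alpha
        H[i, i + 1] -= alpha
        H[i + 1, i] -= alpha
    return H

def normal_modes(H):
    """Eigendecomposition H = P diag(lam) P^T, eigenvalues in ascending order."""
    lam, P = np.linalg.eigh(H)
    return lam, P

m = 8
H = chain_hessian(m)
lam, P = normal_modes(H)

rng = np.random.default_rng(0)
x = rng.normal(size=(m, 3))   # toy C-alpha coordinates, one row per residue
y = P.T @ x                   # normal-mode coordinates, one row per mode
x_back = P @ y                # the inverse transform recovers x

energy_x = 0.5 * np.sum(x * (H @ x))
energy_y = 0.5 * np.sum(lam[:, None] * y**2)  # the energy is diagonal in y
```

In this toy Laplacian the smallest eigenvalue is zero and corresponds to a uniform translation of the chain; all other eigenvalues are positive.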

A diagnostic of the process is the KL divergence between the process at time $t$ and the stationary law:

$\mathrm{KL}[p_{t|0} \| p_\infty] = \sum_{i=1}^{3m} \Bigl[e^{-\lambda_i t}(E_i-\tfrac12) - \tfrac12\log (1-e^{-\lambda_i t})\Bigr], \quad E_i = \tfrac12 \lambda_i (y_0)_i^2$

The eigenvalue spectrum spans several orders of magnitude, so low-frequency (small $\lambda_i$) modes relax slowly toward the stationary law, while high-frequency (large $\lambda_i$) modes equilibrate rapidly. This separation of time scales motivates EigenFold’s cascading-resolution generation.
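
The closed-form KL expression can be checked numerically. The sketch below evaluates it per mode and compares it with the generic KL divergence between two one-dimensional Gaussians, using the mean $y_0 e^{-\lambda t/2}$ and variance $(1-e^{-\lambda t})/\lambda$ of the Ornstein–Uhlenbeck transition kernel. The eigenvalues and the time $t$ are arbitrary illustrative values.

```python
import numpy as np

def kl_to_stationary(lam, y0, t):
    """Per-mode KL[p_{t|0} || p_inf] from the closed form in the text."""
    E = 0.5 * lam * y0**2
    decay = np.exp(-lam * t)
    return decay * (E - 0.5) - 0.5 * np.log1p(-decay)

def gaussian_kl(mu, var, var_ref):
    """KL[N(mu, var) || N(0, var_ref)], used as an independent check."""
    return 0.5 * (var / var_ref + mu**2 / var_ref - 1.0 - np.log(var / var_ref))

lam = np.array([1e-3, 1e-1, 1e1])   # eigenvalues spanning four orders of magnitude
y0 = 1.0 / np.sqrt(lam)             # gives every mode the same energy E_i = 1/2
t = 5.0

kl_formula = kl_to_stationary(lam, y0, t)

# Moments of the transition kernel for dy = -0.5 * lam * y dt + dW
mu_t = y0 * np.exp(-0.5 * lam * t)
var_t = (1.0 - np.exp(-lam * t)) / lam
kl_check = gaussian_kl(mu_t, var_t, 1.0 / lam)
```

At a fixed time the stiffest mode is already indistinguishable from its stationary distribution, while the softest mode is still far from it.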

2. Cascading-Resolution Generative Process

EigenFold’s generative process does not treat all $3m$ modes uniformly. Instead, it “freezes” mode $i$ once $\lambda_i t > \tau$ for a threshold $\tau$, so only modes with $\lambda_i t \le \tau$ remain active at time $t$. At inference, the process proceeds in $K$ stages, ordered by increasing eigenvalue:

Initialization: sample the $k$ weakest modes from the stationary law at $t = T$.
For $j = k+1, \dots, 3m$:
- At $t_j = \tau/\lambda_{(j)}$ , activate mode $\lambda_{(j)}$ by sampling it from its marginal prior.
- Perform a small-step reverse Euler–Maruyama update for active modes, using the learned score network.
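
The loop above can be sketched in mode coordinates as follows. This is a toy under stated assumptions, not the released sampler: the time grid is geometric, a newly activated mode is drawn from the stationary $\mathcal N(0, 1/\lambda)$ as one reading of "marginal prior", and the learned score network is replaced by the exact score for a hypothetical point-mass data distribution at `y0`.

```python
import numpy as np

def cascading_reverse_sample(lam, score_fn, tau=0.5, n_steps=300, t_end=1e-4, seed=0):
    """Toy reverse diffusion in mode coordinates with threshold-based activation.

    lam      : positive eigenvalues, one per mode (zero modes removed beforehand)
    score_fn : callable (y, t) -> estimate of d/dy log p_t(y), one entry per mode
    """
    rng = np.random.default_rng(seed)
    lam = np.asarray(lam, dtype=float)
    t_start = tau / lam.min()                 # only the softest mode is active here
    ts = np.geomspace(t_start, t_end, n_steps + 1)
    y = np.zeros_like(lam)
    active = np.zeros(lam.shape, dtype=bool)
    for t, t_next in zip(ts[:-1], ts[1:]):
        # Activate modes whose threshold time tau / lam has been reached. On a
        # discrete grid this is the first grid time at or below tau / lam.
        newly = (lam * t <= tau * (1 + 1e-9)) & ~active
        y[newly] = rng.normal(size=newly.sum()) / np.sqrt(lam[newly])
        active |= newly
        # Reverse-time Euler-Maruyama step for dy = -0.5 * lam * y dt + dW
        dt = t - t_next
        drift = 0.5 * lam * y + score_fn(y, t)
        noise = np.sqrt(dt) * rng.normal(size=lam.shape)
        y = np.where(active, y + drift * dt + noise, y)
    return y, active

lam = np.array([0.01, 0.1, 1.0, 10.0])
y0 = np.array([3.0, -2.0, 1.0, 0.5])   # hypothetical data point in mode space

def point_mass_score(y, t):
    """Exact score of p_t when the data distribution is a point mass at y0."""
    mean = y0 * np.exp(-0.5 * lam * t)
    var = (1.0 - np.exp(-lam * t)) / lam
    return -(y - mean) / var

y_gen, active = cascading_reverse_sample(lam, point_mass_score)
```

With this toy score every mode is eventually activated and the sample ends close to `y0`, with the softest mode fixed first and the stiffest last.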

By this scheme, global (low-frequency) structure is recovered first, before finer (high-frequency) details. As a result, effective sampling is achieved in $O(10^2)$ steps, a significant computational improvement over naive $O(10^3$–$10^4)$ strategies.

The learned score function $s_\theta(x, t) \approx \nabla_x \log p_t(x)$ is parameterized as an SE(3)-equivariant graph neural network (GNN) using e3nn. The graph is complete over the $m$ residues, with per-node and per-edge features from OmegaFold Geoformer embeddings. The output is an SE(3)-equivariant 3D force field, which guarantees that the generated density is invariant under rotations and translations.
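
The equivariance property can be illustrated without e3nn. The sketch below uses a hypothetical hand-written force field built only from pairwise displacements (it is not the EigenFold network) and checks that rotating and translating the input rotates the output and leaves it otherwise unchanged.

```python
import numpy as np

def toy_force_field(x):
    """Hypothetical equivariant field: F_i = sum_j w(|x_j - x_i|) * (x_j - x_i)."""
    diff = x[None, :, :] - x[:, None, :]          # diff[i, j] = x_j - x_i
    dist = np.linalg.norm(diff, axis=-1)          # invariant pairwise distances
    weight = np.exp(-dist**2 / 25.0)              # any function of distance works
    return np.sum(weight[..., None] * diff, axis=1)

def random_rotation(rng):
    """Proper rotation matrix from the QR decomposition of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1.0                           # flip one axis so that det = +1
    return Q

rng = np.random.default_rng(0)
x = rng.normal(scale=5.0, size=(10, 3))           # toy coordinates, one row per residue
R = random_rotation(rng)
shift = rng.normal(size=3)

f_original = toy_force_field(x)
f_transformed = toy_force_field(x @ R.T + shift)  # rotate and translate the input
```

Equivariance means `f_transformed` equals `f_original @ R.T`: the translation drops out and the rotation carries through to the output vectors.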

3. Training Protocol and Implementation

Training employs classic denoising-score matching:

$\mathcal L(\theta) = \mathbb E_{t \sim \mathcal U(0, T),\, x_0 \sim \mathrm{PDB}} \Bigl\|s_\theta(x_t, t) - \nabla_{x_t}\log p_t(x_t | x_0)\Bigr\|^2$

No additional regularizers are applied.
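
A minimal NumPy sketch of this objective, written in mode coordinates where the transition kernel is a diagonal Gaussian, is given below. Because $P$ is orthogonal, the squared norm of the score residual is the same in $x$ and in $y$. The small lower bound on $t$, the batch layout, and the oracle score used for the check are assumptions made for illustration.

```python
import numpy as np

def ou_moments(y0, lam, t):
    """Mean and standard deviation of y_t given y_0 for dy = -0.5*lam*y dt + dW."""
    mean = y0 * np.exp(-0.5 * lam * t)
    std = np.sqrt((1.0 - np.exp(-lam * t)) / lam)
    return mean, std

def dsm_loss(score_fn, y0_batch, lam, T, rng):
    """Monte Carlo estimate of the denoising score matching loss.

    y0_batch : (B, n_modes) clean data in mode coordinates
    lam      : (n_modes,) positive eigenvalues
    """
    B = y0_batch.shape[0]
    t = rng.uniform(1e-3, T, size=(B, 1))     # avoid t = 0, where the kernel degenerates
    mean, std = ou_moments(y0_batch, lam, t)
    eps = rng.normal(size=y0_batch.shape)
    y_t = mean + std * eps                    # sample from p_t(. | y_0)
    target = -eps / std                       # equals grad log p_t(y_t | y_0)
    residual = score_fn(y_t, t) - target
    return np.mean(np.sum(residual**2, axis=1))

lam = np.array([0.05, 0.5, 5.0])
y0_batch = np.random.default_rng(1).normal(size=(16, 3))

def oracle_score(y_t, t):
    """Conditional score that knows y0_batch; only useful as a sanity check."""
    mean, std = ou_moments(y0_batch, lam, t)
    return -(y_t - mean) / std**2

def zero_score(y_t, t):
    return np.zeros_like(y_t)

loss_oracle = dsm_loss(oracle_score, y0_batch, lam, T=10.0, rng=np.random.default_rng(2))
loss_zero = dsm_loss(zero_score, y0_batch, lam, T=10.0, rng=np.random.default_rng(2))
```

The oracle attains a loss of zero up to rounding, which confirms that the regression target matches the conditional score; a trained network sees only `y_t` and `t`, so its minimizer is the marginal score instead.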

Datasets:

Training: All PDB structures up to 2020-04-30, sequence length 20–256; 230k proteins.
Validation: PDB structures from 2020-05-01 to 2020-11-30; 14k proteins.
Test: 183 recent CAMEO targets (Aug–Oct 2022), fewer than 750 residues.

Key hyperparameters:

Batch size: 64
Learning rate: $5 \times 10^{-4}$, decayed on plateau
Noise schedule: OU with threshold $\tau = 0.5$
Inference steps: 100–300 (protein-size dependent)

Technical setup:

Implemented in PyTorch with e3nn, run on NVIDIA V100 GPUs
Average training: $\sim$ 5 days on 8 GPUs
Per-protein inference: 2–5 minutes

4. Performance on Benchmark Tasks

On the CAMEO dataset, five structures are sampled per target and ranked using an approximate ELBO (no auxiliary ranking network). Table 1 summarizes accuracy metrics versus baselines; each cell reports mean / median over targets:

Method	RMSD_Cα ↓	TMScore ↑	GDT-TS ↑	lDDT_Cα ↑
AlphaFold2	3.30 / 1.64	0.87 / 0.95	0.86 / 0.91	0.90 / 0.93
ESMFold	3.99 / 2.03	0.85 / 0.93	0.83 / 0.88	0.87 / 0.90
OmegaFold	5.26 / 2.62	0.80 / 0.89	0.77 / 0.84	0.83 / 0.89
RoseTTAFold	5.72 / 3.17	0.77 / 0.84	0.71 / 0.75	0.79 / 0.82
EigenFold	7.37 / 3.50	0.75 / 0.84	0.71 / 0.79	0.78 / 0.85

Median TMScore for EigenFold is 0.84.

5. Ensemble-Based Uncertainty Quantification

Structural deviation metrics $f$ (e.g., TMScore, RMSD_Cα, per-residue lDDT) can be ensemble-averaged as:

$f_{\mathrm{var}} = \frac{1}{N(N-1)} \sum_{i\neq j} f(y_i, y_j), \quad y_i, y_j \sim \mathrm{EigenFold}$

This $f_{\mathrm{var}}$ correlates strongly with the true prediction error $f_{\mathrm{exp}} = \frac{1}{N} \sum_i f(y_i, x^\star)$, where $x^\star$ is the reference structure, with Pearson correlations at the protein level:

TMScore: $r = 0.88$
GDT-TS: $r = 0.90$
lDDT_Cα: $r = 0.86$

At the residue level, expected pairwise lDDT among samples predicts per-residue lDDT (mean per-target $r = 0.73$, median $0.81$). Thus, the sampled ensemble serves as an estimator of predictive confidence, obviating the need for specialized network heads for confidence prediction.
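
The two ensemble statistics can be computed with a few lines of Python. In the sketch below the metric `plain_rmsd` is a placeholder (RMSD without superposition), chosen only so that the example is self-contained; it is not TMScore, GDT-TS, or lDDT, and the three "samples" are synthetic.

```python
import numpy as np
from itertools import permutations

def ensemble_variation(samples, f):
    """f_var: average of f over all ordered pairs (i, j) with i != j."""
    n = len(samples)
    total = sum(f(samples[i], samples[j]) for i, j in permutations(range(n), 2))
    return total / (n * (n - 1))

def expected_error(samples, reference, f):
    """f_exp: average of f between each sample and the reference structure."""
    return sum(f(y, reference) for y in samples) / len(samples)

def plain_rmsd(a, b):
    """Placeholder metric: RMSD without superposition."""
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=-1))))

# Three synthetic two-atom "structures" offset along the diagonal
samples = [np.full((2, 3), c) for c in (0.0, 1.0, 2.0)]
reference = np.zeros((2, 3))

f_var = ensemble_variation(samples, plain_rmsd)
f_exp = expected_error(samples, reference, plain_rmsd)
```

Only `f_var` is available at prediction time, since it needs no reference structure; the reported correlations are what justify using it as a proxy for `f_exp`.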

6. Modeling Conformational Heterogeneity

Tests on curated sets evaluated the capacity of EigenFold to sample conformationally heterogeneous states:

77 fold-switching pairs [Chakravarty et al., 2022]
90 apo/holo ligand-induced pairs [Saldano et al., 2022]

For each sequence, five sampled structures $\{y_i\}$ are evaluated. The coverage of two conformers $x_1, x_2$ by the ensemble is:

$\mathrm{TM}_{\mathrm{ens}} = \frac12 \left[ \max_i \mathrm{TM}(y_i, x_1) + \max_i \mathrm{TM}(y_i, x_2) \right]$
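
A small sketch of this coverage score follows. The similarity function is a hypothetical stand-in for TMScore that equals 1 only for identical coordinates, and the two "conformers" are synthetic arrays.

```python
import numpy as np

def ensemble_coverage(samples, x1, x2, similarity):
    """Average of the best similarity to each of the two reference conformers."""
    best_1 = max(similarity(y, x1) for y in samples)
    best_2 = max(similarity(y, x2) for y in samples)
    return 0.5 * (best_1 + best_2)

def toy_similarity(a, b):
    """Placeholder score in (0, 1]; not TMScore."""
    msd = np.mean(np.sum((a - b) ** 2, axis=-1))
    return float(1.0 / (1.0 + msd))

x1 = np.zeros((4, 3))   # synthetic conformer 1
x2 = np.ones((4, 3))    # synthetic conformer 2

cover_both = ensemble_coverage([x1, x2], x1, x2, toy_similarity)
cover_one = ensemble_coverage([x1, x1], x1, x2, toy_similarity)
```

An ensemble containing both conformers reaches the maximum score, whereas an ensemble that only ever returns one conformer is penalized by the similarity between the two references.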

EigenFold's ensemble coverage did not systematically exceed a baseline of always sampling one conformation. The sample diversity

$\mathrm{TM}_{\mathrm{var}} = \mathbb E_{i<j}[\mathrm{TM}(y_i, y_j)]$

correlates only weakly with inter-conformer TMScore: $r = 0.36$ (fold-switching), $r = 0.12$ (apo/holo). Residue-level variance correlates moderately ($r \approx 0.28$–$0.41$) with sample positional variance. Thus, EigenFold primarily models its own epistemic uncertainty, rather than capturing fine-grained biophysical heterogeneity of true conformational ensembles.

7. Software and Resources

Source code, pretrained embeddings, and hyperparameter configurations are available through https://github.com/bjing2016/EigenFold (Jing et al., 2023). The implementation leverages PyTorch and e3nn, and employs OmegaFold Geoformer embeddings for input features.


References

Jing et al. (2023). EigenFold: Generative Protein Structure Prediction with Diffusion Models.
