EigenFold: Protein Structure Diffusion Modeling
- The paper introduces a diffusion process on backbone normal-mode coordinates to efficiently model protein tertiary structure ensembles.
- It employs a cascading-resolution strategy that recovers global structure first, reducing inference steps from O(10^3–10^4) to O(10^2).
- The framework integrates an SE(3)-equivariant GNN with denoising-score matching for robust uncertainty quantification and ensemble validation.
EigenFold is a generative modeling framework for protein structure prediction based on diffusion processes in the space of backbone conformations. Unlike single-structure deterministic predictors, EigenFold is designed to explicitly sample from a distribution over possible tertiary structures conditioned on a protein sequence, thereby enabling the modeling of conformational ensembles and structural uncertainty. The core innovation is a diffusion process defined on the normal-mode coordinates of a graph representation of the protein backbone, yielding a cascading-resolution generative strategy that is computationally efficient and theoretically grounded (Jing et al., 2023).
1. Mathematical Foundations
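At its foundation, EigenFold treats the protein backbone as a mechanical system of overdamped harmonic oscillators. For a protein of $m$ residues, a graph is constructed where vertices correspond to Cα atoms and edges connect adjacent residues. The concatenated coordinates are denoted $\mathbf{x} \in \mathbb{R}^{3m}$, with $\mathbf{x}_j \in \mathbb{R}^3$ the position of residue $j$. The system energy is given by
$$E(\mathbf{x}) = \frac{\alpha}{2} \sum_{j=1}^{m-1} \|\mathbf{x}_{j+1} - \mathbf{x}_j\|^2 = \frac{1}{2}\mathbf{x}^\top H \mathbf{x},$$
with the stiffness constant $\alpha$ enforcing equilibrium bond lengths.
The forward diffusion is governed by overdamped Langevin dynamics:
$$d\mathbf{x} = -\tfrac{1}{2} H \mathbf{x}\, dt + d\mathbf{w},$$
where $\mathbf{w}$ is standard Brownian noise. The stationary distribution is
$$p_\infty(\mathbf{x}) \propto \exp\left(-\tfrac{1}{2}\mathbf{x}^\top H \mathbf{x}\right).$$
Eigendecomposition $H = P \Lambda P^\top$ provides “normal-mode” coordinates $\mathbf{y} = P^\top \mathbf{x}$, in which the SDE decouples:
$$dy_i = -\tfrac{1}{2}\lambda_i y_i\, dt + dw_i.$$
Here, $\lambda_i$ are the eigenvalues and the columns of $P$ are the eigenvectors of $H$, so that each mode diffuses independently.
A diagnostic of the process is the KL divergence between the process at time $t$, started from a given $\mathbf{y}(0)$, and the stationary law:
$$D_{\mathrm{KL}}\left(p_t \,\|\, p_\infty\right) = \frac{1}{2}\sum_i \left[e^{-\lambda_i t}\left(\lambda_i y_i(0)^2 - 1\right) - \log\left(1 - e^{-\lambda_i t}\right)\right],$$
where the sum runs over modes with $\lambda_i > 0$.
The eigenvalue spectrum spans several orders of magnitude, so low-frequency (small $\lambda_i$) modes persist, while high-frequency (large $\lambda_i$) modes equilibrate rapidly. This modal stiffness motivates EigenFold’s cascading-resolution generation.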
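
The harmonic-chain picture in this section can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: it assumes a toy chain of 50 nodes along a single coordinate axis, a unit stiffness `alpha`, and the helper names `chain_hessian`, `forward_marginal`, and `kl_to_stationary`, all of which are illustrative choices rather than anything taken from the paper or its code.

```python
import numpy as np

def chain_hessian(m, alpha=1.0):
    """Hessian of E = (alpha/2) * sum_j (x_{j+1} - x_j)^2 along one axis.

    The energy separates across the three axes, so the full 3m x 3m
    Hessian is this m x m matrix repeated per axis.
    """
    H = np.zeros((m, m))
    for j in range(m - 1):
        H[j, j] += alpha
        H[j + 1, j + 1] += alpha
        H[j, j + 1] -= alpha
        H[j + 1, j] -= alpha
    return H

def forward_marginal(y0, lam, t):
    """Mean and variance of y_i(t) given y_i(0) for dy = -(lam/2) y dt + dw."""
    mean = y0 * np.exp(-lam * t / 2)
    var = (1 - np.exp(-lam * t)) / lam
    return mean, var

def kl_to_stationary(y0, lam, t):
    """Per-mode KL divergence from N(mean, var) to the stationary N(0, 1/lam)."""
    decay = np.exp(-lam * t)
    return 0.5 * (decay * (lam * y0**2 - 1) - np.log1p(-decay))

m = 50                                  # toy chain length (assumption)
H = chain_hessian(m)
lam, P = np.linalg.eigh(H)              # ascending eigenvalues, orthonormal eigenvectors
lam, P = lam[1:], P[:, 1:]              # drop the zero-eigenvalue translation mode

rng = np.random.default_rng(0)
x0 = np.cumsum(rng.normal(size=m))      # toy starting chain along one axis
y0 = P.T @ x0                           # normal-mode coordinates

kl_early = kl_to_stationary(y0, lam, t=1.0)
kl_late = kl_to_stationary(y0, lam, t=100.0)
print("eigenvalue range:", lam[0], lam[-1])
print("KL at t=1, softest vs stiffest mode:", kl_early[0], kl_early[-1])
```

In this toy setting the stiffest mode is already close to its stationary law at `t=1.0`, while the softest mode is still far from it, which is the separation of time scales that the cascading scheme relies on.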
2. Cascading-Resolution Generative Process
EigenFold’s generative process does not treat all $3m$ modes uniformly. Instead, it “freezes” each mode once it has effectively equilibrated, as judged against a fixed threshold, so only the modes that have not yet equilibrated remain active at time $t$. At inference, the process proceeds in stages, indexed by increasing eigenvalues:
- Initialization: sample the weakest (lowest-eigenvalue) modes from the stationary law at the starting time.
- For each remaining mode $i$, in order of increasing eigenvalue:
  - At its activation time, activate mode $i$ by sampling it from its marginal prior.
  - Perform a small-step reverse Euler–Maruyama update for active modes, using the learned score network.
By this scheme, global (low-frequency) structure is recovered first, before finer (high-frequency) details. As a result, effective sampling is achieved in $O(10^2)$ steps, a significant computational improvement over naive $O(10^3)$–$O(10^4)$ strategies.
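
The staged procedure can be illustrated with a toy sampler in mode space. This is a sketch under several assumptions that are not taken from the paper: the activation rule `lam * t <= cutoff`, the value `cutoff=5.0`, the geometric time grid, and the step count are hypothetical, and an exact score for a Gaussian toy target (`gaussian_score`) stands in for the learned score network.

```python
import numpy as np

def gaussian_score(data_var, lam):
    """Exact score for a toy Gaussian data law N(0, data_var) per mode (assumption)."""
    def score_fn(y, t, active):
        decay = np.exp(-lam[active] * t)
        var_t = data_var[active] * decay + (1 - decay) / lam[active]
        return -y[active] / var_t
    return score_fn

def cascade_sample(lam, score_fn, cutoff=5.0, t_min=1e-3, n_steps=200, rng=None):
    """Reverse-diffuse in mode space, activating modes from soft to stiff.

    lam      : positive eigenvalues sorted in ascending order
    score_fn : callable (y, t, active) -> score of the active modes;
               a stand-in for the learned score network
    cutoff   : HYPOTHETICAL rule, mode i is active when lam_i * t <= cutoff
    """
    rng = np.random.default_rng() if rng is None else rng
    t_max = 0.99 * cutoff / lam[0]      # softest mode is active from the start
    times = np.geomspace(t_max, t_min, n_steps + 1)
    y = np.zeros_like(lam)
    active = np.zeros(lam.shape, dtype=bool)
    n_active_trace = []
    for t, t_next in zip(times[:-1], times[1:]):
        # Activate newly unfrozen modes from the stationary law N(0, 1/lam).
        newly = (lam * t <= cutoff) & ~active
        y[newly] = rng.normal(size=newly.sum()) / np.sqrt(lam[newly])
        active |= newly
        # Reverse Euler-Maruyama step for dy = -(lam/2) y dt + dw, run backward in time.
        dt = t - t_next
        drift = 0.5 * lam[active] * y[active] + score_fn(y, t, active)
        y[active] += drift * dt + np.sqrt(dt) * rng.normal(size=active.sum())
        n_active_trace.append(int(active.sum()))
    return y, n_active_trace

m = 50
lam = 2.0 * (1.0 - np.cos(np.pi * np.arange(1, m) / m))  # toy chain spectrum
data_var = 0.5 / lam                                     # toy target variance (assumption)
y, trace = cascade_sample(lam, gaussian_score(data_var, lam),
                          rng=np.random.default_rng(0))
print("active modes at first and last step:", trace[0], trace[-1])
```

The trace of active-mode counts grows as time runs backward, so the soft, global modes are denoised over many steps while the stiff, local modes only join near the end.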
The learned score function is parameterized as an SE(3)-equivariant graph neural network (GNN) using e3nn. The graph is complete over the residues, with per-node and per-edge features from OmegaFold Geoformer embeddings. The output is an SE(3)-equivariant 3D force field, guaranteeing generative density invariance under rotations and translations.
3. Training Protocol and Implementation
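Training employs classic denoising score matching, whose standard form is
$$\mathcal{L}(\theta) = \mathbb{E}_{t,\,\mathbf{x}(0),\,\mathbf{x}(t)}\left[\left\|\mathbf{s}_\theta(\mathbf{x}(t), t) - \nabla_{\mathbf{x}(t)} \log p_t\left(\mathbf{x}(t) \mid \mathbf{x}(0)\right)\right\|^2\right],$$
where $\mathbf{s}_\theta$ is the learned score network and $p_t(\cdot \mid \mathbf{x}(0))$ is the forward transition kernel.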
No additional regularizers are applied.
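
To make the training objective concrete, here is a minimal NumPy sketch of a denoising score matching loss for a decoupled process of the form dy = -(lam/2) y dt + dw. It is illustrative only: the function name `dsm_loss`, the unweighted squared error, the toy eigenvalues, and the `oracle` score (exact only because the toy data law is a single point) are assumptions, and the actual model is a neural network trained in PyTorch.

```python
import numpy as np

def dsm_loss(score_fn, y0, lam, t, rng):
    """Monte Carlo estimate of the denoising score matching loss at time t.

    y0  : (batch, modes) clean mode coordinates
    lam : (modes,) positive eigenvalues
    """
    decay = np.exp(-lam * t / 2)
    var = (1 - np.exp(-lam * t)) / lam
    noise = rng.normal(size=y0.shape)
    yt = y0 * decay + np.sqrt(var) * noise       # sample from the transition kernel
    target = -(yt - y0 * decay) / var            # score of that Gaussian kernel
    pred = score_fn(yt, t)
    return np.mean(np.sum((pred - target) ** 2, axis=-1))

lam = np.array([0.01, 0.1, 1.0, 4.0])            # toy eigenvalues (assumption)
y_star = np.array([1.0, -2.0, 0.5, 0.0])         # toy data law: a single point
y0 = np.tile(y_star, (256, 1))

def oracle(yt, t):
    """Exact score when all data sit at y_star; stands in for a trained network."""
    decay = np.exp(-lam * t / 2)
    var = (1 - np.exp(-lam * t)) / lam
    return -(yt - y_star * decay) / var

def zero_score(yt, t):
    """Untrained baseline that always predicts a zero score."""
    return np.zeros_like(yt)

loss_oracle = dsm_loss(oracle, y0, lam, 2.0, np.random.default_rng(0))
loss_zero = dsm_loss(zero_score, y0, lam, 2.0, np.random.default_rng(0))
print(loss_oracle, loss_zero)
```

The oracle attains zero loss because, for a point-mass data law, the marginal and conditional scores coincide. For a real data distribution the minimum of the loss is positive, and it is attained by the marginal score.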
Datasets:
- Training: All PDB structures up to 2020-04-30, sequence length 20–256; 230k proteins.
- Validation: PDB structures from 2020-05-01 to 2020-11-30; 14k proteins.
- Test: 183 recent CAMEO targets (Aug–Oct 2022), up to 750 residues.
Key hyperparameters:
- Batch size: 64
- Learning rate: , decayed on plateau
- Noise schedule: OU with threshold
- Inference steps: 100–300 (protein-size dependent)
Technical setup:
- Implemented in PyTorch with e3nn, run on NVIDIA V100 GPUs
- Average training: 5 days on 8 GPUs
- Per-protein inference: 2–5 minutes
4. Performance on Benchmark Tasks
On the CAMEO dataset, five structures are sampled per target and ranked using an approximate ELBO (no auxiliary ranking network). Table 1 summarizes accuracy metrics versus baselines; each cell reports mean / median over targets:
| Method | RMSD_Cα ↓ | TMScore ↑ | GDT-TS ↑ | lDDT_Cα ↑ |
|---|---|---|---|---|
| AlphaFold2 | 3.30 / 1.64 | 0.87 / 0.95 | 0.86 / 0.91 | 0.90 / 0.93 |
| ESMFold | 3.99 / 2.03 | 0.85 / 0.93 | 0.83 / 0.88 | 0.87 / 0.90 |
| OmegaFold | 5.26 / 2.62 | 0.80 / 0.89 | 0.77 / 0.84 | 0.83 / 0.89 |
| RoseTTAFold | 5.72 / 3.17 | 0.77 / 0.84 | 0.71 / 0.75 | 0.79 / 0.82 |
| EigenFold | 7.37 / 3.50 | 0.75 / 0.84 | 0.71 / 0.79 | 0.78 / 0.85 |
Median TMScore for EigenFold is 0.84.
5. Ensemble-Based Uncertainty Quantification
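Structural deviation metrics (e.g., TMScore, RMSD_Cα, per-residue lDDT) can be ensemble-averaged over all pairs of the $K$ sampled structures as:
$$\bar{d} = \frac{1}{K(K-1)} \sum_{i \neq j} d\left(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}\right)$$
This correlates strongly with the true prediction error (the same metric $d$ evaluated against the reference structure), with Pearson correlations at the protein level: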
- TMScore:
- GDT-TS:
- lDDT_Cα:
At the residue level, expected pairwise lDDT among samples predicts per-residue lDDT (mean per-target , median $0.81$). Thus, the sampled ensemble serves as an estimator of predictive confidence, obviating the need for specialized network heads for confidence prediction.
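
The idea of using agreement among samples as a confidence signal can be sketched as a mean pairwise deviation over the ensemble. The code below is a hedged illustration, not the paper's evaluation code: it uses superposition-based RMSD as the metric because it is short to implement, and the names `kabsch_rmsd` and `ensemble_variation`, the toy coordinates, and the noise scales are all assumptions. TMScore or lDDT would need a dedicated implementation.

```python
import numpy as np
from itertools import combinations

def kabsch_rmsd(a, b):
    """RMSD between two (n, 3) coordinate sets after optimal rigid superposition."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    u, s, vt = np.linalg.svd(a.T @ b)
    # Flip the smallest singular value if the best orthogonal map is a reflection.
    s[-1] *= np.sign(np.linalg.det(u @ vt))
    msd = (np.sum(a**2) + np.sum(b**2) - 2.0 * s.sum()) / len(a)
    return np.sqrt(max(msd, 0.0))

def ensemble_variation(samples, metric):
    """Mean of a symmetric pairwise metric over all unordered pairs of samples."""
    pairs = list(combinations(range(len(samples)), 2))
    return sum(metric(samples[i], samples[j]) for i, j in pairs) / len(pairs)

rng = np.random.default_rng(0)
base = rng.normal(size=(30, 3))                       # toy 30-residue structure
tight = [base + 0.05 * rng.normal(size=base.shape) for _ in range(5)]
loose = [base + 1.00 * rng.normal(size=base.shape) for _ in range(5)]

print("tight ensemble:", ensemble_variation(tight, kabsch_rmsd))
print("loose ensemble:", ensemble_variation(loose, kabsch_rmsd))
```

A tightly clustered ensemble yields a small pairwise deviation and a scattered one yields a large deviation, which is the quantity that can then be compared against the error to a reference structure.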
6. Modeling Conformational Heterogeneity
Tests on curated sets evaluated the capacity of EigenFold to sample conformationally heterogeneous states:
- 77 fold-switching pairs [Chakravarty et al., 2022]
- 90 apo/holo ligand-induced pairs [Saldano et al., 2022]
For each sequence, five sampled structures are evaluated. The coverage of two conformers by the ensemble is:
EigenFold's ensemble coverage did not systematically exceed a baseline of always sampling one conformation. The sample diversity
correlates only moderately with inter-conformer TMScore: (fold-switching), (apo/holo). Residue-level variance correlates moderately (–$0.41$) with sample positional variance. Thus, EigenFold primarily models its own epistemic uncertainty, rather than capturing fine-grained biophysical heterogeneity of true conformational ensembles.
7. Software and Resources
Source code, pretrained embeddings, and hyperparameter configurations are available through https://github.com/bjing2016/EigenFold (Jing et al., 2023). The implementation leverages PyTorch and e3nn, and employs OmegaFold Geoformer embeddings for input features.