Multi-state Protein Design with DynamicMPNN (2507.21938v1)

Published 29 Jul 2025 in cs.LG and q-bio.BM

Abstract: Structural biology has long been dominated by the one sequence, one structure, one function paradigm, yet many critical biological processes - from enzyme catalysis to membrane transport - depend on proteins that adopt multiple conformational states. Existing multi-state design approaches rely on post-hoc aggregation of single-state predictions, achieving poor experimental success rates compared to single-state design. We introduce DynamicMPNN, an inverse folding model explicitly trained to generate sequences compatible with multiple conformations through joint learning across conformational ensembles. Trained on 46,033 conformational pairs covering 75% of CATH superfamilies and evaluated using AlphaFold initial guess, DynamicMPNN outperforms ProteinMPNN by up to 13% on structure-normalized RMSD across our challenging multi-state protein benchmark.

Summary

  • The paper introduces DynamicMPNN, a multi-state protein design framework that explicitly models sequence compatibility across diverse conformations.
  • It employs a multi-state SE(3)-equivariant geometric deep learning architecture with GVP layers and Deep Set pooling to efficiently encode complex conformational ensembles.
  • The approach demonstrates improved performance with up to 13% lower RMSD and 3% higher pLDDT scores, alongside robust data leakage controls for enhanced generalization.

DynamicMPNN: A Geometric Deep Learning Approach for Multi-State Protein Design

Introduction

The "one sequence, one structure, one function" paradigm has historically dominated protein design, yet a significant fraction of biologically relevant proteins exhibit conformational plasticity, adopting multiple functional states. This conformational diversity is central to processes such as enzyme catalysis, allostery, and molecular transport. However, computational protein design has lagged in addressing multi-state requirements, with most methods relying on post-hoc aggregation of single-state predictions, resulting in low experimental success rates for multi-state targets. The DynamicMPNN framework directly addresses this gap by introducing an explicit multi-state inverse folding model, leveraging geometric deep learning to generate sequences compatible with multiple conformations.

Dataset Construction and Benchmarking

A major bottleneck in multi-state protein design is the scarcity of high-quality, multi-conformational structural data. DynamicMPNN circumvents this by constructing a dataset of 46,033 conformer pairs, derived from the PDB and CoDNaS databases, covering 75% of CATH superfamilies. The dataset is curated to maximize conformational diversity by selecting pairs with the highest RMSD within each cluster, ensuring that the model is exposed to the most challenging conformational transitions. The benchmark set comprises 94 proteins with large conformational changes, including metamorphic, hinge, and transporter proteins, providing a stringent testbed for multi-state design.
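
The pair-selection step above reduces to a small combinatorial routine. The sketch below is a minimal illustration under assumed inputs (pre-superposed Cα coordinate arrays grouped by cluster); `ca_rmsd` and the cluster layout are illustrative stand-ins, not the authors' pipeline.

```python
from itertools import combinations

import numpy as np

def ca_rmsd(a: np.ndarray, b: np.ndarray) -> float:
    """Cα RMSD between two pre-superposed (N, 3) coordinate arrays."""
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=-1))))

def most_diverse_pair(cluster: list) -> tuple:
    """Return indices of the conformer pair with the largest mutual RMSD.

    `cluster` holds Cα coordinates for every conformer of one sequence
    cluster; all arrays are assumed aligned to a common residue set.
    """
    best, best_pair = -1.0, (0, 1)
    for i, j in combinations(range(len(cluster)), 2):
        d = ca_rmsd(cluster[i], cluster[j])
        if d > best:
            best, best_pair = d, (i, j)
    return best_pair
```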

Model Architecture and Training

DynamicMPNN models the joint conditional distribution p(Y | X_1, ..., X_m), where Y is the amino acid sequence and {X_1, ..., X_m} are the target conformations. Unlike previous approaches that aggregate independent single-state predictions, DynamicMPNN employs a multi-state GNN encoder based on SE(3)-equivariant Geometric Vector Perceptron (GVP) layers. Each conformation, along with its chemical environment, is encoded independently, and the resulting embeddings are pooled using Deep Set pooling to ensure invariance to conformation order. The pooled embedding is then used for autoregressive sequence generation.
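
Concretely, the autoregressive decoder implies the standard inverse-folding factorization over residues (up to the model's choice of decoding order); this is the generic decomposition, written out here for clarity rather than quoted from the paper:

$$p(Y \mid X_1, \ldots, X_m) \;=\; \prod_{i=1}^{n} p\!\left(y_i \mid y_{<i},\, X_1, \ldots, X_m\right)$$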

Key architectural features include:

  • SE(3)-equivariant GVP layers for both encoder and decoder, maintaining geometric consistency and computational efficiency.
  • Multi-state GNN encoding that processes conformational ensembles as multi-graphs, preserving permutation equivariance across both residue and conformation axes.
  • Alignment and pooling strategies to handle non-identical sequences and missing residues, using pairwise sequence alignments and order-invariant pooling (see the pooling sketch after this list).
  • Multi-chain encoding and masking to incorporate interaction partners, with masking to prevent information leakage from highly similar chains.
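
A minimal sketch of the order-invariant pooling mentioned above, including the missing-residue handling referenced in the list, assuming per-conformation residue embeddings plus a presence mask; the tensor shapes and two-layer MLPs are illustrative choices, not the published architecture.

```python
import torch
import torch.nn as nn

class DeepSetPool(nn.Module):
    """Pool per-conformation residue embeddings, invariant to conformer order.

    emb:  (m, n, d) -- m conformations, n aligned residue positions, d channels
    mask: (m, n)    -- 1 where a residue is resolved in that conformer, else 0
    """

    def __init__(self, d: int):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.rho = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, emb: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        h = self.phi(emb) * mask.unsqueeze(-1)  # zero out missing residues
        # Masked mean over the conformation axis: permuting the conformers
        # leaves the sum (and thus the output) unchanged -- Deep Sets invariance.
        denom = mask.sum(dim=0).clamp(min=1.0).unsqueeze(-1)  # (n, 1)
        return self.rho(h.sum(dim=0) / denom)                 # (n, d)

# Two conformers of a 120-residue protein with 128-d embeddings; the last
# 20 residues are unresolved in the second conformer.
pool = DeepSetPool(d=128)
emb, mask = torch.randn(2, 120, 128), torch.ones(2, 120)
mask[1, 100:] = 0.0
per_residue = pool(emb, mask)  # (120, 128), identical under conformer swap
```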

The model is trained for 50 epochs, with validation using the AlphaFold Initial Guess (AFIG) framework to select the best-performing checkpoint.

Evaluation Metrics and Protocol

Traditional sequence recovery metrics are insufficient for multi-state design, as they do not capture the ability of a sequence to fold into multiple target conformations. DynamicMPNN introduces a multi-state self-consistency metric based on AFIG, which biases AlphaFold2 predictions towards the target backbone coordinates. The primary evaluation metrics are:

  • AFIG RMSD: Cα-RMSD between the AFIG-predicted structure and the target conformation.
  • Structure-normalized RMSD: AFIG RMSD normalized by the maximum RMSD between target conformations, contextualizing design difficulty.
  • Decoy-normalized RMSD: AFIG RMSD normalized by the RMSD to structurally dissimilar decoy structures, assessing specificity.
  • pLDDT: AlphaFold2 confidence scores, indicating foldability and prediction certainty.

Aggregated metrics are computed for the best single sequence, best paired (averaged over both states), and average across all sampled sequences.
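
To make the normalization and aggregation concrete, here is a small sketch under assumed inputs: `rmsd[s, k]` holds the AFIG RMSD of sampled sequence `s` against target state `k` for one two-state design problem. Names and shapes are illustrative, not the authors' evaluation code.

```python
import numpy as np

def aggregate_two_state(rmsd: np.ndarray, target_rmsd: float) -> dict:
    """Aggregate AFIG RMSDs for one design problem with two target states.

    rmsd:        (S, 2) AFIG Cα-RMSD of each of S sequences vs. each state
    target_rmsd: RMSD between the two target conformations (design difficulty)
    """
    norm = rmsd / target_rmsd               # structure-normalized RMSD
    paired = norm.mean(axis=1)              # per-sequence average over both states
    return {
        "best_single": float(norm.min()),   # best single (sequence, state) value
        "best_paired": float(paired.min()), # best sequence averaged over states
        "average": float(norm.mean()),      # mean over all sequences and states
    }

scores = aggregate_two_state(np.random.rand(16, 2) * 5.0, target_rmsd=8.0)
```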

Results

DynamicMPNN demonstrates consistent improvements over the ProteinMPNN multi-state design (ProteinMPNN-MSD) baseline across all metrics. Notably:

  • Best paired RMSD: DynamicMPNN achieves up to 13% lower RMSD compared to ProteinMPNN-MSD, a statistically significant improvement (Wilcoxon signed-rank test, p < 0.0001; see the sketch after this list).
  • pLDDT scores: DynamicMPNN yields up to 3% higher confidence scores, indicating more reliable folding into both target states.
  • Data leakage control: ProteinMPNN's training set contains proteins highly similar to 84 of the 94 benchmark proteins, whereas DynamicMPNN was trained with sequences similar to the test set strictly excluded; its gains therefore come despite this leakage disadvantage, underscoring the robustness of its generalization.
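
The significance test cited in the list above is reproducible with SciPy's paired Wilcoxon signed-rank test; the arrays below are random placeholders standing in for per-target best-paired RMSDs from the two models.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Placeholder per-target best-paired RMSDs, one value per benchmark protein
dynamicmpnn_rmsd = rng.random(94) * 4.0
proteinmpnn_rmsd = dynamicmpnn_rmsd + rng.random(94)  # illustrative offset

# Paired, non-parametric test on the per-target differences
stat, p_value = wilcoxon(dynamicmpnn_rmsd, proteinmpnn_rmsd)
print(f"Wilcoxon statistic = {stat:.1f}, p = {p_value:.2e}")
```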

A case study on the Switch Arc protein illustrates that DynamicMPNN successfully recapitulates the central β-sheet fold in both conformations, whereas ProteinMPNN fails to do so. While the best-of-16 designed sequences from both models outperform natural sequences in refoldability, the average performance across all designs is slightly inferior to natural sequences, likely reflecting the multi-objective optimization of natural evolution versus the structure-centric optimization of inverse folding models.

DynamicMPNN underperforms ProteinMPNN on sequence recovery and perplexity, but this is expected given the focus on structural compatibility rather than sequence similarity. The authors argue, in line with recent literature, that refoldability is a more direct and relevant metric for inverse folding in the multi-state context.

Implementation Considerations

DynamicMPNN is implemented using PyTorch and leverages efficient GVP-based message passing for both encoding and decoding. The model requires access to high-quality structural ensembles and benefits from GPU acceleration due to the computational demands of SE(3)-equivariant operations and large-scale data processing. The AFIG-based evaluation pipeline is computationally intensive, as it necessitates multiple AlphaFold2 runs per designed sequence and conformation.
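
The cost of this pipeline follows directly from its nesting: every designed sequence is folded once per target conformation, so S sampled sequences against m states means S × m AlphaFold2 runs per design problem. A schematic of that loop, with `predict_afig` as a hypothetical mock standing in for the external AlphaFold2-initial-guess wrapper:

```python
import numpy as np

def predict_afig(sequence: str, initial_guess: np.ndarray) -> np.ndarray:
    """Mock stand-in for one AlphaFold2-initial-guess run (hypothetical)."""
    return initial_guess + 0.5 * np.random.randn(*initial_guess.shape)

def ca_rmsd(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=-1))))

def evaluate_designs(sequences: list, targets: list) -> dict:
    """S sequences x m conformations -> S * m AlphaFold2 runs in total."""
    out = {}
    for s, seq in enumerate(sequences):       # e.g. 16 sampled sequences
        for k, target in enumerate(targets):  # e.g. 2 target states
            pred = predict_afig(seq, initial_guess=target)
            out[(s, k)] = ca_rmsd(pred, target)
    return out
```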

For practical deployment:

  • Dataset curation is critical; maximizing conformational diversity and minimizing sequence redundancy are essential for robust generalization.
  • Model scaling to larger conformational ensembles or higher-order oligomers may require further architectural optimizations, particularly in pooling strategies and memory management.
  • Integration with experimental pipelines: The best-performing sequences, as identified by AFIG metrics, should be prioritized for in vitro validation, given the low experimental success rates observed in prior multi-state design studies.

Implications and Future Directions

DynamicMPNN represents a significant methodological advance in multi-state protein design, enabling the explicit modeling of sequence constraints across conformational ensembles. This has direct implications for the engineering of synthetic bioswitches, allosteric regulators, and molecular machines, where multi-state compatibility is essential. The approach also provides a framework for benchmarking and evaluating multi-state design methods, moving beyond sequence recovery to structure-grounded metrics.

Future developments may include:

  • Extension to continuous conformational landscapes, such as intrinsically disordered proteins, by incorporating larger and more diverse structural ensembles.
  • Incorporation of non-protein interaction partners (e.g., ligands, nucleic acids) to model more complex functional states.
  • Improved pooling and attention mechanisms to better capture long-range and inter-state dependencies.
  • Integration with generative backbone design to enable end-to-end multi-state de novo protein engineering.

Conclusion

DynamicMPNN establishes a new paradigm for multi-state protein design by jointly learning sequence compatibility across conformational ensembles using geometric deep learning. The model achieves up to 13% improvement in structure-grounded metrics over state-of-the-art baselines on a challenging benchmark, despite stringent controls on data leakage. This work provides both a methodological foundation and a practical toolset for advancing the design of dynamic, functionally versatile proteins.
