
Multi-state Protein Design with DynamicMPNN (2507.21938v1)

Published 29 Jul 2025 in cs.LG and q-bio.BM

Abstract: Structural biology has long been dominated by the one sequence, one structure, one function paradigm, yet many critical biological processes - from enzyme catalysis to membrane transport - depend on proteins that adopt multiple conformational states. Existing multi-state design approaches rely on post-hoc aggregation of single-state predictions, achieving poor experimental success rates compared to single-state design. We introduce DynamicMPNN, an inverse folding model explicitly trained to generate sequences compatible with multiple conformations through joint learning across conformational ensembles. Trained on 46,033 conformational pairs covering 75% of CATH superfamilies and evaluated using AlphaFold initial guess, DynamicMPNN outperforms ProteinMPNN by up to 13% on structure-normalized RMSD across our challenging multi-state protein benchmark.

Summary

  • The paper introduces a novel multi-state inverse folding model, DynamicMPNN, that designs protein sequences accommodating multiple conformations.
  • It employs SE(3)-equivariant GVP layers and order-invariant pooling on a curated dataset of 46,033 conformer pairs to achieve up to 13% lower RMSD than previous methods.
  • Evaluation via an AlphaFold-based framework shows higher pLDDT scores and statistically significant improvements (p < 0.0001) in sequence refoldability.

Multi-state Protein Design with DynamicMPNN: A Technical Analysis

Introduction

The "one sequence, one structure, one function" paradigm has historically dominated protein design, yet a significant subset of biologically relevant proteins exhibits multi-state conformational dynamics essential for function. The design of protein sequences compatible with multiple conformational states (critical for bioswitches, allosteric regulators, and molecular machines) remains a major challenge. Existing approaches, such as post-hoc aggregation of single-state predictions (e.g., ProteinMPNN-MSD), have demonstrated limited experimental success, largely due to insufficient modeling of the joint sequence-structure landscape and the scarcity of high-quality multi-conformational datasets. The DynamicMPNN framework addresses these limitations directly by introducing an explicit multi-state inverse folding model, trained on a curated dataset of conformational pairs and evaluated with a robust, structure-grounded metric based on AlphaFold Initial Guess (AFIG).

Dataset Construction and Benchmarking

DynamicMPNN's training leverages a multi-conformational dataset constructed from the PDB and CoDNaS, comprising 46,033 conformer pairs and covering 75% of CATH superfamilies. The dataset curation strategy—selecting pairs with maximal RMSD within high-sequence-identity clusters—maximizes conformational diversity while minimizing alignment artifacts. The test set is deliberately challenging, consisting of 94 proteins with the largest documented conformational changes, including metamorphic, hinge, and transporter proteins. Rigorous train/validation/test splits are enforced by TM-score filtering to prevent structural similarity leakage.
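The max-RMSD pair-selection step can be sketched as a small Python routine. This is an illustrative reading of the curation strategy, not the paper's code; the `rmsd` callable and the structure identifiers are hypothetical stand-ins for a real pairwise-alignment pipeline.

```python
from itertools import combinations

def select_max_rmsd_pair(cluster, rmsd):
    """Return the conformer pair with the largest pairwise RMSD.

    `cluster` lists structure IDs from one high-sequence-identity cluster;
    `rmsd(a, b)` is a hypothetical callable returning the aligned
    C-alpha RMSD (in angstroms) between two structures.
    """
    return max(combinations(cluster, 2), key=lambda pair: rmsd(*pair))

# Toy usage with a precomputed, symmetric RMSD table (illustrative values):
table = {
    ("open", "partial"): 3.1,
    ("open", "closed"): 8.2,
    ("partial", "closed"): 5.4,
}
rmsd = lambda a, b: table.get((a, b), table.get((b, a), 0.0))
print(select_max_rmsd_pair(["open", "partial", "closed"], rmsd))
# ('open', 'closed')
```

Choosing the most structurally distant pair per cluster is what maximizes conformational diversity while keeping sequences nearly identical, so the learning signal comes from structure rather than sequence variation.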

Model Architecture and Training

DynamicMPNN is architected as a geometric deep learning pipeline, extending the gRNAde multi-state GNN framework to proteins. Each conformational state, along with its chemical environment (currently limited to protein interaction partners), is independently encoded using SE(3)-equivariant Geometric Vector Perceptron (GVP) layers. The resulting embeddings are pooled across conformations using Deep Set pooling, ensuring invariance to conformation order. The pooled representation is then passed to an autoregressive sequence decoder, which models the joint conditional distribution p(Y | X_1, ..., X_m), where Y is the amino acid sequence and X_1, ..., X_m are the backbone structures of the conformational ensemble.

Key architectural features include:

  • SE(3)-equivariant GVP layers: Maintain geometric consistency and computational efficiency via edge sparsity.
  • Multi-state GNN encoder: Processes multi-graph representations, maintaining permutation equivariance across both residue and conformation axes.
  • Order-invariant pooling: Deep Set pooling ensures that the model is agnostic to the order of conformations.
  • Multi-chain encoding and masking: Incorporates interaction partners, with masking to prevent information leakage from highly similar chains.

Training is performed for 50 epochs, with model selection based on AFIG evaluation on the validation set.

Evaluation Methodology

DynamicMPNN's evaluation departs from traditional sequence recovery metrics, instead focusing on refoldability as assessed by the AFIG framework. AFIG initializes AlphaFold2 backbone frames on target structure coordinates, biasing predictions toward the desired conformation. The primary metrics are:

  • AFIG RMSD: Cα-RMSD between predicted and target structures for each conformation.
  • Structure normalization: RMSD normalized by the maximal RMSD between target conformations, contextualizing task difficulty.
  • Decoy normalization: RMSD normalized by the RMSD to structurally dissimilar decoy structures, controlling for non-specific folding.
  • pLDDT: AlphaFold2 confidence scores, providing a proxy for foldability and prediction certainty.

This evaluation protocol is more stringent and structure-grounded than sequence recovery, directly measuring the likelihood that a designed sequence will adopt all target conformations.
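The two normalization schemes reduce to simple ratios once the raw AFIG RMSDs are in hand. The sketch below states them explicitly; the specific values are illustrative, not from the paper.

```python
def structure_normalized_rmsd(afig_rmsd, inter_state_rmsd):
    """AFIG RMSD divided by the RMSD between the two target conformations.
    Values well below 1.0 mean the prediction sits closer to its target
    state than the target states sit to each other."""
    return afig_rmsd / inter_state_rmsd

def decoy_normalized_rmsd(afig_rmsd, decoy_rmsd):
    """AFIG RMSD divided by the RMSD to a structurally dissimilar decoy,
    controlling for non-specific folding."""
    return afig_rmsd / decoy_rmsd

# Toy example: a 2.0 A prediction error on an 8.0 A conformational change.
print(structure_normalized_rmsd(2.0, 8.0))  # 0.25
```

Normalizing by the inter-state RMSD is what makes scores comparable across proteins: a 2 Å error is excellent for a transporter with an 8 Å conformational change but uninformative for a pair of near-identical states.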

Results

DynamicMPNN demonstrates consistent improvements over ProteinMPNN-MSD across all AFIG-based metrics:

  • Best Paired RMSD: Up to 13% lower than ProteinMPNN-MSD, indicating superior multi-state compatibility.
  • Best Paired pLDDT: Up to 3% higher, reflecting increased confidence in predicted structures.
  • Statistical significance: Improvements are robust (Wilcoxon signed-rank test, p < 0.0001).
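One plausible reading of the "Best Paired" selection rule can be sketched as follows. This is an assumed interpretation (worst-case RMSD per sequence across the two states, best such value across samples), not the paper's reference implementation, and the numbers are illustrative.

```python
def best_paired_rmsd(sampled):
    """For each designed sequence, take the worse (max) AFIG RMSD across
    the two target conformations, then report the best (min) such value
    over all sampled sequences: the single sequence that refolds best
    to both states simultaneously."""
    return min(max(r1, r2) for r1, r2 in sampled)

# Toy (state-1, state-2) RMSDs for three sampled sequences:
samples = [(1.8, 6.5), (2.4, 2.9), (5.0, 1.2)]
print(best_paired_rmsd(samples))  # 2.9
```

Under this reading, sequences that excel at one state while failing the other (the first and third samples above) are penalized, which is exactly the failure mode of post-hoc single-state aggregation.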

Notably, DynamicMPNN outperforms ProteinMPNN-MSD even in "Best Single" metrics, despite the latter's training data advantage (data leakage from sequence clusters). However, average metrics across all sampled sequences are slightly worse than natural sequences, consistent with prior observations that inverse folding models optimize for stability rather than the multi-objective constraints of natural evolution.

DynamicMPNN's sequence recovery and perplexity are lower than ProteinMPNN-MSD's, but this is expected given the focus on refoldability rather than sequence similarity. The model's performance is particularly notable given the challenging benchmark, which includes proteins with the largest conformational changes in the known proteome.

Implementation Considerations

Computational Requirements

  • Model complexity: The use of GVP layers and multi-state GNNs increases computational overhead relative to single-state models, but edge sparsity and pooling strategies mitigate this.
  • Training data: Construction of a high-quality, non-redundant multi-conformational dataset is non-trivial and requires careful curation to avoid data leakage.
  • Evaluation: AFIG-based evaluation is computationally intensive, as it requires multiple AlphaFold2 runs per designed sequence and conformation.

Deployment Strategies

  • Integration with experimental pipelines: The "Best Paired" metric aligns with experimental validation workflows, where only the top candidates are selected for synthesis and characterization.
  • Extension to other interaction partners: Current implementation is limited to protein-protein interactions; extension to nucleic acids or small molecules would require additional featurization and masking strategies.
  • Scalability: The model is suitable for high-throughput design, but inference speed is constrained by the need for multiple structure predictions per sequence.

Limitations

  • Conformational coverage: The model is trained on discrete conformational pairs, not continuous conformational landscapes (e.g., intrinsically disordered proteins).
  • Evaluation bias: AFIG does not use MSA information, potentially underestimating the importance of long-range evolutionary constraints present in natural sequences.
  • Sequence diversity: Designed sequences may be biased toward local stability, potentially limiting functional diversity.

Implications and Future Directions

DynamicMPNN represents a significant methodological advance in multi-state protein design, enabling the explicit modeling of sequence constraints across conformational ensembles. This has immediate applications in the design of bioswitches, allosteric regulators, and synthetic molecular machines. The framework also provides a template for extending multi-state design to other biomolecular systems, such as RNA or protein-nucleic acid complexes.

Future developments may include:

  • Continuous conformational modeling: Extending the approach to model continuous structural ensembles, potentially via diffusion models or normalizing flows.
  • Integration with experimental feedback: Incorporating high-throughput experimental data to further refine sequence-structure compatibility.
  • Generalization to other modalities: Adapting the architecture for ligand binding, post-translational modifications, or membrane environments.

Conclusion

DynamicMPNN introduces an explicit, joint-learning approach to multi-state protein design, outperforming post-hoc aggregation strategies on a challenging benchmark. By leveraging geometric deep learning and a robust, structure-based evaluation protocol, DynamicMPNN advances the state of the art in designing sequences compatible with multiple functional conformations. The framework sets a new standard for multi-state design and opens avenues for the rational engineering of dynamic protein systems.
