Progressive Multi-Modality Learning for Inverse Protein Folding (2312.06297v2)
Abstract: While deep generative models show promise for learning inverse protein folding directly from data, the scarcity of publicly available structure-sequence pairs limits their generalization. Previous model improvements and data augmentation efforts have not been sufficient to overcome this bottleneck. To address this challenge, we propose a novel protein design paradigm called MMDesign, which leverages multi-modality transfer learning. To our knowledge, MMDesign is the first framework that combines a pretrained structural module with a pretrained contextual module, using an auto-encoder (AE) based LLM to incorporate prior protein semantic knowledge. Experimental results demonstrate that MMDesign, trained only on a small dataset, consistently outperforms baselines on various public benchmarks. To further assess biological plausibility, we present systematic quantitative analysis techniques that provide interpretability and reveal more about the laws of protein design.
Authors: Jiangbin Zheng, Stan Z. Li