A Text-guided Protein Design Framework (2302.04611v3)
Abstract: Current AI-assisted protein design mainly utilizes protein sequence and structural information. Meanwhile, there is a tremendous amount of human-curated knowledge in text form describing proteins' high-level functionalities, yet whether incorporating such text data can help protein design has not been explored. To bridge this gap, we propose ProteinDT, a multi-modal framework that leverages textual descriptions for protein design. ProteinDT consists of three consecutive steps: ProteinCLAP, which aligns the representations of the text and protein modalities; a facilitator that generates the protein representation from the text modality; and a decoder that generates protein sequences from that representation. To train ProteinDT, we construct SwissProtCLAP, a large dataset of 441K text–protein pairs. We quantitatively verify the effectiveness of ProteinDT on three challenging tasks: (1) over 90% accuracy on text-guided protein generation; (2) the best hit ratio on 12 zero-shot text-guided protein editing tasks; and (3) superior performance on four out of six protein property prediction benchmarks.
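The three-step pipeline in the abstract can be sketched in PyTorch. Everything below is an illustrative assumption, not the authors' implementation: ProteinDT builds on pretrained text and protein encoders, whereas this sketch uses toy encoders, a plain MLP facilitator, and a GRU decoder purely to show how the stages connect. All module names, dimensions, and vocabulary sizes are hypothetical.

```python
# Minimal sketch of the three-stage ProteinDT pipeline described in the
# abstract. Toy components throughout: the paper uses pretrained language
# models for each modality; module names and sizes here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 256  # shared latent dimension (assumption)

class ToyEncoder(nn.Module):
    """Stand-in for a pretrained language model (text or protein)."""
    def __init__(self, vocab_size: int, dim: int = DIM):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.embed(tokens).mean(dim=1)  # mean-pool token embeddings
        return F.normalize(self.proj(h), dim=-1)

def clap_loss(text_z: torch.Tensor, prot_z: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    """Step 1 (ProteinCLAP): symmetric InfoNCE-style contrastive loss
    aligning paired text and protein representations, CLIP-style."""
    logits = text_z @ prot_z.t() / temperature
    labels = torch.arange(text_z.size(0))  # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

class Facilitator(nn.Module):
    """Step 2: map a text representation to a protein representation
    (an MLP here; only the interface matters for this sketch)."""
    def __init__(self, dim: int = DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, text_z: torch.Tensor) -> torch.Tensor:
        return self.net(text_z)

class Decoder(nn.Module):
    """Step 3: decode a protein sequence conditioned on the predicted
    representation (a toy autoregressive GRU)."""
    def __init__(self, vocab_size: int, dim: int = DIM):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, prot_z: torch.Tensor, tokens: torch.Tensor):
        h0 = prot_z.unsqueeze(0)             # condition on representation
        h, _ = self.rnn(self.embed(tokens), h0)
        return self.out(h)                   # next-token logits

# Toy forward pass on random batches (vocab sizes are arbitrary;
# 25 loosely stands in for the amino-acid alphabet plus specials).
text = torch.randint(0, 1000, (4, 16))       # 4 text descriptions
prot = torch.randint(0, 25, (4, 32))         # 4 paired protein sequences
text_enc, prot_enc = ToyEncoder(1000), ToyEncoder(25)
loss = clap_loss(text_enc(text), prot_enc(prot))   # step 1: alignment
z = Facilitator()(text_enc(text))                  # step 2: text -> protein z
logits = Decoder(25)(z, prot)                      # step 3: sequence decoding
```

The design intuition this sketch tries to convey: the contrastive step pulls the two modalities into a shared space, so the facilitator only has to bridge a small modality gap, and the decoder can then treat the text-derived representation as if it were a protein representation.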
Authors: Shengchao Liu, Yanjing Li, Zhuoxinran Li, Anthony Gitter, Yutao Zhu, Jiarui Lu, Zhao Xu, Weili Nie, Arvind Ramanathan, Chaowei Xiao, Jian Tang, Hongyu Guo, Anima Anandkumar