
A Text-guided Protein Design Framework (2302.04611v3)

Published 9 Feb 2023 in cs.LG, cs.AI, q-bio.QM, and stat.ML

Abstract: Current AI-assisted protein design mainly utilizes proteins' sequential and structural information. Meanwhile, there exists a wealth of human-curated knowledge in text form describing proteins' high-level functionalities. Yet, whether incorporating such text data can help protein design tasks has not been explored. To bridge this gap, we propose ProteinDT, a multi-modal framework that leverages textual descriptions for protein design. ProteinDT consists of three consecutive steps: ProteinCLAP, which aligns the representations of the two modalities; a facilitator that generates the protein representation from the text modality; and a decoder that creates protein sequences from that representation. To train ProteinDT, we construct a large dataset, SwissProtCLAP, with 441K text-protein pairs. We quantitatively verify the effectiveness of ProteinDT on three challenging tasks: (1) over 90% accuracy in text-guided protein generation; (2) the best hit ratio on 12 zero-shot text-guided protein editing tasks; (3) superior performance on four out of six protein property prediction benchmarks.
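The abstract's three-step pipeline is concrete enough to sketch in code. Below is a minimal PyTorch illustration of the first two stages: a CLIP-style contrastive alignment standing in for ProteinCLAP, and a facilitator mapping text latents to protein latents. All module names, dimensions, and the MLP facilitator are illustrative assumptions rather than the paper's exact architecture, and the final decoding step (generating sequences from the protein representation) is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProteinCLAP(nn.Module):
    """Sketch of contrastive text-protein alignment (InfoNCE-style).

    Dimensions are placeholders; the real model uses pretrained
    text and protein encoders upstream of these projections.
    """
    def __init__(self, text_dim=768, prot_dim=1024, latent_dim=256, temperature=0.1):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, latent_dim)
        self.prot_proj = nn.Linear(prot_dim, latent_dim)
        self.temperature = temperature

    def forward(self, text_emb, prot_emb):
        # Project both modalities into a shared space and L2-normalize.
        t = F.normalize(self.text_proj(text_emb), dim=-1)
        p = F.normalize(self.prot_proj(prot_emb), dim=-1)
        logits = t @ p.T / self.temperature  # (B, B) pairwise similarities
        labels = torch.arange(t.size(0), device=t.device)
        # Symmetric InfoNCE: the i-th text matches the i-th protein.
        return (F.cross_entropy(logits, labels)
                + F.cross_entropy(logits.T, labels)) / 2

class Facilitator(nn.Module):
    """Maps an aligned text latent to a protein latent.

    A plain MLP stands in here; the paper's facilitator is a
    learned generative mapping, not necessarily this form.
    """
    def __init__(self, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, latent_dim),
            nn.ReLU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, text_latent):
        return self.net(text_latent)

# Toy usage with random tensors standing in for encoder outputs
# (e.g., a scientific-text encoder and a protein language model).
clap = ProteinCLAP()
text_emb = torch.randn(8, 768)
prot_emb = torch.randn(8, 1024)
loss = clap(text_emb, prot_emb)
loss.backward()
```

The decoder stage would then condition a sequence generator (the paper explores both autoregressive and diffusion decoders) on the facilitator's output to produce amino-acid sequences.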

Authors (13)
  1. Shengchao Liu
  2. Yanjing Li
  3. Zhuoxinran Li
  4. Anthony Gitter
  5. Yutao Zhu
  6. Jiarui Lu
  7. Zhao Xu
  8. Weili Nie
  9. Arvind Ramanathan
  10. Chaowei Xiao
  11. Jian Tang
  12. Hongyu Guo
  13. Anima Anandkumar
Citations (55)