Diffusion Language Models Are Versatile Protein Learners (2402.18567v2)

Published 28 Feb 2024 in cs.LG and q-bio.BM

Abstract: This paper introduces diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences. We first pre-train scalable DPLMs from evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework, which generalizes language modeling for proteins in a principled way. After pre-training, DPLM exhibits the ability to generate structurally plausible, novel, and diverse protein sequences for unconditional generation. We further demonstrate the proposed diffusion generative pre-training makes DPLM possess a better understanding of proteins, making it a superior representation learner, which can be fine-tuned for various predictive tasks, comparing favorably to ESM2 (Lin et al., 2022). Moreover, DPLM can be tailored for various needs, which showcases its prowess of conditional generation in several ways: (1) conditioning on partial peptide sequences, e.g., generating scaffolds for functional motifs with high success rate; (2) incorporating other modalities as conditioner, e.g., structure-conditioned generation for inverse folding; and (3) steering sequence generation towards desired properties, e.g., satisfying specified secondary structures, through a plug-and-play classifier guidance. Code is released at https://github.com/bytedance/dplm.

References (123)
  1. Protein generation with evolutionary diffusion: sequence is all you need. bioRxiv, pp.  2023–09, 2023.
  2. Structured denoising diffusion models in discrete state-spaces. In Advances in Neural Information Processing Systems, volume 34, pp.  17981–17993, 2021.
  3. A neural probabilistic language model. Advances in neural information processing systems, 13, 2000.
  4. The protein data bank. Nucleic acids research, 28(1):235–242, 2000.
  5. Proteinbert: a universal deep-learning model of protein sequence and function. Bioinformatics, 38(8):2102–2110, 2022.
  6. Language models are few-shot learners. volume 33, pp.  1877–1901, 2020.
  7. A cheaper and better diffusion language model with soft-masked noise. arXiv preprint arXiv:2304.04746, 2023a.
  8. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023b.
  9. Deconstructing denoising diffusion models for self-supervised learning. arXiv preprint arXiv:2401.14404, 2024a.
  10. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024b.
  11. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
  12. Flip: Benchmark tasks in fitness landscape inference for proteins. bioRxiv, pp.  2021–11, 2021.
  13. Robust deep learning–based protein sequence design using proteinmpnn. Science, 378(6615):49–56, 2022.
  14. Atomic context-conditioned protein sequence design using ligandmpnn. bioRxiv, pp.  2023–12, 2023.
  15. Riemannian score-based generative modelling. Advances in Neural Information Processing Systems, 35:2406–2422, 2022.
  16. DeepMind, G. Performance and structural coverage of the latest, in-development alphafold model. 2023.
  17. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://www.aclweb.org/anthology/N19-1423.
  18. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021a.
  19. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021b.
  20. Continuous diffusion for categorical data. arXiv preprint arXiv:2211.15089, 2022.
  21. Prottrans: Toward understanding the language of life through self-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 44(10):7112–7127, 2021.
  22. Controllable protein design with language models. Nature Machine Intelligence, 4(6):521–532, 2022.
  23. Protgpt2 is a deep unsupervised language model for protein design. Nature communications, 13(1):4348, 2022.
  24. Specializing smaller language models towards multi-step reasoning. arXiv preprint arXiv:2301.12726, 2023.
  25. Difformer: Empowering diffusion model on embedding space for text generation. arXiv preprint arXiv:2212.09412, 2022a.
  26. Pifold: Toward effective and efficient protein inverse folding. arXiv preprint arXiv:2209.12643, 2022b.
  27. Mask-predict: Parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  6112–6121, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1633. URL https://www.aclweb.org/anthology/D19-1633.
  28. Diffuseq: Sequence to sequence text generation with diffusion models. arXiv preprint arXiv:2210.08933, 2022.
  29. Non-autoregressive neural machine translation. In International Conference on Learning Representations, 2018.
  30. Ssd-lm: Semi-autoregressive simplex-based diffusion language model for text generation and modular control. arXiv preprint arXiv:2210.17432, 2022.
  31. Pre-training co-evolutionary protein representation via a pairwise masked language model. arXiv preprint arXiv:2110.15527, 2021.
  32. Diffusionbert: Improving generative masked language models with diffusion models. 2023.
  33. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
  34. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020. URL https://proceedings.neurips.cc/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf.
  35. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
  36. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
  37. Autoregressive diffusion models. In International Conference on Learning Representations, 2021a.
  38. Argmax flows and multinomial diffusion: Learning categorical distributions. Advances in Neural Information Processing Systems, 34:12454–12465, 2021b.
  39. Equivariant diffusion for molecule generation in 3d. In International Conference on Machine Learning, pp.  8867–8887. PMLR, 2022.
  40. Learning inverse folding from millions of predicted structures. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.  8946–8970. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/hsu22a.html.
  41. Exploring evolution-aware &-free protein language models as protein function predictors. In Advances in Neural Information Processing Systems, 2022.
  42. Directed acyclic transformer pre-training for high-quality non-autoregressive text generation. Transactions of the Association for Computational Linguistics, 2023.
  43. Generative models for graph-based protein design. In Advances in neural information processing systems, 2019.
  44. Illuminating protein space with a programmable generative model. Nature, pp.  1–9, 2023.
  45. Learning from protein structure with geometric vector perceptrons. In International Conference on Learning Representations, 2020.
  46. Generating novel protein sequences using gibbs sampling of masked language models. bioRxiv, pp.  2021–01, 2021.
  47. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589, 2021.
  48. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  49. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In International Conference on Machine Learning, pp.  5530–5540. PMLR, 2021.
  50. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
  51. Generalized biomolecular modeling and design with rosettafold all-atom. bioRxiv, pp.  2023–10, 2023.
  52. Proteinsgm: Score-based generative modeling for de novo protein design. bioRxiv, pp.  2022–07, 2022.
  53. Diffusion-lm improves controllable text generation. In Advances in Neural Information Processing Systems, volume abs/2205.14217, 2022.
  54. Generating novel, designable, and diverse protein structures by equivariantly diffusing oriented residue clouds. arXiv preprint arXiv:2301.12485, 2023.
  55. Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv, 2022.
  56. Joint generation of protein sequence and structure with rosettafold sequence space diffusion. bioRxiv, pp.  2023–05, 2023.
  57. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  58. Self-supervised contrastive learning of protein representations by mutual information maximization. bioRxiv, pp.  2020–09, 2020.
  59. Deep neural language modeling enables functional protein generation across families. bioRxiv, pp.  2021–07, 2021.
  60. Adversarial contrastive pre-training for protein sequences. arXiv preprint arXiv:2102.00466, 2021.
  61. Language models enable zero-shot prediction of the effects of mutations on protein function. In Advances in Neural Information Processing Systems, pp.  29287–29303, 2021.
  62. Reprogramming large pretrained language models for antibody sequence infilling. arXiv preprint arXiv:2210.07144, 2022.
  63. Efficient estimation of word representations in vector space. In Bengio, Y. and LeCun, Y. (eds.), 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, 2013. URL http://arxiv.org/abs/1301.3781.
  64. Pre-training of deep bidirectional protein sequence representations with structural information. IEEE Access, 9:123912–123926, 2021.
  65. Scaling data-constrained language models. arXiv preprint arXiv:2305.16264, 2023.
  66. Transforming the language of life: transformer neural networks for protein prediction tasks. In Proceedings of the 11th ACM international conference on bioinformatics, computational biology and health informatics, pp.  1–8, 2020.
  67. Progen2: exploring the boundaries of protein language models. arXiv preprint arXiv:2206.13517, 2022.
  68. Tripletprot: deep representation learning of proteins based on siamese networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 19(6):3744–3753, 2021.
  69. OpenAI. Gpt-4 technical report, 2023.
  70. Cath–a hierarchic classification of protein domain structures. Structure, 5(8):1093–1109, 1997.
  71. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  72. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp.  2227–2237, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1202. URL https://aclanthology.org/N18-1202.
  73. The volctrans glat system: Non-autoregressive translation meets wmt21. WMT 2021, pp.  187, 2021.
  74. Diff-glat: Diffusion glancing transformer for parallel sequence to sequence learning. arXiv preprint arXiv:2212.10240, 2022.
  75. Improving language understanding by generative pre-training. 2018.
  76. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  77. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.
  78. Evaluating protein transfer learning with tape. Advances in neural information processing systems, 32, 2019.
  79. Msa transformer. In International Conference on Machine Learning, pp.  8844–8856. PMLR, 2021.
  80. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. PNAS, 2019. doi: 10.1101/622803. URL https://www.biorxiv.org/content/10.1101/622803v4.
  81. High-resolution image synthesis with latent diffusion models, 2021.
  82. Multitask prompted training enables zero-shot task generalization. In ICLR 2022-Tenth International Conference on Learning Representations, 2022.
  83. Deep unsupervised learning using nonequilibrium thermodynamics. In Bach, F. and Blei, D. (eds.), International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp.  2256–2265, Lille, France, 07–09 Jul 2015. PMLR, PMLR. URL https://proceedings.mlr.press/v37/sohl-dickstein15.html.
  84. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
  85. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2020.
  86. Udsmprot: universal deep sequence models for protein classification. Bioinformatics, 36(8):2401–2409, 2020.
  87. Profile prediction: An alignment-based pre-training task for protein sequence models. arXiv preprint arXiv:2012.00195, 2020.
  88. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.
  89. Saprot: Protein language modeling with structure-aware vocabulary. bioRxiv, pp.  2023–10, 2023.
  90. Moss. https://github.com/OpenLMLab/MOSS, 2023.
  91. Sequence to sequence learning with neural networks. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, volume 27, pp.  3104–3112, 2014. URL https://proceedings.neurips.cc/paper/2014/hash/a14ac55a4f27472c5d894ec1c3c743d2-Abstract.html.
  92. Uniref clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics, 31(6):926–932, 2015.
  93. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  94. Llama: Open and efficient foundation language models, 2023a.
  95. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  96. Diffusion probabilistic modeling of protein backbones in 3d for the motif-scaffolding problem. arXiv preprint arXiv:2206.04119, 2022.
  97. Learning functional properties of proteins with language models. Nature Machine Intelligence, 4(3):227–245, 2022.
  98. Fast and accurate protein structure search with foldseek. Nature Biotechnology, pp.  1–4, 2023.
  99. Attention is all you need. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V. N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, volume 30, pp.  5998–6008, 2017.
  100. Language models generalize beyond natural proteins. bioRxiv, pp.  2022–12, 2022.
  101. Digress: Discrete denoising diffusion for graph generation. In The Eleventh International Conference on Learning Representations, 2022.
  102. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(12), 2010.
  103. BERT has a mouth, and it must speak: BERT as a Markov random field language model. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pp.  30–36, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-2304. URL https://www.aclweb.org/anthology/W19-2304.
  104. De novo design of protein structure and function with rfdiffusion. Nature, 620(7976):1089–1100, 2023.
  105. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2021.
  106. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022a.
  107. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pp.  24824–24837, 2022b.
  108. Protein structure generation via folding diffusion. arXiv preprint arXiv:2209.15611, 2022a.
  109. High-resolution de novo structure prediction from primary sequence. bioRxiv, pp.  2022–07, 2022b.
  110. Ar-diffusion: Auto-regressive diffusion model for text generation. arXiv preprint arXiv:2305.09515, 2023.
  111. Modeling protein using large-scale pretrain language model. arXiv preprint arXiv:2108.07435, 2021.
  112. Peer: a comprehensive and multi-task benchmark for protein sequence understanding. Advances in Neural Information Processing Systems, 35:35156–35173, 2022.
  113. Machine-learning-guided directed evolution for protein engineering. Nature methods, 16(8):687–694, 2019.
  114. Convolutions are competitive with transformers for protein sequence pretraining. bioRxiv, pp.  2022–05, 2022a.
  115. Masked inverse folding with sequence transfer for protein representation learning. bioRxiv, pp.  2022–05, 2022b.
  116. Diffusion language models can perform many tasks with scaling and instruction-finetuning. arXiv preprint arXiv:2308.12219, 2023a.
  117. Dinoiser: Diffused conditional sequence learning by manipulating noises. arXiv preprint arXiv:2302.10025, 2023b.
  118. Graph denoising diffusion for inverse protein folding. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=u4YXKKG5dX.
  119. Se (3) diffusion model with application to protein backbone generation. arXiv preprint arXiv:2302.02277, 2023.
  120. Seqdiffuseq: Text diffusion with encoder-decoder transformers. arXiv preprint arXiv:2212.10325, 2022.
  121. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.
  122. A reparameterized discrete diffusion model for text generation. arXiv preprint arXiv:2302.05737, 2023a.
  123. Structure-informed language models are protein designers. In International Conference on Machine Learning, 2023b.
Authors (6)
  1. Xinyou Wang (5 papers)
  2. Zaixiang Zheng (25 papers)
  3. Fei Ye (78 papers)
  4. Dongyu Xue (9 papers)
  5. Shujian Huang (106 papers)
  6. Quanquan Gu (198 papers)
Citations (15)

Summary

Overview of "Diffusion Language Models Are Versatile Protein Learners"

The paper presents the diffusion protein language model (DPLM), a protein language model designed to handle both generative and predictive tasks on protein sequences. DPLM leverages a discrete diffusion probabilistic framework to generalize language modeling for proteins in a principled way. The paper positions DPLM within a landscape where conventional masked and autoregressive language models have limitations in jointly capturing and generating protein sequences.
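
To make the training idea concrete, here is a minimal sketch of one common form of discrete diffusion over amino-acid tokens: an absorbing-state ("masking") forward process paired with a bidirectional Transformer trained to recover the corrupted positions. This is an illustration under stated assumptions, not the released DPLM code; the `ProteinDenoiser` module, vocabulary layout, and loss weighting are all assumptions made for the example.

```python
# Minimal sketch of absorbing-state ("masking") discrete diffusion training for
# protein sequences. Module names, vocabulary layout, and the 1/t loss weighting
# are illustrative assumptions, not the released DPLM implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 33   # e.g. 20 amino acids plus special tokens (assumption)
MASK_ID = 32      # index of the absorbing [MASK] token (assumption)

class ProteinDenoiser(nn.Module):
    """Bidirectional Transformer that predicts the original token at every position."""
    def __init__(self, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens):                      # tokens: (B, L) long
        return self.head(self.encoder(self.embed(tokens)))  # (B, L, V) logits

def diffusion_loss(model, x0):
    """Sample a noise level t, mask roughly a fraction t of tokens, score recovery."""
    B, L = x0.shape
    t = torch.rand(B, 1).clamp(min=1e-3)            # per-sequence noise level in (0, 1]
    corrupt = torch.rand(B, L) < t                  # forward process: mask ~t of positions
    xt = torch.where(corrupt, torch.full_like(x0, MASK_ID), x0)
    logits = model(xt)
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # (B, L)
    # Reweight the masked-token cross-entropy by 1/t, as in common
    # absorbing-diffusion objectives (the exact weighting is an assumption here).
    loss = (ce * corrupt.float() / t).sum() / corrupt.float().sum().clamp(min=1)
    return loss

if __name__ == "__main__":
    model = ProteinDenoiser()
    toy_batch = torch.randint(0, 20, (8, 128))      # 8 random length-128 "sequences"
    print(diffusion_loss(model, toy_batch).item())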

Key Contributions

  1. Framework and Architecture:
    • DPLM is rooted in a discrete diffusion probabilistic framework which handles the inherent discreteness of amino acid sequences, unlike continuous diffusion frameworks that require continuous relaxations.
    • It offers both generative and representation learning capabilities, catering to a comprehensive range of tasks from unconditional generation to structure-conditioned sequence design.
  2. Generative Capabilities:
    • DPLM generates sequences that fold into structurally plausible forms, with average pLDDT scores exceeding 80 across a range of sequence lengths.
    • The generated sequences are also diverse and novel, yielding foldable proteins whose structures differ from known Protein Data Bank (PDB) entries.
  3. Representation Learning:
    • DPLM compares favorably to ESM2 and other masked language models across several protein predictive tasks, such as thermostability prediction, protein-protein interaction, and metal ion binding classification (a minimal fine-tuning sketch follows this list).
    • The gain is attributed to DPLM's diffusion pre-training, which yields a deeper contextual understanding of sequences and, in turn, better predictions.
  4. Extensible Conditioning Strategies:
    • DPLM is versatile in conditional generation, including sequence conditioning, multi-modal conditioning, and controllable generation via discrete classifier guidance.
    • Demonstrated applications include motif scaffolding, inverse protein folding with strong structure-validation metrics such as scTM and pLDDT, and secondary-structure-guided generation (a simplified guidance sketch follows this list).
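
As referenced in point 3, below is a minimal sketch of how the pre-trained denoiser can be reused as a representation learner for a downstream predictive task such as thermostability regression. The mean pooling and head architecture are assumptions, and the wrapped `pretrained_denoiser` is the illustrative `ProteinDenoiser` defined in the earlier sketch, not the paper's fine-tuning recipe.

```python
# Illustrative fine-tuning sketch: reuse the pre-trained denoiser's embedding and
# encoder as a sequence encoder and attach a small regression head (e.g. for
# thermostability). Pooling and head choices are assumptions for illustration.
import torch.nn as nn

class SequenceRegressor(nn.Module):
    def __init__(self, pretrained_denoiser, d_model=256):
        super().__init__()
        self.embed = pretrained_denoiser.embed      # reuse pre-trained weights
        self.encoder = pretrained_denoiser.encoder
        self.head = nn.Sequential(
            nn.Linear(d_model, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, tokens):                      # tokens: (B, L) long
        h = self.encoder(self.embed(tokens))        # (B, L, d_model) residue features
        pooled = h.mean(dim=1)                      # mean-pool over residues
        return self.head(pooled).squeeze(-1)        # one scalar per sequence
```

The whole model can be trained end to end with a standard regression loss, or the encoder can be frozen for a lightweight probe; classification tasks such as metal ion binding would swap in a suitable output head and loss.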
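
As referenced in point 4, the plug-and-play guidance can be approximated, in a simplified form, by letting an external property classifier re-rank the denoiser's proposals at each unmasking step. The sketch below uses candidate re-ranking rather than the paper's exact guidance rule; the classifier interface, step schedule, and default sizes are assumptions.

```python
# Simplified sketch of classifier-guided iterative generation: start from an
# all-[MASK] sequence and, at each step, unmask the most confident positions,
# proposing several candidate fillings and keeping the one an external property
# classifier scores highest. This re-ranking is an approximation standing in for
# the paper's guidance rule; the classifier interface and schedule are assumptions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def guided_generate(denoiser, classifier_logprob, length=100, steps=20,
                    candidates=8, mask_id=32):
    x = torch.full((1, length), mask_id)            # start fully masked
    for step in range(steps):
        still_masked = (x == mask_id)
        if still_masked.sum() == 0:
            break
        probs = F.softmax(denoiser(x), dim=-1)      # (1, L, V)
        conf = probs.max(dim=-1).values             # model confidence per position
        # Unmask a proportional share of the remaining masked positions.
        n_unmask = max(1, int(still_masked.sum().item() / (steps - step)))
        scores = torch.where(still_masked, conf, torch.full_like(conf, -1.0))
        _, idx = scores.topk(n_unmask, dim=-1)
        # Propose several fillings for those positions and let the external
        # classifier pick the best one (plug-and-play guidance, approximated).
        best, best_score = x, -float("inf")
        for _ in range(candidates):
            cand = x.clone()
            sampled = torch.multinomial(probs[0, idx[0]], num_samples=1).squeeze(-1)
            cand[0, idx[0]] = sampled
            score = float(classifier_logprob(cand)) # log p(desired property | sequence)
            if score > best_score:
                best, best_score = cand, score
        x = best
    return x
```

Swapping `classifier_logprob` for, say, a secondary-structure predictor's log-likelihood of a target annotation steers generation toward that property without retraining the denoiser, which is the spirit of the plug-and-play guidance described in the paper.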

Implications and Future Directions

The introduction of DPLM marks a significant innovation in protein modeling. Its adoption of the diffusion model framework caters well to the sequential and structured nature of proteins, demonstrating the potential of DPLM to bridge gaps left by previous models. The improvement in structural plausibility provides opportunities for real-world applications in protein design, including therapeutics and enzyme modeling.

Practically, DPLM's conditioning capabilities mean it can be applied to complex tasks such as antibody design or ligand-binding problems in drug discovery. Its adaptability also suggests extensions to other biological polymers such as RNA or DNA, broadening its relevance across molecular biology.

As the paper suggests, future research might explore ways to incorporate structural information more directly into DPLM or extend its architecture to accommodate longer sequences, given its foundational flexibility and demonstrated early promise with complex proteins. With advancements in parallel computation and model scaling, DPLM could further advance the synthesis and functional understanding of proteins, potentially revolutionizing approaches in bioinformatics, synthetic biology, and beyond.

Overall, DPLM sets a precedent for utilizing diffusion-based approaches in biological sequence modeling, expanding the toolkit available for addressing biological research and application challenges.
