Diffusion Language Models Are Versatile Protein Learners (2402.18567v2)

Published 28 Feb 2024 in cs.LG and q-bio.BM

Abstract: This paper introduces diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences. We first pre-train scalable DPLMs from evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework, which generalizes language modeling for proteins in a principled way. After pre-training, DPLM exhibits the ability to generate structurally plausible, novel, and diverse protein sequences for unconditional generation. We further demonstrate the proposed diffusion generative pre-training makes DPLM possess a better understanding of proteins, making it a superior representation learner, which can be fine-tuned for various predictive tasks, comparing favorably to ESM2 (Lin et al., 2022). Moreover, DPLM can be tailored for various needs, which showcases its prowess of conditional generation in several ways: (1) conditioning on partial peptide sequences, e.g., generating scaffolds for functional motifs with high success rate; (2) incorporating other modalities as conditioner, e.g., structure-conditioned generation for inverse folding; and (3) steering sequence generation towards desired properties, e.g., satisfying specified secondary structures, through a plug-and-play classifier guidance. Code is released at https://github.com/bytedance/dplm.

References (123)
  1. Protein generation with evolutionary diffusion: sequence is all you need. bioRxiv, pp.  2023–09, 2023.
  2. Structured denoising diffusion models in discrete state-spaces. In Advances in Neural Information Processing Systems, volume 34, pp.  17981–17993, 2021.
  3. A neural probabilistic language model. Advances in neural information processing systems, 13, 2000.
  4. The protein data bank. Nucleic acids research, 28(1):235–242, 2000.
  5. Proteinbert: a universal deep-learning model of protein sequence and function. Bioinformatics, 38(8):2102–2110, 2022.
  6. Language models are few-shot learners. volume 33, pp.  1877–1901, 2020.
  7. A cheaper and better diffusion language model with soft-masked noise. arXiv preprint arXiv:2304.04746, 2023a.
  8. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023b.
  9. Deconstructing denoising diffusion models for self-supervised learning. arXiv preprint arXiv:2401.14404, 2024a.
  10. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024b.
  11. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
  12. Flip: Benchmark tasks in fitness landscape inference for proteins. bioRxiv, pp.  2021–11, 2021.
  13. Robust deep learning–based protein sequence design using proteinmpnn. Science, 378(6615):49–56, 2022.
  14. Atomic context-conditioned protein sequence design using ligandmpnn. bioRxiv, pp.  2023–12, 2023.
  15. Riemannian score-based generative modelling. Advances in Neural Information Processing Systems, 35:2406–2422, 2022.
  16. DeepMind, G. Performance and structural coverage of the latest, in-development alphafold model. 2023.
  17. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://www.aclweb.org/anthology/N19-1423.
  18. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021a.
  19. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021b.
  20. Continuous diffusion for categorical data. arXiv preprint arXiv:2211.15089, 2022.
  21. Prottrans: Toward understanding the language of life through self-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 44(10):7112–7127, 2021.
  22. Controllable protein design with language models. Nature Machine Intelligence, 4(6):521–532, 2022.
  23. Protgpt2 is a deep unsupervised language model for protein design. Nature communications, 13(1):4348, 2022.
  24. Specializing smaller language models towards multi-step reasoning. arXiv preprint arXiv:2301.12726, 2023.
  25. Difformer: Empowering diffusion model on embedding space for text generation. arXiv preprint arXiv:2212.09412, 2022a.
  26. Pifold: Toward effective and efficient protein inverse folding. arXiv preprint arXiv:2209.12643, 2022b.
  27. Mask-predict: Parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  6112–6121, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1633. URL https://www.aclweb.org/anthology/D19-1633.
  28. Diffuseq: Sequence to sequence text generation with diffusion models. arXiv preprint arXiv:2210.08933, 2022.
  29. Non-autoregressive neural machine translation. In International Conference on Learning Representations, 2018.
  30. Ssd-lm: Semi-autoregressive simplex-based diffusion language model for text generation and modular control. arXiv preprint arXiv:2210.17432, 2022.
  31. Pre-training co-evolutionary protein representation via a pairwise masked language model. arXiv preprint arXiv:2110.15527, 2021.
  32. Diffusionbert: Improving generative masked language models with diffusion models. 2023.
  33. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
  34. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020. URL https://proceedings.neurips.cc/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf.
  35. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
  36. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
  37. Autoregressive diffusion models. In International Conference on Learning Representations, 2021a.
  38. Argmax flows and multinomial diffusion: Learning categorical distributions. Advances in Neural Information Processing Systems, 34:12454–12465, 2021b.
  39. Equivariant diffusion for molecule generation in 3d. In International Conference on Machine Learning, pp.  8867–8887. PMLR, 2022.
  40. Learning inverse folding from millions of predicted structures. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.  8946–8970. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/hsu22a.html.
  41. Exploring evolution-aware &-free protein language models as protein function predictors. In Advances in Neural Information Processing Systems, 2022.
  42. Directed acyclic transformer pre-training for high-quality non-autoregressive text generation. Transactions of the Association for Computational Linguistics, 2023.
  43. Generative models for graph-based protein design. In Advances in neural information processing systems, 2019.
  44. Illuminating protein space with a programmable generative model. Nature, pp.  1–9, 2023.
  45. Learning from protein structure with geometric vector perceptrons. In International Conference on Learning Representations, 2020.
  46. Generating novel protein sequences using gibbs sampling of masked language models. bioRxiv, pp.  2021–01, 2021.
  47. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589, 2021.
  48. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  49. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In International Conference on Machine Learning, pp.  5530–5540. PMLR, 2021.
  50. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
  51. Generalized biomolecular modeling and design with rosettafold all-atom. bioRxiv, pp.  2023–10, 2023.
  52. Proteinsgm: Score-based generative modeling for de novo protein design. bioRxiv, pp.  2022–07, 2022.
  53. Diffusion-lm improves controllable text generation. In Advances in Neural Information Processing Systems, volume abs/2205.14217, 2022.
  54. Generating novel, designable, and diverse protein structures by equivariantly diffusing oriented residue clouds. arXiv preprint arXiv:2301.12485, 2023.
  55. Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv, 2022.
  56. Joint generation of protein sequence and structure with rosettafold sequence space diffusion. bioRxiv, pp.  2023–05, 2023.
  57. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  58. Self-supervised contrastive learning of protein representations by mutual information maximization. bioRxiv, pp.  2020–09, 2020.
  59. Deep neural language modeling enables functional protein generation across families. bioRxiv, pp.  2021–07, 2021.
  60. Adversarial contrastive pre-training for protein sequences. arXiv preprint arXiv:2102.00466, 2021.
  61. Language models enable zero-shot prediction of the effects of mutations on protein function. In Advances in Neural Information Processing Systems, pp.  29287–29303, 2021.
  62. Reprogramming large pretrained language models for antibody sequence infilling. arXiv preprint arXiv:2210.07144, 2022.
  63. Efficient estimation of word representations in vector space. In Bengio, Y. and LeCun, Y. (eds.), 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, 2013. URL http://arxiv.org/abs/1301.3781.
  64. Pre-training of deep bidirectional protein sequence representations with structural information. IEEE Access, 9:123912–123926, 2021.
  65. Scaling data-constrained language models. arXiv preprint arXiv:2305.16264, 2023.
  66. Transforming the language of life: transformer neural networks for protein prediction tasks. In Proceedings of the 11th ACM international conference on bioinformatics, computational biology and health informatics, pp.  1–8, 2020.
  67. Progen2: exploring the boundaries of protein language models. arXiv preprint arXiv:2206.13517, 2022.
  68. Tripletprot: deep representation learning of proteins based on siamese networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 19(6):3744–3753, 2021.
  69. OpenAI. Gpt-4 technical report, 2023.
  70. Cath–a hierarchic classification of protein domain structures. Structure, 5(8):1093–1109, 1997.
  71. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  72. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp.  2227–2237, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1202. URL https://aclanthology.org/N18-1202.
  73. The volctrans glat system: Non-autoregressive translation meets wmt21. WMT 2021, pp.  187, 2021.
  74. Diff-glat: Diffusion glancing transformer for parallel sequence to sequence learning. arXiv preprint arXiv:2212.10240, 2022.
  75. Improving language understanding by generative pre-training. 2018.
  76. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  77. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.
  78. Evaluating protein transfer learning with tape. Advances in neural information processing systems, 32, 2019.
  79. Msa transformer. In International Conference on Machine Learning, pp.  8844–8856. PMLR, 2021.
  80. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. PNAS, 2019. doi: 10.1101/622803. URL https://www.biorxiv.org/content/10.1101/622803v4.
  81. High-resolution image synthesis with latent diffusion models, 2021.
  82. Multitask prompted training enables zero-shot task generalization. In ICLR 2022-Tenth International Conference on Learning Representations, 2022.
  83. Deep unsupervised learning using nonequilibrium thermodynamics. In Bach, F. and Blei, D. (eds.), International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp.  2256–2265, Lille, France, 07–09 Jul 2015. PMLR, PMLR. URL https://proceedings.mlr.press/v37/sohl-dickstein15.html.
  84. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
  85. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2020.
  86. Udsmprot: universal deep sequence models for protein classification. Bioinformatics, 36(8):2401–2409, 2020.
  87. Profile prediction: An alignment-based pre-training task for protein sequence models. arXiv preprint arXiv:2012.00195, 2020.
  88. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.
  89. Saprot: Protein language modeling with structure-aware vocabulary. bioRxiv, pp.  2023–10, 2023.
  90. Moss. https://github.com/OpenLMLab/MOSS, 2023.
  91. Sequence to sequence learning with neural networks. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, volume 27, pp.  3104–3112, 2014. URL https://proceedings.neurips.cc/paper/2014/hash/a14ac55a4f27472c5d894ec1c3c743d2-Abstract.html.
  92. Uniref clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics, 31(6):926–932, 2015.
  93. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  94. Llama: Open and efficient foundation language models, 2023a.
  95. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  96. Diffusion probabilistic modeling of protein backbones in 3d for the motif-scaffolding problem. arXiv preprint arXiv:2206.04119, 2022.
  97. Learning functional properties of proteins with language models. Nature Machine Intelligence, 4(3):227–245, 2022.
  98. Fast and accurate protein structure search with foldseek. Nature Biotechnology, pp.  1–4, 2023.
  99. Attention is all you need. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V. N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, volume 30, pp.  5998–6008, 2017.
  100. Language models generalize beyond natural proteins. bioRxiv, pp.  2022–12, 2022.
  101. Digress: Discrete denoising diffusion for graph generation. In The Eleventh International Conference on Learning Representations, 2022.
  102. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(12), 2010.
  103. BERT has a mouth, and it must speak: BERT as a Markov random field language model. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pp.  30–36, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-2304. URL https://www.aclweb.org/anthology/W19-2304.
  104. De novo design of protein structure and function with rfdiffusion. Nature, 620(7976):1089–1100, 2023.
  105. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2021.
  106. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022a.
  107. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pp.  24824–24837, 2022b.
  108. Protein structure generation via folding diffusion. arXiv preprint arXiv:2209.15611, 2022a.
  109. High-resolution de novo structure prediction from primary sequence. bioRxiv, pp.  2022–07, 2022b.
  110. Ar-diffusion: Auto-regressive diffusion model for text generation. arXiv preprint arXiv:2305.09515, 2023.
  111. Modeling protein using large-scale pretrain language model. arXiv preprint arXiv:2108.07435, 2021.
  112. Peer: a comprehensive and multi-task benchmark for protein sequence understanding. Advances in Neural Information Processing Systems, 35:35156–35173, 2022.
  113. Machine-learning-guided directed evolution for protein engineering. Nature methods, 16(8):687–694, 2019.
  114. Convolutions are competitive with transformers for protein sequence pretraining. bioRxiv, pp.  2022–05, 2022a.
  115. Masked inverse folding with sequence transfer for protein representation learning. bioRxiv, pp.  2022–05, 2022b.
  116. Diffusion language models can perform many tasks with scaling and instruction-finetuning. arXiv preprint arXiv:2308.12219, 2023a.
  117. Dinoiser: Diffused conditional sequence learning by manipulating noises. arXiv preprint arXiv:2302.10025, 2023b.
  118. Graph denoising diffusion for inverse protein folding. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=u4YXKKG5dX.
  119. Se (3) diffusion model with application to protein backbone generation. arXiv preprint arXiv:2302.02277, 2023.
  120. Seqdiffuseq: Text diffusion with encoder-decoder transformers. arXiv preprint arXiv:2212.10325, 2022.
  121. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.
  122. A reparameterized discrete diffusion model for text generation. arXiv preprint arXiv:2302.05737, 2023a.
  123. Structure-informed language models are protein designers. In International Conference on Machine Learning, 2023b.
Authors (6)
  1. Xinyou Wang (5 papers)
  2. Zaixiang Zheng (25 papers)
  3. Fei Ye (78 papers)
  4. Dongyu Xue (9 papers)
  5. Shujian Huang (106 papers)
  6. Quanquan Gu (198 papers)
Citations (15)

Summary

Overview of "Diffusion Language Models Are Versatile Protein Learners"

The paper presents the diffusion protein language model (DPLM), a protein language model designed to handle both generative and predictive tasks on protein sequences. DPLM leverages a discrete diffusion probabilistic framework to generalize language modeling for proteins in a principled way. The paper positions DPLM within a landscape where conventional masked and autoregressive language models have limitations in jointly capturing and generating protein sequences.
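
To make the training idea concrete, here is a minimal sketch of one common form of discrete diffusion over amino-acid tokens: an absorbing-state ("masking") forward process paired with a bidirectional Transformer trained to recover the corrupted positions. This is an illustration under stated assumptions, not the released DPLM code; the `ProteinDenoiser` module, vocabulary layout, and loss weighting are all assumptions made for the example.

```python
# Minimal sketch of absorbing-state ("masking") discrete diffusion training for
# protein sequences. Module names, vocabulary layout, and the 1/t loss weighting
# are illustrative assumptions, not the released DPLM implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 33   # e.g. 20 amino acids plus special tokens (assumption)
MASK_ID = 32      # index of the absorbing [MASK] token (assumption)

class ProteinDenoiser(nn.Module):
    """Bidirectional Transformer that predicts the original token at every position."""
    def __init__(self, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens):                      # tokens: (B, L) long
        return self.head(self.encoder(self.embed(tokens)))  # (B, L, V) logits

def diffusion_loss(model, x0):
    """Sample a noise level t, mask roughly a fraction t of tokens, score recovery."""
    B, L = x0.shape
    t = torch.rand(B, 1).clamp(min=1e-3)            # per-sequence noise level in (0, 1]
    corrupt = torch.rand(B, L) < t                  # forward process: mask ~t of positions
    xt = torch.where(corrupt, torch.full_like(x0, MASK_ID), x0)
    logits = model(xt)
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # (B, L)
    # Reweight the masked-token cross-entropy by 1/t, as in common
    # absorbing-diffusion objectives (the exact weighting is an assumption here).
    loss = (ce * corrupt.float() / t).sum() / corrupt.float().sum().clamp(min=1)
    return loss

if __name__ == "__main__":
    model = ProteinDenoiser()
    toy_batch = torch.randint(0, 20, (8, 128))      # 8 random length-128 "sequences"
    print(diffusion_loss(model, toy_batch).item())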

Key Contributions

  1. Framework and Architecture:
    • DPLM is rooted in a discrete diffusion probabilistic framework which handles the inherent discreteness of amino acid sequences, unlike continuous diffusion frameworks that require continuous relaxations.
    • It offers both generative and representation learning capabilities, catering to a comprehensive range of tasks from unconditional generation to structure-conditioned sequence design.
  2. Generative Capabilities:
    • DPLM generates sequences that fold into structurally plausible forms, with average pLDDT scores exceeding 80 across a range of sequence lengths.
    • The generated sequences are also diverse and novel, yielding foldable proteins whose structures differ from known Protein Data Bank (PDB) entries.
  3. Representation Learning:
    • DPLM compares favorably to ESM2 and other masked language models across several protein predictive tasks, such as thermostability prediction, protein-protein interaction, and metal ion binding classification (a minimal fine-tuning sketch follows this list).
    • The gain is attributed to DPLM's diffusion pre-training, which yields a deeper contextual understanding of sequences and, in turn, better predictions.
  4. Extensible Conditioning Strategies:
    • DPLM is versatile in conditional generation, including sequence conditioning, multi-modal conditioning, and controllable generation via discrete classifier guidance.
    • Demonstrated applications include motif scaffolding, inverse protein folding with strong structure-validation metrics such as scTM and pLDDT, and secondary-structure-guided generation (a simplified guidance sketch follows this list).
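
As referenced in point 3, below is a minimal sketch of how the pre-trained denoiser can be reused as a representation learner for a downstream predictive task such as thermostability regression. The mean pooling and head architecture are assumptions, and the wrapped `pretrained_denoiser` is the illustrative `ProteinDenoiser` defined in the earlier sketch, not the paper's fine-tuning recipe.

```python
# Illustrative fine-tuning sketch: reuse the pre-trained denoiser's embedding and
# encoder as a sequence encoder and attach a small regression head (e.g. for
# thermostability). Pooling and head choices are assumptions for illustration.
import torch.nn as nn

class SequenceRegressor(nn.Module):
    def __init__(self, pretrained_denoiser, d_model=256):
        super().__init__()
        self.embed = pretrained_denoiser.embed      # reuse pre-trained weights
        self.encoder = pretrained_denoiser.encoder
        self.head = nn.Sequential(
            nn.Linear(d_model, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, tokens):                      # tokens: (B, L) long
        h = self.encoder(self.embed(tokens))        # (B, L, d_model) residue features
        pooled = h.mean(dim=1)                      # mean-pool over residues
        return self.head(pooled).squeeze(-1)        # one scalar per sequence
```

The whole model can be trained end to end with a standard regression loss, or the encoder can be frozen for a lightweight probe; classification tasks such as metal ion binding would swap in a suitable output head and loss.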
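
As referenced in point 4, the plug-and-play guidance can be approximated, in a simplified form, by letting an external property classifier re-rank the denoiser's proposals at each unmasking step. The sketch below uses candidate re-ranking rather than the paper's exact guidance rule; the classifier interface, step schedule, and default sizes are assumptions.

```python
# Simplified sketch of classifier-guided iterative generation: start from an
# all-[MASK] sequence and, at each step, unmask the most confident positions,
# proposing several candidate fillings and keeping the one an external property
# classifier scores highest. This re-ranking is an approximation standing in for
# the paper's guidance rule; the classifier interface and schedule are assumptions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def guided_generate(denoiser, classifier_logprob, length=100, steps=20,
                    candidates=8, mask_id=32):
    x = torch.full((1, length), mask_id)            # start fully masked
    for step in range(steps):
        still_masked = (x == mask_id)
        if still_masked.sum() == 0:
            break
        probs = F.softmax(denoiser(x), dim=-1)      # (1, L, V)
        conf = probs.max(dim=-1).values             # model confidence per position
        # Unmask a proportional share of the remaining masked positions.
        n_unmask = max(1, int(still_masked.sum().item() / (steps - step)))
        scores = torch.where(still_masked, conf, torch.full_like(conf, -1.0))
        _, idx = scores.topk(n_unmask, dim=-1)
        # Propose several fillings for those positions and let the external
        # classifier pick the best one (plug-and-play guidance, approximated).
        best, best_score = x, -float("inf")
        for _ in range(candidates):
            cand = x.clone()
            sampled = torch.multinomial(probs[0, idx[0]], num_samples=1).squeeze(-1)
            cand[0, idx[0]] = sampled
            score = float(classifier_logprob(cand)) # log p(desired property | sequence)
            if score > best_score:
                best, best_score = cand, score
        x = best
    return x
```

Swapping `classifier_logprob` for, say, a secondary-structure predictor's log-likelihood of a target annotation steers generation toward that property without retraining the denoiser, which is the spirit of the plug-and-play guidance described in the paper.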

Implications and Future Directions

The introduction of DPLM marks a significant innovation in protein modeling. Its adoption of the diffusion model framework caters well to the sequential and structured nature of proteins, demonstrating the potential of DPLM to bridge gaps left by previous models. The improvement in structural plausibility provides opportunities for real-world applications in protein design, including therapeutics and enzyme modeling.

Practically, DPLM's conditioning capabilities mean it can be applied to complex tasks such as antibody design or ligand-binding problems in drug discovery. Its adaptability also suggests extensions to other biological polymers such as RNA or DNA, broadening its relevance across molecular biology.

As the paper suggests, future research might explore ways to incorporate structural information more directly into DPLM or extend its architecture to accommodate longer sequences, given its foundational flexibility and demonstrated early promise with complex proteins. With advancements in parallel computation and model scaling, DPLM could further advance the synthesis and functional understanding of proteins, potentially revolutionizing approaches in bioinformatics, synthetic biology, and beyond.

Overall, DPLM sets a precedent for utilizing diffusion-based approaches in biological sequence modeling, expanding the toolkit available for addressing biological research and application challenges.
