
Diffusion on language model embeddings for protein sequence generation (2403.03726v1)

Published 6 Mar 2024 in cs.LG, cs.AI, and q-bio.BM

Abstract: Protein design requires a deep understanding of the inherent complexities of the protein universe. While many efforts lean towards conditional generation or focus on specific families of proteins, the foundational task of unconditional generation remains underexplored and undervalued. Here, we explore this pivotal domain, introducing DiMA, a model that leverages continuous diffusion on embeddings derived from the protein language model, ESM-2, to generate amino acid sequences. DiMA surpasses leading solutions, including autoregressive transformer-based and discrete diffusion models, and we quantitatively illustrate the impact of the design choices that lead to its superior performance. We extensively evaluate the quality, diversity, distribution similarity, and biological relevance of the generated sequences using multiple metrics across various modalities. Our approach consistently produces novel, diverse protein sequences that accurately reflect the inherent structural and functional diversity of the protein space. This work advances the field of protein design and sets the stage for conditional models by providing a robust framework for scalable and high-quality protein sequence generation.
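
The recipe the abstract describes (encode sequences with a frozen protein language model, run Gaussian diffusion in the resulting continuous embedding space, then decode denoised embeddings back to residues) can be illustrated compactly. The sketch below is not the authors' DiMA implementation: the frozen encoder is mocked by a plain embedding table rather than an ESM-2 checkpoint, and the x0-predicting transformer denoiser, linear noise schedule, and deterministic DDIM-style sampler are generic stand-ins chosen for brevity.

```python
# Minimal, illustrative sketch of DiMA-style continuous diffusion in a
# protein language-model embedding space. NOT the authors' code: the
# frozen encoder is a stand-in embedding table instead of ESM-2, and the
# denoiser, noise schedule, and sampler are generic assumptions.
import torch
import torch.nn as nn

VOCAB, DIM, T = 33, 320, 100   # ESM-2-like vocab size and width; few steps for demo

embed = nn.Embedding(VOCAB, DIM)     # stand-in for frozen protein-LM embeddings
decoder = nn.Linear(DIM, VOCAB)      # maps denoised embeddings back to residues
denoiser = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True), num_layers=4
)
time_mlp = nn.Sequential(nn.Linear(1, DIM), nn.SiLU(), nn.Linear(DIM, DIM))

betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def training_step(tokens: torch.Tensor) -> torch.Tensor:
    """One x0-prediction denoising loss on embeddings of real sequences."""
    x0 = embed(tokens)                            # (batch, length, DIM)
    t = torch.randint(0, T, (tokens.size(0),))
    a = alpha_bar[t].view(-1, 1, 1)
    noise = torch.randn_like(x0)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * noise   # forward process q(x_t | x_0)
    cond = time_mlp(t.float().view(-1, 1) / T).unsqueeze(1)
    x0_hat = denoiser(xt + cond)                  # predict the clean embeddings
    return ((x0_hat - x0) ** 2).mean()

@torch.no_grad()
def sample(batch: int, length: int) -> torch.Tensor:
    """Deterministic DDIM-style sampling in embedding space, decoded to tokens."""
    x = torch.randn(batch, length, DIM)
    for t in reversed(range(T)):
        cond = time_mlp(torch.full((batch, 1), t / T)).unsqueeze(1)
        x0_hat = denoiser(x + cond)
        a = alpha_bar[t]
        a_prev = alpha_bar[t - 1] if t > 0 else torch.tensor(1.0)
        eps_hat = (x - a.sqrt() * x0_hat) / (1 - a).sqrt()
        x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps_hat
    return decoder(x).argmax(-1)                  # token ids, one per residue

loss = training_step(torch.randint(0, VOCAB, (2, 48)))
print(f"demo loss {loss.item():.3f}; sample shape {tuple(sample(1, 48).shape)}")
```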

Authors (7)
  1. Viacheslav Meshchaninov
  2. Pavel Strashnov
  3. Andrey Shevtsov
  4. Fedor Nikolaev
  5. Nikita Ivanisenko
  6. Olga Kardymon
  7. Dmitry Vetrov

Summary

A Comprehensive Overview of Submission and Formatting Instructions for ICML 2024

Introduction

The annual International Conference on Machine Learning (ICML) is a central venue for scholars, researchers, and practitioners in machine learning to exchange insights and results. The 2024 submission guidelines give prospective contributors the formatting requirements, submission procedures, and ethical standards their papers must satisfy to be considered for the conference proceedings. This blog post summarizes and clarifies those rules so that authors can navigate the submission process efficiently and in full compliance.

Electronic Submission and Preparations

ICML 2024 accepts submissions exclusively through its electronic system; email and hard-copy submissions are not accepted. Appendices must be combined with the main manuscript and references into a single PDF file, a consolidation intended to prevent material from being overlooked during review. Key requirements include:

  • PDF Format Exclusivity: The manuscript, including appendices, must be submitted as a single PDF.
  • Page Limitations: The main body of the paper is limited to 8 pages, with references and appendices allowed additional space; authors of accepted papers may expand the main body by one extra page in the final submission (a sketch of an automated page-count check follows this list).
  • Author Anonymity: Initial submissions must omit author identities, in keeping with ICML's double-blind review process.
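
Because the single-PDF and page-limit rules are mechanical, they can be checked before upload. Below is a hypothetical pre-submission check using the pypdf library; `main_body_pages` is an assumption the author must supply, since the 8-page limit excludes references and appendices.

```python
# Hypothetical pre-submission check: confirm the file parses as a PDF and
# report page counts. `main_body_pages` must be supplied by the author,
# because the 8-page limit excludes references and appendices.
from pypdf import PdfReader

def check_submission(pdf_path: str, main_body_pages: int, limit: int = 8) -> None:
    reader = PdfReader(pdf_path)   # raises an error if the file is not a valid PDF
    total = len(reader.pages)
    print(f"{pdf_path}: {total} pages total, {main_body_pages} in the main body")
    if main_body_pages > limit:
        print(f"WARNING: main body exceeds the {limit}-page limit")

check_submission("paper.pdf", main_body_pages=8)
```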

Style and Formatting Nuances

The document's stylistic and typographic elements carry significant weight. Text must be set in 10-point Times throughout, and there are exacting specifications for figure captions, table placement, and reference formatting. Key points include:

  • Font Integrity: Type-1 fonts are mandatory, to avoid rendering problems when the document is processed (a hypothetical font check is sketched after this list).
  • Element Positioning: Figures, tables, and references must follow specific placement and formatting directives to maintain consistency and readability.
  • Reference Formatting: Citations are ordered chronologically and should include complete details, with page numbers where feasible, ensuring a uniform bibliography.
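
In practice, font compliance is verified by listing the fonts embedded in the PDF. One common route is the `pdffonts` tool from poppler-utils; the wrapper below is a sketch that assumes `pdffonts` is installed and prints one font per line after a two-line header, with Type 3 (bitmap) fonts being the usual cause of failures.

```python
# Hypothetical wrapper around poppler's `pdffonts` CLI that flags Type 3
# (bitmap) fonts. Assumes the tool prints a two-line header followed by
# one font per line, with the font type appearing in each row.
import subprocess

def flag_bad_fonts(pdf_path: str) -> list[str]:
    out = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, check=True
    ).stdout
    bad = [line for line in out.splitlines()[2:] if "Type 3" in line]
    for line in bad:
        print("non-Type-1 font found:", line)
    return bad

flag_bad_fonts("paper.pdf")
```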

Ethical Compliance and Anonymity

ICML's commitment to ethical scholarship is reflected in its strict policies: simultaneous submissions found to be under consideration elsewhere are summarily rejected. The anonymity requirement extends to removing any form of author identification from the submission text, fostering an unbiased review process. Authors' own prior work should be cited in a manner that preserves the double-blind nature of the review.

Evaluation Metrics and Ablation Studies

The guidelines place a distinctive emphasis on rigorous empirical evaluation, with the paper's performance comparison table serving as a template. The ablation study both benchmarks the proposed DiMA model against prevailing architectures and isolates the incremental gains contributed by successive model variants, evidencing a methodical approach to refinement. A sketch of one common distribution-similarity metric used in this kind of evaluation appears below.
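
The paper's abstract lists distribution similarity among its evaluation axes. A standard instantiation of such a metric is the Fréchet distance between Gaussian fits to two sets of embeddings (the construction behind FID for images); the sketch below is illustrative only, an assumption about the style of metric rather than the paper's exact protocol.

```python
# Illustrative Fréchet distance between two sets of pooled sequence
# embeddings, a common distribution-similarity metric (as in FID).
# This is an assumed example metric, not the paper's exact protocol.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(x: np.ndarray, y: np.ndarray) -> float:
    """x, y: (n_sequences, embedding_dim) arrays of pooled embeddings."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    covmean = sqrtm(cov_x @ cov_y)
    if np.iscomplexobj(covmean):   # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x + cov_y - 2.0 * covmean))

rng = np.random.default_rng(0)
real = rng.normal(size=(256, 64))                # stand-ins for protein LM embeddings
generated = rng.normal(0.1, 1.0, size=(256, 64))
print(frechet_distance(real, generated))
```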

Implications and Theoretical Contributions

Beyond the formatting guidance, the paper's theoretical and practical implications for the machine learning community merit emphasis. The quantitative performance gains reported for the DiMA model underscore its potential for improving protein sequence modeling, and the theoretical underpinnings of its architecture propose a paradigm that may spur further research within the domain.

Conclusion and Future Directions

In sum, the ICML 2024 submission and formatting instructions provide a detailed blueprint for presenting work in a coherent, standardized manner. The guidelines are designed to support a fair and rigorous review process and to encourage high-quality papers that advance the field of machine learning. By adhering to them, authors contribute to the body of knowledge that ICML represents and help push the boundaries of machine learning research.
