
Improving Inverse Folding for Peptide Design with Diversity-regularized Direct Preference Optimization (2410.19471v1)

Published 25 Oct 2024 in cs.LG and cs.AI

Abstract: Inverse folding models play an important role in structure-based design by predicting amino acid sequences that fold into desired reference structures. Models like ProteinMPNN, a message-passing encoder-decoder model, are trained to reliably produce new sequences from a reference structure. However, when applied to peptides, these models are prone to generating repetitive sequences that do not fold into the reference structure. To address this, we fine-tune ProteinMPNN to produce diverse and structurally consistent peptide sequences via Direct Preference Optimization (DPO). We derive two enhancements to DPO: online diversity regularization and domain-specific priors. Additionally, we develop a new understanding on improving diversity in decoder models. When conditioned on OpenFold generated structures, our fine-tuned models achieve state-of-the-art structural similarity scores, improving base ProteinMPNN by at least 8%. Compared to standard DPO, our regularized method achieves up to 20% higher sequence diversity with no loss in structural similarity score.
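
For readers unfamiliar with Direct Preference Optimization, the following is a minimal sketch of the standard per-pair DPO loss that the abstract builds on. It is not the paper's diversity-regularized variant, and it does not include the domain-specific priors. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions. The inputs are assumed to be summed sequence log-probabilities of a preferred ("winner") and a dispreferred ("loser") sequence under the fine-tuned policy and under a frozen reference model.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    logp_w / logp_l: log-probability of the preferred / dispreferred
        sequence under the policy being fine-tuned.
    ref_logp_w / ref_logp_l: the same quantities under the frozen
        reference model.
    beta: hypothetical default; scales the implicit reward margin.
    """
    # Implicit reward margin: how much more the policy favors the winner
    # over the loser, relative to the reference model.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Loss is -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2. The loss falls below that value as the policy shifts probability toward the preferred sequence.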

