Preference optimization of protein language models as a multi-objective binder design paradigm
Abstract: We present a multi-objective binder design paradigm based on instruction fine-tuning and direct preference optimization (DPO) of autoregressive protein language models (pLMs). Multiple design objectives are encoded into the pLM through direct optimization on expert-curated preference datasets comprising preferred and dispreferred sequence distributions. We show that the proposed alignment strategy enables ProtGPT2 to effectively design binders conditioned on a specified receptor and a drug-developability criterion. Generated binder samples show median isoelectric point (pI) improvements of $17\%$–$60\%$.
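The alignment step described above optimizes the pLM directly on preference pairs. As a minimal sketch of the standard DPO objective (the scalar log-likelihood inputs and the helper name `dpo_loss` are illustrative assumptions, not the paper's implementation), the loss for one preferred/dispreferred sequence pair is:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for a single preference pair.

    logp_w / logp_l: policy log-likelihoods of the preferred (w) and
    dispreferred (l) sequences given the prompt (e.g. the receptor).
    ref_logp_w / ref_logp_l: the same quantities under the frozen
    reference model. beta scales the implicit reward margin.
    """
    # Implicit reward margin: how much more the policy prefers the
    # winning sequence than the reference model does.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Negative log-sigmoid of the margin: small when the policy
    # separates preferred from dispreferred sequences well.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical values: the policy ranks the preferred binder higher,
# relative to the reference, so the loss is modest and shrinks as the
# margin grows.
loss = dpo_loss(logp_w=-10.0, logp_l=-14.0, ref_logp_w=-12.0, ref_logp_l=-13.0)
```

In training, these log-likelihoods come from summing per-token log-probabilities of each sequence under the policy and the frozen reference pLM, and the loss is averaged over a batch of preference pairs.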