ProLLaMA: A Protein Language Model for Multi-Task Protein Language Processing (2402.16445v2)

Published 26 Feb 2024 in cs.CE and q-bio.BM

Abstract: LLMs have achieved remarkable performance in multiple NLP tasks. Under the premise that protein sequences constitute the protein language, Protein Language Models (PLMs) have advanced the field of protein engineering. However, as of now, unlike LLMs in NLP, PLMs cannot handle the protein understanding task and the protein generation task simultaneously in the Protein Language Processing (PLP) field. This prompts us to delineate the inherent limitations in current PLMs: (i) the lack of natural language capabilities, (ii) insufficient instruction understanding, and (iii) high training resource demands. To address these challenges, we introduce a training framework to transform any general LLM into a PLM capable of handling multiple PLP tasks. To improve training efficiency, we propose Protein Vocabulary Pruning (PVP) for general LLMs. We construct a multi-task instruction dataset containing 13 million samples with superfamily information, facilitating better modeling of protein sequence-function landscapes. Through these methods, we develop the ProLLaMA model, the first known PLM to handle multiple PLP tasks simultaneously. Experiments show that ProLLaMA achieves state-of-the-art results in the unconditional protein sequence generation task. In the controllable protein sequence generation task, ProLLaMA can design novel proteins with desired functionalities. As for the protein understanding task, ProLLaMA achieves a 62% exact match rate in superfamily prediction. Codes, model weights, and datasets are available at https://github.com/PKU-YuanGroup/ProLLaMA and https://huggingface.co/GreatCaptainNemo.
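To make the Protein Vocabulary Pruning idea concrete, here is a minimal sketch, not the paper's exact procedure, of how one might identify the subset of a general LLM tokenizer's vocabulary that can ever appear in sequences written with the 20 standard one-letter amino-acid codes. The base model identifier and the keep/prune criterion are assumptions for illustration only.

```python
# Illustrative sketch of vocabulary pruning for a protein-only corpus.
# This is NOT the exact PVP procedure from the paper; it only shows the idea
# of selecting the tokens of a general tokenizer that can occur when
# tokenizing amino-acid sequences, so embedding/LM-head matrices could be
# shrunk before continual training.
from transformers import AutoTokenizer

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard one-letter codes

# Hypothetical base model; any causal LLM tokenizer would do.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

keep_ids = set(tokenizer.all_special_ids)  # always keep special tokens
for token, idx in tokenizer.get_vocab().items():
    text = tokenizer.convert_tokens_to_string([token]).strip()
    if text and set(text) <= AMINO_ACIDS:
        keep_ids.add(idx)

print(f"kept {len(keep_ids)} of {len(tokenizer)} tokens")
# In a real pipeline, the kept ids would be used to slice the embedding and
# output matrices, reducing memory and compute during continual training.
```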

References (58)
  1. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.  7319–7328, 2021.
  2. Protein generation with evolutionary diffusion: sequence is all you need. bioRxiv, pp.  2023–09, 2023.
  3. Unified rational protein engineering with sequence-based deep representation learning. Nature methods, 16(12):1315–1322, 2019.
  4. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023, 2023.
  5. A neural probabilistic language model. Advances in neural information processing systems, 13, 2000.
  6. Learning the protein language: Evolution, structure, and function. Cell systems, 12(6):654–669, 2021.
  7. The protein data bank. Acta Crystallographica Section D: Biological Crystallography, 58(6):899–907, 2002.
  8. Proteinbert: a universal deep-learning model of protein sequence and function. Bioinformatics, 38(8):2102–2110, 2022.
  9. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.
  10. Efficient and effective text encoding for chinese llama and alpaca. arXiv preprint arXiv:2304.08177, 2023.
  11. Davey, N. E. The functional importance of structure in unstructured protein regions. Current opinion in structural biology, 56:155–163, 2019.
  12. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  13. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence, 5(3):220–235, 2023.
  14. Prottrans: Toward understanding the language of life through self-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 44(10):7112–7127, 2021.
  15. Controllable protein design with language models. Nature Machine Intelligence, 4(6):521–532, 2022.
  16. Protgpt2 is a deep unsupervised language model for protein design. Nature communications, 13(1):4348, 2022.
  17. Towards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366, 2021.
  18. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  19. Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models. arXiv preprint arXiv:2304.01933, 2023.
  20. Is chatgpt better than human annotators? potential and limitations of chatgpt in explaining implicit hate speech. arXiv preprint arXiv:2302.07736, 2023.
  21. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589, 2021.
  22. Chatgpt: Jack of all trades, master of none. Information Fusion, pp.  101861, 2023.
  23. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.  4582–4597, 2021.
  24. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023.
  25. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35, 2023a.
  26. P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp.  61–68, 2022.
  27. Automatic instruction optimization for open-source llm instruction tuning. arXiv preprint arXiv:2311.13246, 2023b.
  28. Large language models generate functional protein sequences across diverse families. Nature Biotechnology, pp.  1–8, 2023.
  29. Language models enable zero-shot prediction of the effects of mutations on protein function. Advances in Neural Information Processing Systems, 34:29287–29303, 2021.
  30. Recent advances in natural language processing via large pre-trained language models: A survey. ACM Computing Surveys, 56(2):1–40, 2023.
  31. Design in the dark: learning deep generative models for de novo protein design. bioRxiv, pp.  2022–01, 2022.
  32. Progen2: exploring the boundaries of protein language models. Cell Systems, 14(11):968–978, 2023.
  33. Tranception: Protein fitness prediction with autoregressive transformers and inference-time retrieval. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.  16990–17017. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/notin22a.html.
  34. The language of proteins: Nlp, machine learning & protein sequences. Computational and Structural Biotechnology Journal, 19:1750–1758, 2021.
  35. Recent advances in de novo protein design: Principles, methods, and applications. Journal of Biological Chemistry, 296, 2021.
  36. Interpro in 2022. Nucleic acids research, 51(D1):D418–D427, 2023.
  37. Is chatgpt a general-purpose natural language processing task solver? arXiv preprint arXiv:2302.06476, 2023.
  38. Msa transformer. In International Conference on Machine Learning, pp.  8844–8856. PMLR, 2021.
  39. Proximal exploration for model-guided protein sequence design. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.  18520–18536. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/ren22a.html.
  40. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, 118(15):e2016239118, 2021.
  41. Importance weighted expectation-maximization for protein sequence design. arXiv preprint arXiv:2305.00386, 2023.
  42. Deep generative modeling for protein design. Current opinion in structural biology, 72:226–236, 2022.
  43. Uniref clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics, 31(6):926–932, 2015.
  44. Understanding the capabilities, limitations, and societal impact of large language models. arXiv preprint arXiv:2102.02503, 2021.
  45. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  46. Alphafold protein structure database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic acids research, 50(D1):D439–D444, 2022.
  47. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  48. Instructprotein: Aligning human and protein language via knowledge instruction. arXiv preprint arXiv:2310.03269, 2023.
  49. Wright, S. et al. The roles of mutation, inbreeding, crossbreeding, and selection in evolution. 1932.
  50. Protst: Multi-modality learning of protein sequences and biomedical texts. arXiv preprint arXiv:2301.12040, 2023.
  51. Learned protein embeddings for machine learning. Bioinformatics, 34(15):2642–2648, 2018.
  52. Evaluating large language models at evaluating instruction following. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023.
  53. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792, 2023.
  54. Scoring function for automated assessment of protein structure template quality. Proteins: Structure, Function, and Bioinformatics, 57(4):702–710, 2004.
  55. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
  56. Structure-informed language models are protein designers. bioRxiv, pp.  2023–02, 2023.
  57. Can chatgpt understand too? a comparative study on chatgpt and fine-tuned bert. arXiv preprint arXiv:2302.10198, 2023.
  58. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.
Authors (8)
  1. Liuzhenghao Lv
  2. Zongying Lin
  3. Hao Li
  4. Yuyang Liu
  5. Jiaxi Cui
  6. Calvin Yu-Chian Chen
  7. Li Yuan
  8. Yonghong Tian
Citations (16)

Summary

Overview of ProLLaMA: A Multi-Task Protein Language Model

The paper introduces ProLLaMA, a protein language model (PLM) designed for multi-task protein language processing. Unlike traditional PLMs, which primarily focus on a single task, typically de novo protein sequence generation, ProLLaMA addresses a broader spectrum of tasks through a training framework that extends a general LLM's capabilities to protein sequences. This approach carries advances from NLP LLMs over to the protein language domain, addressing inherent limitations of current PLMs such as the lack of multi-task capability and insufficient understanding of natural language instructions.
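As a concrete illustration of this instruction-driven usage, the sketch below loads the released checkpoint with Hugging Face transformers and issues one generation-style and one understanding-style prompt. The model identifier is inferred from the linked Hugging Face account and the prompt templates are illustrative; the exact instruction format is documented in the project's repository.

```python
# Hedged usage sketch: querying an instruction-tuned protein language model
# with transformers. The model id and prompt strings below are assumptions;
# consult the official ProLLaMA repository for the exact templates.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GreatCaptainNemo/ProLLaMA"  # assumed model identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Illustrative instructions: a controllable generation task and an
# understanding (superfamily prediction) task.
prompts = [
    "[Generate by superfamily] Superfamily=<Trypsin-like serine protease>",
    "[Determine superfamily] Seq=<MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPK>",
]
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.9)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```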

Key Contributions and Results

The paper outlines the architectural and methodological advancements in the ProLLaMA model, which are crucial for its multi-task proficiency:

  1. Training Framework: The authors propose a two-stage training framework to transform general LLMs into PLMs. This involves continual learning on protein language data followed by instruction tuning. Notably, the training strategy employs Low-Rank Adaptation (LoRA), which keeps the approach scalable and efficient by reducing the computational overhead of training (a minimal LoRA sketch follows this list).
  2. Multi-Task Capability: ProLLaMA handles multiple protein-related tasks, including unconditional and controllable protein sequence generation as well as protein property prediction. It achieves state-of-the-art results in unconditional generation and, in the controllable setting, can design novel proteins with functionalities specified through user instructions.
  3. Numerical Performance: ProLLaMA demonstrates strong numerical results. It achieves high pLDDT and TM-score values for generated sequences, outperforming existing PLMs on metrics of structural plausibility and similarity to known protein structures. In the protein understanding task, it attains a 62% exact match rate in superfamily prediction.
  4. Natural Language Integration: By retaining its natural language capabilities, ProLLaMA handles instruction-driven tasks that current PLMs cannot. The model thus provides a bridge between NLP and protein language processing, using natural language instructions to extend its applicability.
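As referenced in item 1 above, the following is a minimal sketch of LoRA-based parameter-efficient tuning with the Hugging Face peft and transformers libraries. It assumes a LLaMA-2 base model and a placeholder JSON dataset of instruction-formatted protein records, and it illustrates the kind of training loop the two-stage recipe would run (first on raw protein sequences, then on instruction data); it is not the authors' exact configuration.

```python
# Minimal LoRA fine-tuning sketch (assumed base model, placeholder data).
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "meta-llama/Llama-2-7b-hf"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Attach low-rank adapters to the attention projections; hyperparameters
# are illustrative, not the paper's settings.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# Placeholder dataset: JSON records with a "text" field holding either raw
# protein sequences (stage 1) or instruction-formatted samples (stage 2).
data = load_dataset("json", data_files="protein_instructions.json")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="prollama-lora",
                           per_device_train_batch_size=4,
                           num_train_epochs=1,
                           learning_rate=1e-4),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```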

Implications and Future Directions

ProLLaMA has significant implications for computational biology and biotechnology, aligning with contemporary needs in drug discovery and synthetic biology. Its functionality lets researchers explore protein engineering with greater precision, driven by natural language instructions. The adaptability of ProLLaMA to additional tasks through its scalable training framework suggests a compelling avenue for further research, potentially broadening the use of AI models in protein science and expediting biotechnological advances.

Moreover, this research underscores the importance of interdisciplinary strategies in advancing domain-specific LLMs. The methodology sets a precedent for future AI developments in scientific domains, where models are expected to handle diverse tasks seamlessly. Future work might further refine ProLLaMA's natural language understanding, enabling more complex protein engineering tasks and the seamless integration of additional functional instructions.

In conclusion, ProLLaMA represents a significant step forward for protein language models, emphasizing the value of multi-task capability and efficient resource usage. The work points to extensive potential for ProLLaMA in practical applications and highlights its contribution to bridging NLP techniques with scientific questions in proteomics.
