Overview of ProLLaMA: A Multi-Task Protein LLM
The paper introduces ProLLaMA, a protein large language model (ProLLM) designed for multi-task protein language processing. Unlike traditional ProLLMs, which typically focus on a single task such as de novo protein sequence generation, ProLLaMA addresses a broader spectrum of tasks via a training framework that extends a general-purpose LLM's capabilities to protein sequences. This approach carries advances from natural-language LLMs into the protein language domain, addressing two inherent limitations of existing ProLLMs: the lack of multi-task capability and an insufficient understanding of natural language instructions.
Key Contributions and Results
The paper outlines the architectural and methodological advancements in the ProLLaMA model, which are crucial for its multi-task proficiency:
- Training Framework: The authors propose a two-stage framework for transforming a general LLM into a ProLLM: continual learning on protein language data, followed by instruction tuning. Notably, the training strategy employs Low-Rank Adaptation (LoRA), which keeps training efficient and scalable by updating only small low-rank adapter matrices rather than the full model (a training sketch follows this list).
- Multi-Task Capability: ProLLaMA handles multiple protein-related tasks, including unconditional and controllable protein sequence generation as well as protein property prediction. It achieves state-of-the-art performance on these tasks, handling complex queries and generating proteins with desired functionalities from user instructions (an example prompt follows this list).
- Numerical Performance: ProLLaMA posts strong quantitative results. Its generated sequences score highly on pLDDT and TM-score, outperforming existing ProLLMs on these measures of structural plausibility and similarity to known protein structures (an evaluation sketch follows this list). In property prediction, it achieves near-perfect accuracy across multiple protein superfamily categories.
- Natural Language Integration: Because it retains its natural language capabilities, ProLLaMA can carry out instruction-driven tasks that current ProLLMs cannot, bridging the NLP and protein language domains and extending its applicability through natural language instructions.
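As a concrete illustration of the two-stage recipe described above, here is a minimal sketch using the Hugging Face peft library. The base model name, LoRA hyperparameters, and the commented-out training calls are assumptions for demonstration, not the paper's exact configuration.

```python
# Sketch of a two-stage LoRA pipeline: continual learning on protein
# sequences, then instruction tuning. Hyperparameters are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

base_name = "meta-llama/Llama-2-7b-hf"  # assumed general-purpose base LLM
model = AutoModelForCausalLM.from_pretrained(base_name)
tokenizer = AutoTokenizer.from_pretrained(base_name)

# Stage 1: continual learning on raw protein sequences. Only the low-rank
# adapter matrices are trained; the base weights stay frozen.
stage1 = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, stage1)
# train(model, protein_sequence_corpus)  # standard causal-LM objective (placeholder)

# Stage 2: instruction tuning on (instruction, output) pairs. Merge the
# stage-1 adapters into the base weights, then attach a fresh adapter.
model = model.merge_and_unload()
stage2 = LoraConfig(task_type=TaskType.CAUSAL_LM, r=16, lora_alpha=32)
model = get_peft_model(model, stage2)
# train(model, instruction_dataset)  # supervised fine-tuning (placeholder)
```

Because only the adapters receive gradients, each stage touches a small fraction of the parameters, which is what makes the framework cheap enough to extend with new tasks later.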
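Once trained, the multi-task and natural language capabilities reduce to ordinary causal-LM prompting. The checkpoint path, prompt template, and superfamily name below are hypothetical illustrations; ProLLaMA's actual instruction format may differ.

```python
# Hypothetical instruction-driven generation with a trained ProLLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "path/to/prollama-checkpoint"  # hypothetical local checkpoint
model = AutoModelForCausalLM.from_pretrained(ckpt)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

# Controllable generation: ask for a sequence from a given superfamily.
prompt = "Instruction: Generate a protein sequence from the SAM-MTase superfamily.\nOutput:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    ids = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.9)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```

Swapping the instruction (e.g. asking for the superfamily of a given sequence) turns the same interface into a property-prediction query, which is the sense in which one model covers multiple tasks.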
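For the structural metrics, a common evaluation pattern is to fold each generated sequence and average the per-residue pLDDT, which folding tools record in the PDB B-factor column. The sketch below assumes the fair-esm package and a placeholder sequence; it illustrates how such a score is obtained, not the paper's exact pipeline.

```python
# Fold a generated sequence with ESMFold and compute its mean pLDDT.
import esm

folder = esm.pretrained.esmfold_v1().eval()
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # placeholder generated sequence
pdb_str = folder.infer_pdb(sequence)

# ESMFold writes per-residue pLDDT into the B-factor column (cols 61-66);
# average over the C-alpha atoms to get a single score for the sequence.
plddts = [
    float(line[60:66])
    for line in pdb_str.splitlines()
    if line.startswith("ATOM") and line[12:16].strip() == "CA"
]
print(f"mean pLDDT: {sum(plddts) / len(plddts):.2f}")
```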
Implications and Future Directions
ProLLaMA has significant implications for computational biology and biotechnology, aligning with contemporary needs in drug discovery and synthetic biology. Its functionality lets researchers pursue protein engineering with higher precision, driven by natural language instructions. Its ability to take on additional tasks through the same scalable training framework suggests a compelling avenue for further research, potentially broadening the role of AI models in protein science and accelerating biotechnological advances.
Moreover, this research underscores the importance of interdisciplinary strategies in advancing domain-specific LLMs. The methodology sets a precedent for future AI development in scientific domains, where models are expected to handle diverse tasks seamlessly. Future work might further refine ProLLaMA's natural language understanding, enabling more complex protein engineering tasks and the seamless integration of additional functional instructions.
In conclusion, ProLLaMA represents a significant step forward for protein LLMs, demonstrating the value of multi-task capability and efficient resource usage. The work suggests extensive potential for practical applications and highlights ProLLaMA's role in bridging NLP techniques with scientific inquiry in proteomics.