ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training (2403.07920v1)

Published 28 Feb 2024 in q-bio.BM, cs.AI, cs.CL, and cs.LG

Abstract: We propose ProtLLM, a versatile cross-modal LLM for both protein-centric and protein-language tasks. ProtLLM features a unique dynamic protein mounting mechanism, enabling it to handle complex inputs where the natural language text is interspersed with an arbitrary number of proteins. Besides, we propose the protein-as-word language modeling approach to train ProtLLM. By developing a specialized protein vocabulary, we equip the model with the capability to predict not just natural language but also proteins from a vast pool of candidates. Additionally, we construct a large-scale interleaved protein-text dataset, named InterPT, for pre-training. This dataset comprehensively encompasses both (1) structured data sources like protein annotations and (2) unstructured data sources like biological research papers, thereby endowing ProtLLM with crucial knowledge for understanding proteins. We evaluate ProtLLM on classic supervised protein-centric tasks and explore its novel protein-language applications. Experimental results demonstrate that ProtLLM not only achieves superior performance against protein-specialized baselines on protein-centric tasks but also induces zero-shot and in-context learning capabilities on protein-language tasks.

An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training

This paper introduces an approach to bridging protein-centric and protein-language tasks through an interleaved protein-language LLM named ProtLLM. The work pairs the capabilities of LLMs with a novel pre-training strategy, protein-as-word language modeling, and presents a cross-modal architecture that accommodates interleaved inputs combining protein sequences and natural language.

Model and Pre-training Overview

ProtLLM integrates three primary components: a large autoregressive Transformer LLM, a dedicated protein encoder, and cross-modal connectors. A distinctive feature of the architecture is its dynamic protein mounting mechanism, which lets the model process text interspersed with an arbitrary number of proteins. The authors choose LLaMA-7B as the foundation model, while ProtST serves as the protein encoder, converting protein sequences into vector embeddings aligned with natural-language representations.
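
A minimal sketch of how the dynamic protein mounting step might look is given below (PyTorch-style; the module names, dimensions, and placeholder handling are illustrative assumptions, not the authors' released implementation). Protein-encoder outputs are projected by a connector into the LLM's embedding space and spliced into the token-embedding sequence wherever a protein placeholder occurs.

```python
# Illustrative sketch (not the authors' code): splice projected protein
# embeddings into the token-embedding sequence at placeholder positions.
import torch
import torch.nn as nn


class CrossModalConnector(nn.Module):
    """Maps protein-encoder outputs into the LLM's embedding space."""

    def __init__(self, protein_dim: int = 512, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(protein_dim, llm_dim)

    def forward(self, protein_embeddings: torch.Tensor) -> torch.Tensor:
        return self.proj(protein_embeddings)


def mount_proteins(token_embeds, protein_embeds, slot_positions, connector):
    """Replace placeholder-token embeddings with projected protein embeddings.

    token_embeds:   (seq_len, llm_dim) embeddings of the interleaved text
    protein_embeds: (n_proteins, protein_dim) protein-encoder outputs
    slot_positions: sequence indices where each protein is mounted
    """
    mounted = token_embeds.clone()
    projected = connector(protein_embeds)
    for i, pos in enumerate(slot_positions):
        mounted[pos] = projected[i]
    return mounted


if __name__ == "__main__":
    connector = CrossModalConnector()
    tokens = torch.randn(16, 4096)   # embeddings of a 16-token text span
    proteins = torch.randn(2, 512)   # two encoded protein sequences
    out = mount_proteins(tokens, proteins, [3, 10], connector)
    print(out.shape)                 # torch.Size([16, 4096])
```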

The core of their methodology, the protein-as-word language modeling approach, redefines the prediction task to treat proteins analogously to words. By constructing a protein vocabulary, the model predicts not only natural language tokens but also selects appropriate proteins from a pool of candidates based on context.
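
The idea can be illustrated with a short sketch in which the next-"word" distribution spans both the text vocabulary and the protein vocabulary; the vocabulary sizes and the dot-product scoring rule are assumptions for illustration, not the paper's exact formulation.

```python
# Illustrative protein-as-word prediction: logits over a joint vocabulary of
# text tokens and protein entries, trained with standard cross-entropy.
import torch
import torch.nn.functional as F

text_vocab_size = 32000
protein_vocab_size = 1000                 # assumed size of the candidate pool
llm_dim = 4096

hidden = torch.randn(1, llm_dim)                          # LLM hidden state at one position
text_head = torch.randn(text_vocab_size, llm_dim)         # output embeddings for text tokens
protein_table = torch.randn(protein_vocab_size, llm_dim)  # protein "word" embeddings

# Scores over the joint vocabulary: text tokens first, then protein entries.
logits = torch.cat([hidden @ text_head.T, hidden @ protein_table.T], dim=-1)

# A target index >= text_vocab_size denotes a protein from the candidate pool.
target = torch.tensor([text_vocab_size + 42])
loss = F.cross_entropy(logits, target)
print(logits.shape, loss.item())
```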

Dataset and Empirical Evaluation

A pivotal contribution of this work is the InterPT dataset, constructed for pre-training. It combines structured data such as protein annotations with unstructured sources such as biological research papers, enriching the model with biologically pertinent knowledge.
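
As a rough illustration, one interleaved training record might take a shape like the following; the field names and identifiers are hypothetical, not the released InterPT schema.

```python
# Hypothetical shape of one interleaved protein-text record (illustrative only).
record = {
    "text": "The enzyme <protein> catalyzes the decarboxylation of "
            "alpha-keto acids, while <protein> acts downstream in the pathway.",
    "proteins": [
        {"id": "P0XXXX", "sequence": "MKTAYIAKQR..."},  # placeholder ID, truncated sequence
        {"id": "Q9YYYY", "sequence": "MSDLLERAVQ..."},
    ],
    "source": "biological research paper",              # or a structured annotation entry
}
```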

ProtLLM's performance is evaluated against benchmarks in both protein-centric tasks and novel protein-language applications. For classic tasks such as enzyme commission (EC) number prediction, Gene Ontology (GO) term prediction, and protein-protein interaction (PPI) prediction, the model either matches or surpasses established baselines. Notably, it demonstrates an impressive in-context learning capability on PPI tasks, holding promise for applications that operate with limited labeled data.
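
To give a sense of how few-shot PPI prediction could be posed to such a model, the sketch below assembles an interleaved prompt from labeled demonstration pairs; the wording and placeholder syntax are assumptions rather than the paper's exact prompt format.

```python
# Illustrative in-context prompt for protein-protein interaction prediction.
# <protein_k> marks where a protein embedding would be mounted in the input.
demonstrations = [
    ("<protein_1>", "<protein_2>", "yes"),
    ("<protein_3>", "<protein_4>", "no"),
]
query = ("<protein_5>", "<protein_6>")

prompt_lines = []
for a, b, label in demonstrations:
    prompt_lines.append(f"Do {a} and {b} interact? Answer: {label}")
prompt_lines.append(f"Do {query[0]} and {query[1]} interact? Answer:")
prompt = "\n".join(prompt_lines)
print(prompt)
```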

Results and Implications

The experimental results underscore ProtLLM's capacity to surpass specialized protein representation models, particularly on GO Cellular Component prediction, where it achieves a significant uplift in key performance metrics. The model's design enables effective zero-shot and in-context learning capabilities, expanding the potential application scope considerably.

Practically, this framework could transform tasks like enzyme mining by using text-based function descriptions to retrieve relevant proteins, matching real-world scenarios where annotation data is sparse or absent. On the theoretical side, the results point toward representation-learning approaches that blend multimodal data for richer biological insight.
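
A minimal sketch of such text-driven retrieval is shown below, assuming candidate proteins are ranked with the same protein-vocabulary scores used during pre-training; the shapes and scoring rule are illustrative.

```python
# Sketch of enzyme mining from a text description: rank a pool of candidate
# proteins by how strongly the model would "predict" them next.
import torch

llm_dim = 4096
n_candidates = 500

# Hidden state of the LLM after reading a functional description of the target enzyme.
query_hidden = torch.randn(llm_dim)

# Protein-vocabulary embeddings for the candidate pool.
candidate_embeds = torch.randn(n_candidates, llm_dim)

scores = candidate_embeds @ query_hidden        # one score per candidate
top_scores, top_idx = torch.topk(scores, k=5)   # top-5 retrieved proteins
print(top_idx.tolist())
```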

Future Directions

This approach opens several avenues for further research. Building on sequence-level protein understanding, subsequent work could model higher-order protein structures and their interactions. Further refinement of the interleaved protein-text input mechanism and optimization of the training process could yield more efficient and capable models, giving researchers stronger tools for scientific discovery in molecular biology and bioinformatics.

This work showcases a promising step in the confluence of protein modeling and language processing, providing a template for future explorations in multimodal AI applications within scientific domains.

Authors (8)
  1. Le Zhuo (25 papers)
  2. Zewen Chi (29 papers)
  3. Minghao Xu (25 papers)
  4. Heyan Huang (107 papers)
  5. Heqi Zheng (3 papers)
  6. Conghui He (114 papers)
  7. Xian-Ling Mao (76 papers)
  8. Wentao Zhang (261 papers)