ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training (2403.07920v1)
Abstract: We propose ProtLLM, a versatile cross-modal LLM for both protein-centric and protein-language tasks. ProtLLM features a unique dynamic protein mounting mechanism, enabling it to handle complex inputs where the natural language text is interspersed with an arbitrary number of proteins. In addition, we propose the protein-as-word language modeling approach to train ProtLLM. By developing a specialized protein vocabulary, we equip the model with the capability to predict not just natural language but also proteins from a vast pool of candidates. Additionally, we construct a large-scale interleaved protein-text dataset, named InterPT, for pre-training. This dataset comprehensively encompasses both (1) structured data sources like protein annotations and (2) unstructured data sources like biological research papers, thereby endowing ProtLLM with crucial knowledge for understanding proteins. We evaluate ProtLLM on classic supervised protein-centric tasks and explore its novel protein-language applications. Experimental results demonstrate that ProtLLM not only achieves superior performance against protein-specialized baselines on protein-centric tasks but also induces zero-shot and in-context learning capabilities on protein-language tasks.
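To make the protein-as-word idea concrete, below is a minimal, hypothetical PyTorch sketch of an output head that scores natural-language tokens and protein candidates jointly, so the next prediction can be either a word or a protein from the protein vocabulary. The class name `ProteinAsWordHead`, the learnable protein-embedding table, and all dimensions are illustrative assumptions for exposition, not the authors' implementation; in the actual model the protein representations would be produced by a protein encoder rather than a free embedding table.

```python
# Hypothetical sketch of "protein-as-word" language modeling: a single softmax
# ranges over the concatenation of text-token logits and protein-candidate logits.
import torch
import torch.nn as nn


class ProteinAsWordHead(nn.Module):
    def __init__(self, hidden_size: int, text_vocab_size: int, num_proteins: int):
        super().__init__()
        # Standard text-token output projection.
        self.text_head = nn.Linear(hidden_size, text_vocab_size, bias=False)
        # Protein "vocabulary": one embedding per candidate protein
        # (assumed learnable here; in practice it could come from a protein encoder).
        self.protein_vocab = nn.Embedding(num_proteins, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from the LLM backbone.
        text_logits = self.text_head(hidden_states)                   # (B, T, |V_text|)
        protein_logits = hidden_states @ self.protein_vocab.weight.T  # (B, T, |V_prot|)
        # Concatenate so one cross-entropy loss covers words and proteins alike.
        return torch.cat([text_logits, protein_logits], dim=-1)


if __name__ == "__main__":
    head = ProteinAsWordHead(hidden_size=64, text_vocab_size=100, num_proteins=20)
    hidden = torch.randn(2, 8, 64)
    logits = head(hidden)  # shape: (2, 8, 120)
    targets = torch.randint(0, 120, (2 * 8,))
    loss = nn.CrossEntropyLoss()(logits.reshape(-1, 120), targets)
    print(logits.shape, loss.item())
```

The key design point the sketch illustrates is that protein prediction is cast as next-token prediction over an extended vocabulary, so the same training objective covers both text and protein targets.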
Authors: Le Zhuo, Zewen Chi, Minghao Xu, Heyan Huang, Heqi Zheng, Conghui He, Xian-Ling Mao, Wentao Zhang