ProtChatGPT: Towards Understanding Proteins with Large Language Models (2402.09649v1)

Published 15 Feb 2024 in cs.CE, cs.AI, and q-bio.BM

Abstract: Protein research is crucial to many fundamental disciplines, but understanding proteins' intricate structure-function relationships remains challenging. Recent LLMs have made significant strides in comprehending task-specific knowledge, suggesting the potential for ChatGPT-like systems specialized in proteins to facilitate basic research. In this work, we introduce ProtChatGPT, which aims at learning and understanding protein structures via natural language. ProtChatGPT enables users to upload proteins, ask questions, and engage in interactive conversations to produce comprehensive answers. The system comprises protein encoders, a Protein-Language Pre-training Transformer (PLP-former), a projection adapter, and an LLM. An uploaded protein first passes through the protein encoders and the PLP-former to produce protein embeddings, which the adapter then projects to conform with the LLM. The LLM finally combines the user's question with the projected embeddings to generate informative answers. Experiments show that ProtChatGPT can produce promising responses to questions about proteins. We hope that ProtChatGPT can form the basis for further exploration and application in protein research. Code and our pre-trained model will be publicly available.
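The abstract describes a staged pipeline: protein encoders and the PLP-former produce protein embeddings, a projection adapter maps them into the LLM's input-embedding space, and the LLM answers the user's question conditioned on those projected embeddings. The sketch below illustrates that data flow only; the class name ProjectionAdapter, the single linear projection, the tensor dimensions, and the placeholder LLM are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class ProjectionAdapter(nn.Module):
    """Maps protein embeddings into the LLM's input-embedding space.
    A single linear layer is an assumption; the paper's adapter may differ."""

    def __init__(self, protein_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(protein_dim, llm_dim)

    def forward(self, protein_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(protein_emb)


def answer_question(protein_emb, question_emb, llm, adapter):
    """Prepend projected protein tokens to the question's token embeddings
    and let the (frozen) LLM generate from the combined sequence."""
    projected = adapter(protein_emb)                      # (B, P, llm_dim)
    llm_inputs = torch.cat([projected, question_emb], 1)  # (B, P+Q, llm_dim)
    return llm(inputs_embeds=llm_inputs)


if __name__ == "__main__":
    # Toy usage: random tensors stand in for real encoder / tokenizer outputs.
    B, P, Q = 1, 32, 16                     # batch, protein tokens, question tokens
    protein_dim, llm_dim = 512, 4096        # illustrative sizes only
    adapter = ProjectionAdapter(protein_dim, llm_dim)
    protein_emb = torch.randn(B, P, protein_dim)   # stand-in for PLP-former output
    question_emb = torch.randn(B, Q, llm_dim)      # stand-in for embedded question
    dummy_llm = lambda inputs_embeds: inputs_embeds.mean()  # placeholder for a frozen LLM
    print(answer_question(protein_emb, question_emb, dummy_llm, adapter))
```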

Authors (4)
  1. Chao Wang (555 papers)
  2. Ruijie Quan (17 papers)
  3. Yi Yang (855 papers)
  4. Hehe Fan (46 papers)
Citations (8)