ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts (2301.12040v2)
Abstract: Current protein language models (PLMs) learn protein representations mainly from sequences, and thereby capture co-evolutionary information well, but they are unable to explicitly acquire protein functions, the end goal of protein representation learning. Fortunately, textual property descriptions are available for many proteins, and these descriptions also cover their various functions. Motivated by this fact, we first build the ProtDescribe dataset, which pairs protein sequences with text descriptions of their functions and other important properties. Based on this dataset, we propose the ProtST framework to enhance Protein Sequence pre-training and understanding by biomedical Texts. During pre-training, we design three types of tasks, i.e., unimodal mask prediction, multimodal representation alignment, and multimodal mask prediction, to inject protein property information of different granularities into a PLM while preserving its original representation power. On downstream tasks, ProtST supports both supervised learning and zero-shot prediction. We verify that ProtST-induced PLMs outperform previous ones on diverse representation learning benchmarks. Under the zero-shot setting, we show the effectiveness of ProtST on zero-shot protein classification, and ProtST also enables functional protein retrieval from a large-scale database without any function annotations.
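The abstract does not spell out the alignment objective, but "multimodal representation alignment" combined with zero-shot classification and retrieval strongly suggests a CLIP-style contrastive scheme. The sketch below is a minimal illustration of that reading, assuming paired sequence/text embeddings from the two encoders; the function names, shapes, and temperature value are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def alignment_loss(seq_emb: torch.Tensor, text_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of (sequence, description) pairs.

    seq_emb, text_emb: (B, D) embeddings from the protein and text encoders
    (hypothetical shapes; ProtST's exact projection heads are not given in
    the abstract). Matched pairs sit on the diagonal of the similarity matrix.
    """
    seq_emb = F.normalize(seq_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = seq_emb @ text_emb.t() / temperature            # (B, B)
    targets = torch.arange(seq_emb.size(0), device=seq_emb.device)
    # Contrast in both directions: sequence-to-text and text-to-sequence.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

@torch.no_grad()
def zero_shot_predict(seq_emb: torch.Tensor,
                      class_text_emb: torch.Tensor) -> torch.Tensor:
    """Zero-shot classification: embed a text description of each candidate
    class, then pick the class whose description is most similar to the
    protein's sequence embedding. No function annotations are needed.

    seq_emb: (D,) one protein; class_text_emb: (C, D) one row per class.
    """
    seq_emb = F.normalize(seq_emb, dim=-1)
    class_text_emb = F.normalize(class_text_emb, dim=-1)
    return (class_text_emb @ seq_emb).argmax()
```

Functional retrieval inverts the same scoring: embed a free-text query describing the desired function and rank every sequence embedding in the database by its similarity to that query.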
Authors: Minghao Xu, Xinyu Yuan, Santiago Miret, Jian Tang