
Endowing Protein Language Models with Structural Knowledge (2401.14819v1)

Published 26 Jan 2024 in q-bio.QM, cs.LG, and q-bio.BM

Abstract: Understanding the relationships between protein sequence, structure and function is a long-standing biological challenge with manifold implications from drug design to our understanding of evolution. Recently, protein language models have emerged as the preferred method for this challenge, thanks to their ability to harness large sequence databases. Yet, their reliance on expansive sequence data and parameter sets limits their flexibility and practicality in real-world scenarios. Concurrently, the recent surge in computationally predicted protein structures unlocks new opportunities in protein representation learning. While promising, the computational burden carried by such complex data still hinders widely-adopted practical applications. To address these limitations, we introduce a novel framework that enhances protein language models by integrating protein structural data. Drawing from recent advances in graph transformers, our approach refines the self-attention mechanisms of pretrained language transformers by integrating structural information with structure extractor modules. This refined model, termed Protein Structure Transformer (PST), is further pretrained on a small protein structure database, using the same masked language modeling objective as traditional protein language models. Empirical evaluations of PST demonstrate its superior parameter efficiency relative to protein language models, despite being pretrained on a dataset comprising only 542K structures. Notably, PST consistently outperforms the state-of-the-art foundation model for protein sequences, ESM-2, setting a new benchmark in protein function prediction. Our findings underscore the potential of integrating structural information into protein language models, paving the way for more effective and efficient protein modeling. Code and pretrained models are available at https://github.com/BorgwardtLab/PST.
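
The architectural idea in the abstract can be made concrete with a small sketch. The PyTorch code below is a minimal illustration, not the authors' implementation: it assumes a one-hop mean aggregation over an 8 Å residue contact graph as the "structure extractor" and adds its output to the queries and keys of a standard self-attention layer, which conveys the general flavor of refining a pretrained transformer's self-attention with structural information. All class and variable names are hypothetical; the actual PST code is in the linked repository.

```python
# Minimal sketch (not the authors' implementation) of the idea described in the abstract:
# a graph-based "structure extractor" refines the self-attention of a protein language
# transformer. The one-hop mean aggregation and the 8 Å contact-graph construction are
# illustrative assumptions, not the PST architecture itself.
import torch
import torch.nn as nn


class SimpleStructureExtractor(nn.Module):
    """Aggregates residue features over a contact graph (one-hop mean) and projects them."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_residues, dim); adj: (batch, n_residues, n_residues) contact matrix
        deg = adj.sum(-1, keepdim=True).clamp(min=1.0)  # guard against isolated nodes
        neigh = adj @ x / deg                           # mean over structural neighbors
        return self.proj(neigh)


class StructureAwareSelfAttention(nn.Module):
    """Self-attention whose queries and keys are refined with structure-extractor features."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.extractor = SimpleStructureExtractor(dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        s = self.extractor(x, adj)     # structural context per residue
        qk = x + s                     # structure influences the attention pattern
        out, _ = self.attn(qk, qk, x)  # values remain sequence-derived
        return out


if __name__ == "__main__":
    batch, n_res, dim = 2, 64, 128
    x = torch.randn(batch, n_res, dim)        # residue embeddings from a language model
    coords = torch.randn(batch, n_res, 3)     # CA coordinates (random placeholders here)
    adj = (torch.cdist(coords, coords) < 8.0).float()  # 8 Å contact graph
    layer = StructureAwareSelfAttention(dim, n_heads=8)
    print(layer(x, adj).shape)                # torch.Size([2, 64, 128])
```

In this sketch, structure only modulates the attention pattern while the values stay sequence-derived, one plausible way to preserve the pretrained representations; the paper's actual design choices may differ.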

