ESM All-Atom: Multi-scale Protein Language Model for Unified Molecular Modeling (2403.12995v4)
Abstract: Protein language models have demonstrated significant potential in the field of protein engineering. However, current protein language models primarily operate at the residue scale, which prevents them from providing information at the atom level. This limitation keeps us from fully exploiting their capabilities in applications involving both proteins and small molecules. In this paper, we propose ESM-AA (ESM All-Atom), a novel approach that enables unified molecular modeling at both the atom scale and the residue scale. ESM-AA achieves this by pre-training on multi-scale code-switch protein sequences and by using a multi-scale position encoding to capture relationships among residues and atoms. Experimental results indicate that ESM-AA surpasses previous methods on protein-molecule tasks, demonstrating fuller utilization of protein language models. Further investigation reveals that, through unified molecular modeling, ESM-AA not only gains molecular knowledge but also retains its understanding of proteins. The source code of ESM-AA is publicly released at https://github.com/zhengkangjie/ESM-AA.
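The "multi-scale code-switch" idea from the abstract — randomly unzipping some residues in a protein sequence into their constituent atom tokens, so one sequence mixes residue-scale and atom-scale vocabulary — can be illustrated with a minimal sketch. All names below (`RESIDUE_ATOMS`, `code_switch`, the token format, the atom lists) are illustrative assumptions, not the paper's exact tokenization scheme:

```python
import random

# Hypothetical atom compositions (backbone + side-chain heavy atoms);
# illustrative only, not ESM-AA's actual atom vocabulary.
RESIDUE_ATOMS = {
    "A": ["N", "CA", "C", "O", "CB"],  # alanine heavy atoms
    "G": ["N", "CA", "C", "O"],        # glycine heavy atoms
}

def code_switch(residues, unzip_prob=0.3, rng=None):
    """Replace each residue with its atom tokens with probability unzip_prob,
    producing a mixed residue-scale / atom-scale token sequence."""
    rng = rng or random.Random(0)  # seeded for reproducibility of the sketch
    tokens = []
    for res in residues:
        if res in RESIDUE_ATOMS and rng.random() < unzip_prob:
            # atom scale: the residue is "unzipped" into its atoms
            tokens.extend(f"{res}:{a}" for a in RESIDUE_ATOMS[res])
        else:
            # residue scale: the residue stays a single token
            tokens.append(res)
    return tokens

print(code_switch(list("GAGA"), unzip_prob=0.5))
```

A companion multi-scale position encoding would then assign each atom token both its parent residue's index and an intra-residue offset, so relationships among residues and among atoms are captured at their respective scales.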