Can Large Language Models Understand Molecules? (2402.00024v3)
Abstract: Purpose: Large language models (LLMs) such as GPT (Generative Pre-trained Transformer) from OpenAI and LLaMA (Large Language Model Meta AI) from Meta AI are increasingly recognized for their potential in cheminformatics, particularly in understanding the Simplified Molecular Input Line Entry System (SMILES), a standard notation for representing chemical structures. These LLMs can also encode SMILES strings into vector representations. Method: We compare GPT and LLaMA against models pre-trained on SMILES at embedding SMILES strings for downstream tasks, focusing on two key applications: molecular property prediction and drug-drug interaction (DDI) prediction. Results: We find that SMILES embeddings generated with LLaMA outperform those from GPT on both molecular property and DDI prediction tasks. Notably, LLaMA-based SMILES embeddings yield results comparable to SMILES-pre-trained models on molecular property prediction and outperform those models on DDI prediction. Conclusion: The performance of LLMs in generating SMILES embeddings shows great potential and warrants further investigation of these models for molecular embedding. We hope our study bridges the gap between LLMs and molecular embedding, motivating additional research into the potential of LLMs in molecular representation. GitHub: https://github.com/sshaghayeghs/LLaMA-VS-GPT
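The pipeline described in the abstract can be sketched as: embed each SMILES string with an LLM, then train a lightweight classifier on those embeddings for a downstream task. This is a minimal, hedged sketch only; the actual study extracts embeddings from GPT and LLaMA, whereas `embed_smiles` below is a hypothetical stand-in (character-trigram hashing) so the example runs without model weights or network access, and the toy labels are illustrative, not real assay data.

```python
# Sketch of an embed-then-classify pipeline for SMILES strings.
# NOTE: embed_smiles is a stand-in for an LLM encoder (e.g. LLaMA hidden
# states or an embeddings API); it hashes character trigrams into a fixed
# vector purely so the example is self-contained.
import hashlib
import math

DIM = 256  # toy embedding size; real LLM embeddings are far larger (e.g. 4096)

def embed_smiles(smiles: str) -> list:
    """Stand-in encoder: hash character trigrams into a unit-norm vector."""
    vec = [0.0] * DIM
    for i in range(len(smiles) - 2):
        h = int(hashlib.md5(smiles[i:i + 3].encode()).hexdigest(), 16)
        vec[h % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def nearest_centroid_fit(X, y):
    """Tiny downstream classifier: one mean vector (centroid) per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def nearest_centroid_predict(centroids, x):
    """Predict the class whose centroid has the largest dot product with x."""
    return max(centroids, key=lambda lab: sum(a * b for a, b in zip(centroids[lab], x)))

# Toy "molecular property" task: aliphatic (1) vs. aromatic (0) molecules.
train = [("CCO", 1), ("CCN", 1), ("c1ccccc1", 0), ("c1ccccc1O", 0)]
X = [embed_smiles(s) for s, _ in train]
y = [lab for _, lab in train]
model = nearest_centroid_fit(X, y)
pred = nearest_centroid_predict(model, embed_smiles("CCCO"))  # unseen molecule
print(pred)
```

In the paper's setting, the hashing encoder would be replaced by the LLM's embedding of the SMILES string, and the centroid classifier by a stronger supervised model trained on benchmark labels.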
Authors: Shaghayegh Sadeghi, Alan Bui, Ali Forooghi, Jianguo Lu, Alioune Ngom