Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective (2306.06615v2)
Abstract: Molecule discovery plays a crucial role in many scientific fields, advancing the design of tailored materials and drugs. However, most existing methods rely heavily on domain experts, incur excessive computational cost, or suffer from sub-optimal performance. Large language models (LLMs) such as ChatGPT, on the other hand, have shown remarkable performance on various cross-modal tasks thanks to their strong natural language understanding, generalization, and in-context learning (ICL) abilities, which offer unprecedented opportunities to advance molecule discovery. Although several previous works have tried to apply LLMs to this task, the lack of a domain-specific corpus and the difficulty of training specialized LLMs remain challenges. In this work, we propose MolReGPT, a novel LLM-based framework for molecule-caption translation, which introduces an In-Context Few-Shot Molecule Learning paradigm that lets LLMs like ChatGPT exploit their in-context learning capability without domain-specific pre-training or fine-tuning. MolReGPT leverages the principle of molecular similarity to retrieve similar molecules and their text descriptions from a local database, enabling LLMs to learn the task from context examples. We evaluate MolReGPT on both directions of molecule-caption translation: molecule understanding (captioning) and text-based molecule generation. Experimental results show that, without additional training, MolReGPT outperforms the fine-tuned MolT5-base and is comparable to MolT5-large. To the best of our knowledge, MolReGPT is the first work to leverage LLMs via in-context learning for molecule-caption translation and molecule discovery. Our work expands the scope of LLM applications and provides a new paradigm for molecule discovery and design.
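As a rough illustration of the retrieval step described in the abstract, the sketch below ranks molecules in a local (SMILES, caption) database by Morgan-fingerprint Tanimoto similarity to a query molecule and packs the top matches into a few-shot prompt for molecule captioning. The `local_db` list, the prompt wording, and the placeholder `chat()` call are illustrative assumptions rather than MolReGPT's exact implementation; the caption-to-molecule direction can analogously retrieve examples by text similarity over captions (e.g., BM25), and Dice similarity is another common fingerprint metric.

```python
# Minimal sketch of retrieval-based in-context few-shot prompting for
# molecule captioning. Assumes a local list `local_db` of valid
# (SMILES, caption) pairs and a placeholder chat() wrapper around an
# LLM API such as ChatGPT; both are illustrative, not the paper's code.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem


def morgan_fp(smiles, radius=2, n_bits=2048):
    """Morgan (ECFP-like) bit-vector fingerprint for a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)  # assumes the SMILES is valid
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)


def retrieve_examples(query_smiles, local_db, k=10):
    """Return the k database entries most similar to the query molecule."""
    query_fp = morgan_fp(query_smiles)
    scored = [
        (DataStructs.TanimotoSimilarity(query_fp, morgan_fp(smi)), smi, cap)
        for smi, cap in local_db
    ]
    scored.sort(reverse=True)
    return [(smi, cap) for _, smi, cap in scored[:k]]


def build_prompt(query_smiles, examples):
    """Assemble an in-context few-shot prompt from retrieved pairs."""
    shots = "\n\n".join(
        f"Molecule SMILES: {smi}\nMolecule caption: {cap}"
        for smi, cap in examples
    )
    return (
        "You are an expert chemist. Given example molecules and their "
        "captions, describe the final molecule.\n\n"
        f"{shots}\n\nMolecule SMILES: {query_smiles}\nMolecule caption:"
    )


# Usage (illustrative):
#   examples = retrieve_examples(query_smiles, local_db, k=10)
#   caption = chat(build_prompt(query_smiles, examples))  # chat() wraps the LLM API
```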
- Aizawa, A. An information-theoretic perspective of tf–idf measures. Information Processing & Management, 39(1):45–65, 2003.
- A new cyano (-CN) free molecular design perspective for constructing carbazole-thiophene based environmental friendly organic solar cells. Physica B: Condensed Matter, pp. 414630, 2023.
- Guidelines for recurrent neural network transfer learning-based molecular generation of focused libraries. Journal of Chemical Information and Modeling, 60(12):5699–5713, 2020.
- Anderson, A. C. The process of structure-based drug design. Chemistry & biology, 10(9):787–797, 2003.
- Randomized smiles strings improve the quality of molecular generative models. Journal of cheminformatics, 11(1):1–13, 2019.
- Molgpt: molecular generation using a transformer-decoder model. Journal of Chemical Information and Modeling, 62(9):2064–2076, 2021.
- Tallrec: An effective and efficient tuning framework to align large language model with recommendation. arXiv preprint arXiv:2305.00447, 2023.
- Chemcrow: Augmenting large-language models with chemistry tools. arXiv preprint arXiv:2304.05376, 2023.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Butina, D. Unsupervised data base clustering based on daylight’s fingerprint and tanimoto similarity: A fast and automated way to cluster small and large data sets. Journal of Chemical Information and Computer Sciences, 39(4):747–750, 1999.
- Identifying the kind behind smiles—anatomical therapeutic chemical classification using structure-only representations. Briefings in Bioinformatics, 23(5):bbac346, 2022.
- Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
- Chemberta: Large-scale self-supervised pretraining for molecular property prediction. arXiv preprint arXiv:2010.09885, 2020.
- Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
- Application of deep metric learning to molecular graph similarity. Journal of Cheminformatics, 14(1):1–12, 2022.
- The high-throughput highway to computational materials design. Nature materials, 12(3):191–201, 2013.
- Predictive molecular design and structure–property validation of novel terpene-based, sustainably sourced bacterial biofilm-resistant materials. Biomacromolecules, 2023.
- Syntax-directed variational autoencoder for structured data. arXiv preprint arXiv:1802.08786, 2018.
- Dice, L. R. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
- Selective photoredox trifluoromethylation of tryptophan-containing peptides. European Journal of Organic Chemistry, 2019(46):7596–7605, 2019.
- Molgensurvey: A systematic survey in machine learning models for molecule design. arXiv preprint arXiv:2203.14500, 2022.
- Text2mol: Cross-modal molecule retrieval with natural language queries. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 595–607, 2021.
- Translation between molecules and natural language. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 375–413, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.26.
- Generative diffusion models on graphs: Methods and applications. arXiv preprint arXiv:2302.02591, 2023.
- Neural scaling of deep chemical models. ChemRxiv, 2022.
- Automatic chemical design using a data-driven continuous representation of molecules. ACS central science, 4(2):268–276, 2018.
- Bidirectional molecule generation with recurrent neural networks. Journal of chemical information and modeling, 60(3):1175–1183, 2020.
- Protein structure-based in-silico approaches to drug discovery: Guide to covid-19 therapeutics. Molecular Aspects of Medicine, 91:101151, 2023.
- A decade of fragment-based drug design: strategic advances and lessons learned. Nature reviews Drug discovery, 6(3):211–219, 2007.
- Material design for next-generation mrna vaccines using lipid nanoparticles. Polymer Reviews, 63(2):394–436, 2023.
- Smiles transformer: Pre-trained molecular fingerprint for low data drug discovery. arXiv preprint arXiv:1911.04738, 2019.
- Deep learning methods for small molecule drug discovery: A survey. IEEE Transactions on Artificial Intelligence, 2023.
- Grammar variational autoencoder. In International conference on machine learning, pp. 1945–1954. PMLR, 2017.
- imotor-cnn: Identifying molecular functions of cytoskeleton motor proteins using 2d convolutional neural network via chou’s 5-step rule. Analytical biochemistry, 575:17–26, 2019.
- Chatgpt: A meta-analysis after 2.5 months. arXiv preprint arXiv:2302.13795, 2023.
- What makes good in-context examples for gpt-3? arXiv preprint arXiv:2101.06804, 2021.
- Multi-modal molecule structure-text model for text-based retrieval and editing. arXiv preprint arXiv:2212.10789, 2022.
- Autonomous chemistry enabling environment-adaptive electrochemical energy storage devices. CCS Chemistry, 5(1):11–29, 2023.
- Metaicl: Learning to learn in context. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2791–2809, 2022.
- Covid-19 vaccines: Computational tools and development. Informatics in Medicine Unlocked, pp. 101164, 2023.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Bioisosterism: a rational approach in drug design. Chemical reviews, 96(8):3147–3176, 1996.
- Convolutional neural networks for the design and analysis of non-fullerene acceptors. Journal of Chemical Information and Modeling, 59(12):4993–5001, 2019.
- Improving language understanding by generative pre-training. OpenAI, 2018.
- Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
- The probabilistic relevance framework: Bm25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389, 2009.
- Learning to retrieve prompts for in-context learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2655–2671, 2022.
- A molecular multimodal foundation model associating molecule graphs with natural language. arXiv preprint arXiv:2209.05481, 2022.
- Thompson, K. Programming techniques: Regular expression search algorithm. Communications of the ACM, 11(6):419–422, 1968.
- Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.
- The commoditization of ai for molecule design. Artificial Intelligence in the Life Sciences, 2:100031, 2022.
- Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Multi-constraint molecular generation based on conditional transformer, knowledge distillation and reinforcement learning. Nature Machine Intelligence, 3(10):914–922, 2021.
- Improving chemical similarity ensemble approach in target prediction. Journal of cheminformatics, 8:1–10, 2016.
- Advances in molecular design and photophysical engineering of perylene bisimide-containing polyads and multichromophores for film-based fluorescent sensors. The Journal of Physical Chemistry B, 127(4):828–837, 2023.
- Weininger, D. Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. Journal of chemical information and computer sciences, 28(1):31–36, 1988.
- Late-stage photoredox c–h amidation of n-unprotected indole derivatives: Access to n-(indol-2-yl) amides. Organic Letters, 23(7):2710–2714, 2021.
- White, A. D. The future of chemistry is language. Nature Reviews Chemistry, pp. 1–2, 2023.
- Plasma etching effect on the molecular structure of chitosan-based hydrogels and its biological properties. International Journal of Biological Macromolecules, pp. 123257, 2023.
- Difficulty in learning chirality for transformer fed with smiles. arXiv preprint arXiv:2303.11593, 2023.
- A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals. Nature communications, 13(1):862, 2022.
- Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022.
- Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
Authors: Jiatong Li, Yunqing Liu, Wenqi Fan, Xiao-Yong Wei, Hui Liu, Jiliang Tang, Qing Li