Interactive Molecular Discovery with Natural Language (2306.11976v1)
Abstract: Natural language is expected to be a key medium for human-machine interaction in the era of LLMs. In the biochemistry field, a series of tasks around molecules (e.g., property prediction, molecule mining) are of great significance yet have a high technical threshold. Bridging molecule expressions in natural language and chemical language can not only greatly improve the interpretability of these tasks and lower their barrier to entry, but also fuse the chemical knowledge scattered across complementary materials for a deeper comprehension of molecules. Motivated by these benefits, we propose conversational molecular design, a novel task that adopts natural language for describing and editing target molecules. To accomplish this task, we design ChatMol, a knowledgeable and versatile generative pre-trained model enhanced by injecting experimental property information, molecular spatial knowledge, and the associations between natural and chemical languages. Several typical solutions, including LLMs (e.g., ChatGPT), are evaluated, demonstrating the challenge of conversational molecular design and the effectiveness of our knowledge enhancement method. Case observations and analyses provide directions for further exploration of natural-language interaction in molecular discovery.
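The conversational molecular design task described above can be pictured as a multi-turn loop in which each natural-language instruction refines the current candidate molecule (represented as a SMILES string). The sketch below is purely illustrative: `edit_molecule` and `conversation` are hypothetical names, and the rule-based edits are toy placeholders standing in for ChatMol's generative model, not the paper's actual method.

```python
# Toy sketch of a conversational molecular design loop.
# The molecule is carried across turns as a SMILES string; each
# natural-language instruction produces an updated candidate.
# The edit rules are naive string appends, NOT chemically valid
# transformations or ChatMol's model.

def edit_molecule(smiles: str, instruction: str) -> str:
    """Apply a toy, rule-based 'edit' standing in for a generative model."""
    text = instruction.lower()
    if "hydroxyl" in text:
        return smiles + "O"   # naive: append an oxygen atom
    if "methyl" in text:
        return smiles + "C"   # naive: append a carbon atom
    return smiles             # unknown instruction: leave unchanged

def conversation(start: str, turns: list[str]) -> str:
    """Run a multi-turn dialogue, threading the molecule through each edit."""
    molecule = start
    for turn in turns:
        molecule = edit_molecule(molecule, turn)
    return molecule

result = conversation("c1ccccc1", ["Add a methyl group", "Add a hydroxyl group"])
print(result)  # c1ccccc1CO
```

In the real task, the update step is a knowledge-enhanced generative model conditioned on the full dialogue history rather than keyword rules, but the turn-by-turn state-carrying structure is the same.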
Authors: Zheni Zeng, Bangchen Yin, Shipeng Wang, Jiarui Liu, Cheng Yang, Haishen Yao, Xingzhi Sun, Maosong Sun, Guotong Xie, Zhiyuan Liu