Text-Guided Molecule Generation with Diffusion Language Model (2402.13040v1)
Abstract: Text-guided molecule generation is a task in which molecules are generated to match specific textual descriptions. Most existing SMILES-based molecule generation methods rely on an autoregressive architecture. In this work, we propose Text-Guided Molecule Generation with Diffusion Language Model (TGM-DLM), a novel approach that leverages diffusion models to address the limitations of autoregressive methods. TGM-DLM updates token embeddings within the SMILES string collectively and iteratively, using a two-phase diffusion generation process. The first phase optimizes embeddings from random noise under the guidance of the text description, while the second phase corrects invalid SMILES strings to form valid molecular representations. We demonstrate that TGM-DLM outperforms MolT5-Base, an autoregressive model, without the need for additional data resources. Our findings underscore the remarkable effectiveness of TGM-DLM in generating coherent and precise molecules with specific properties, opening new avenues in drug discovery and related scientific domains. Code will be released at: https://github.com/Deno-V/tgm-dlm.
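The two-phase generation process described in the abstract can be pictured in code. The snippet below is a minimal, PyTorch-style sketch only, not the authors' implementation: the `denoiser`, `text_emb`, and `rounder` callables, the step counts, and the re-noising scale applied before the correction phase are assumptions made for exposition; the paper defines its own noise schedule, text-guidance mechanism, and rounding step.

```python
# Minimal sketch of the two-phase diffusion generation described in the abstract.
# Assumed interfaces (not from the paper's code): `denoiser(x, t, cond)` wraps one
# reverse-diffusion step on the token-embedding sequence, and `rounder(x)` maps
# embeddings back to discrete SMILES tokens (e.g., nearest-neighbour rounding).
import torch

@torch.no_grad()
def two_phase_generation(denoiser, text_emb, rounder,
                         seq_len=256, emb_dim=32,
                         guided_steps=1000, correction_steps=100,
                         renoise_scale=0.1):
    # Start from pure Gaussian noise over the whole SMILES embedding sequence.
    x = torch.randn(1, seq_len, emb_dim)

    # Phase 1: text-guided denoising from t = T down to 1; all token embeddings
    # are refined collectively at each step, conditioned on the description.
    for t in reversed(range(1, guided_steps + 1)):
        x = denoiser(x, torch.full((1,), t), cond=text_emb)

    # Phase 2: correction pass. Perturb the result slightly and denoise again
    # without text guidance so the model can repair syntactically invalid SMILES.
    x = x + renoise_scale * torch.randn_like(x)  # renoise_scale is an assumption
    for t in reversed(range(1, correction_steps + 1)):
        x = denoiser(x, torch.full((1,), t), cond=None)

    # Map the final continuous embeddings back to a discrete SMILES string.
    return rounder(x)
```

The design point mirrored here is that, unlike autoregressive decoding, both phases operate on the full token-embedding sequence at once, which is what allows the second pass to revisit and correct earlier tokens.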
References:
- MolGPT: molecular generation using a transformer-decoder model. Journal of Chemical Information and Modeling, 62(9): 2064–2076.
- SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3615–3620. Hong Kong, China: Association for Computational Linguistics.
- Sparks of artificial general intelligence: Early experiments with gpt-4. ArXiv preprint, abs/2303.12712.
- Uncovering Neural Scaling Laws in Molecular Representation Learning. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
- Syntax-directed variational autoencoder for molecule generation. In Proceedings of the international conference on learning representations.
- Translation between molecules and natural language. ArXiv preprint, abs/2204.11817.
- Text2Mol: Cross-Modal Molecule Retrieval with Natural Language Queries. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 595–607. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics.
- Molecular docking and structure-based drug design strategies. Molecules, 20(7): 13384–13421.
- GPT-3: Its nature, scope, limits, and consequences. Minds and Machines, 30: 681–694.
- Neural scaling of deep chemical models.
- Gage, P. 1994. A new algorithm for data compression. C Users Journal, 12(2): 23–38.
- Automatic chemical design using a data-driven continuous representation of molecules. ACS central science, 4(2): 268–276.
- DiffuSeq: Sequence to sequence text generation with diffusion models. ArXiv preprint, abs/2210.08933.
- Bidirectional molecule generation with recurrent neural networks. Journal of chemical information and modeling, 60(3): 1175–1183.
- DecompDiff: Diffusion Models with Decomposed Priors for Structure-Based Drug Design.
- DiffusionBERT: Improving generative masked language models with diffusion models. ArXiv preprint, abs/2211.15029.
- Denoising Diffusion Probabilistic Models. In Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.; and Lin, H., eds., Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
- Autoregressive diffusion models. ArXiv preprint, abs/2110.02037.
- Argmax flows and multinomial diffusion: Towards non-autoregressive language models. ArXiv preprint, abs/2102.05379.
- Chemformer: a pre-trained transformer for computational chemistry. Machine Learning: Science and Technology, 3(1): 015022.
- Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
- DiffWave: A Versatile Diffusion Model for Audio Synthesis. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
- BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7871–7880. Online: Association for Computational Linguistics.
- Diffusion-LM improves controllable text generation. Advances in Neural Information Processing Systems, 35: 4328–4343.
- Multi-modal molecule structure-text model for text-based retrieval and editing. ArXiv preprint, abs/2212.10789.
- A high-throughput framework for determining adsorption energies on solid surfaces. npj Computational Materials, 3(1): 14.
- Artificial intelligence in drug discovery and development. Drug discovery today, 26(1): 80.
- Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1): 5485–5551.
- Graph neural networks for materials science and chemistry. Communications Materials, 3(1): 93.
- High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 10684–10695.
- Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS central science, 4(1): 120–131.
- ZINC 15–ligand discovery for everyone. Journal of chemical information and modeling, 55(11): 2324–2337.
- Attention is All you Need. In Guyon, I.; von Luxburg, U.; Bengio, S.; Wallach, H. M.; Fergus, R.; Vishwanathan, S. V. N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 5998–6008.
- Weininger, D. 1988. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of chemical information and computer sciences, 28(1): 31–36.
- SMILES. 2. Algorithm for generation of unique SMILES notation. Journal of chemical information and computer sciences, 29(2): 97–101.
- Diffusion models: A comprehensive survey of methods and applications. ArXiv preprint, abs/2209.00796.
- Molecular design of benzodithiophene-based organic photovoltaic materials. Chemical reviews, 116(12): 7397–7457.
- Graph Convolutional Policy Network for Goal-Directed Molecular Graph Generation. In Bengio, S.; Wallach, H. M.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, 6412–6422.
- A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals. Nature communications, 13(1): 862.
- Featurizations matter: a multiview contrastive learning approach to molecular pretraining. In ICML 2022 2nd AI for Science Workshop.
Authors: Haisong Gong, Qiang Liu, Shu Wu, Liang Wang