GP-MoLFormer: A Foundation Model For Molecular Generation (2405.04912v1)
Abstract: Transformer-based models trained on large, general-purpose datasets of molecular strings have recently emerged as a powerful tool for modeling various structure-property relations. Inspired by this success, in this work we extend the paradigm of training chemical language transformers on large-scale chemical datasets to generative tasks. Specifically, we propose GP-MoLFormer, an autoregressive molecular string generator trained on more than 1.1B chemical SMILES. GP-MoLFormer uses a 46.8M-parameter transformer decoder with linear attention and rotary positional encodings as the base architecture. We explore the utility of GP-MoLFormer in generating novel, valid, and unique SMILES. Impressively, we find GP-MoLFormer generates a significant fraction of novel, valid, and unique SMILES even when the number of generated molecules reaches 10 billion and the reference set exceeds a billion. We also find strong memorization of training data in GP-MoLFormer generations, a phenomenon that has so far remained unexplored for chemical language models. Our analyses reveal that both training-data memorization and novelty in generations are affected by the quality of the training data; duplication bias in the training data can enhance memorization at the cost of lowering novelty. We evaluate GP-MoLFormer's utility and compare it with existing baselines on three tasks: de novo generation, scaffold-constrained molecular decoration, and unconstrained property-guided optimization. The first two require no additional training; for the last, we propose a parameter-efficient fine-tuning method that uses property-ordered molecular pairs as input. We call this new approach pair-tuning. Our results show GP-MoLFormer performs better than, or comparably to, the baselines across all three tasks, demonstrating its general utility.
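Two illustrative sketches of the mechanics the abstract describes follow. First, a minimal sketch of the validity/uniqueness/novelty bookkeeping, using RDKit; the `generated` and `training_canon` arguments are hypothetical placeholders for the model's samples and a canonicalized training set, and the ratio conventions used here (uniqueness over valid samples, novelty over unique ones) are one common choice, not necessarily the paper's exact definitions.

```python
from rdkit import Chem

def generation_metrics(generated, training_canon):
    """Return (validity, uniqueness, novelty) fractions for SMILES samples.

    generated      : list of sampled SMILES strings
    training_canon : set of canonical SMILES from the training corpus
    """
    valid = []
    for smi in generated:
        mol = Chem.MolFromSmiles(smi)            # None if the string is invalid
        if mol is not None:
            valid.append(Chem.MolToSmiles(mol))  # canonical form
    unique = set(valid)
    novel = unique - training_canon              # not seen during training
    validity = len(valid) / max(len(generated), 1)
    uniqueness = len(unique) / max(len(valid), 1)
    novelty = len(novel) / max(len(unique), 1)
    return validity, uniqueness, novelty
```

Second, since pair-tuning is described only at the level of parameter-efficient fine-tuning on property-ordered molecular pairs (in the spirit of prompt tuning), here is a hedged PyTorch sketch of that idea: a frozen decoder plus a small trainable soft prompt, trained with next-token loss on pairs serialized as a low-property SMILES followed by a high-property SMILES. The `embed`, `decoder`, and `lm_head` attribute names are assumptions for illustration, not GP-MoLFormer's actual interface.

```python
import torch
import torch.nn as nn

class PairTuner(nn.Module):
    """Freeze the pretrained decoder; train only a continuous soft prompt on
    property-ordered pairs ("<bos> low_smiles <sep> high_smiles <eos>")."""

    def __init__(self, backbone: nn.Module, d_model: int, n_prompt: int = 8):
        super().__init__()
        self.backbone = backbone          # hypothetical GP-MoLFormer-style decoder
        for p in self.backbone.parameters():
            p.requires_grad = False       # parameter-efficient: backbone is frozen
        self.soft_prompt = nn.Parameter(0.02 * torch.randn(n_prompt, d_model))

    def forward(self, input_ids: torch.LongTensor) -> torch.Tensor:
        tok_emb = self.backbone.embed(input_ids)                    # [B, T, D]
        prompt = self.soft_prompt.unsqueeze(0).expand(tok_emb.size(0), -1, -1)
        hidden = self.backbone.decoder(torch.cat([prompt, tok_emb], dim=1))
        logits = self.backbone.lm_head(hidden)                      # [B, P+T, V]
        return logits[:, self.soft_prompt.size(0):]                 # drop prompt slots

# Training would use standard next-token cross-entropy, masked so the loss falls
# only on the high-property half of each pair; at inference, sampling conditioned
# on the prompt plus a seed molecule proposes property-improved analogs.
```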
Authors: Jerret Ross, Brian Belgodere, Samuel C. Hoffman, Vijil Chenthamarakshan, Youssef Mroueh, Payel Das