Saturn: Sample-efficient Generative Molecular Design using Memory Manipulation (2405.17066v1)
Abstract: Generative molecular design for drug discovery has recently seen a wave of experimental validation, with language-based backbones being the most common architectures employed. The most important factor for downstream success is whether an in silico oracle is well correlated with the desired end-point. To this end, current methods use cheaper, higher-throughput proxy oracles before evaluating the most promising subset with high-fidelity oracles. The ability to directly optimize high-fidelity oracles would greatly enhance generative design and would be expected to improve hit rates. However, current models are not efficient enough to consider such a prospect, which exemplifies the sample efficiency problem. In this work, we introduce Saturn, which leverages the Augmented Memory algorithm and demonstrates the first application of the Mamba architecture for generative molecular design. We elucidate how experience replay with data augmentation improves sample efficiency and how Mamba synergistically exploits this mechanism. Saturn outperforms 22 models on multi-parameter optimization tasks relevant to drug discovery and may possess sufficient sample efficiency to consider the prospect of directly optimizing high-fidelity oracles.
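The phrase "experience replay with data augmentation" can be made concrete with a small sketch. The code below is an illustration of the general idea only, not Saturn's or Augmented Memory's implementation: the class `ReplayBuffer`, its methods, and the `augment` callable are hypothetical names introduced here. In practice `augment` would typically return a randomized SMILES string for the same molecule; here it is supplied by the caller so the example has no cheminformatics dependency. The point it shows is that one oracle call can yield several training examples, because every alternative SMILES of a molecule shares that molecule's reward.

```python
import heapq
from typing import Callable, Dict, List, Tuple


class ReplayBuffer:
    """Hypothetical buffer keeping the highest-reward unique SMILES seen so far."""

    def __init__(self, capacity: int = 100) -> None:
        self.capacity = capacity
        self._best: Dict[str, float] = {}

    def add(self, smiles: str, reward: float) -> None:
        # Keep the best reward observed for each unique string.
        if reward > self._best.get(smiles, float("-inf")):
            self._best[smiles] = reward
        # Once over capacity, retain only the top-`capacity` entries by reward.
        if len(self._best) > self.capacity:
            keep = heapq.nlargest(
                self.capacity, self._best.items(), key=lambda kv: kv[1]
            )
            self._best = dict(keep)

    def sample_augmented(
        self, augment: Callable[[str], str], rounds: int
    ) -> List[List[Tuple[str, float]]]:
        # Each round replays every stored molecule in augmented form while
        # reusing its stored reward, so no extra oracle calls are spent.
        return [
            [(augment(s), r) for s, r in self._best.items()]
            for _ in range(rounds)
        ]


# Three oracle evaluations, buffer of size two (rewards are made-up values).
buffer = ReplayBuffer(capacity=2)
for smi, reward in [("CCO", 0.2), ("c1ccccc1", 0.9), ("CC(=O)O", 0.5)]:
    buffer.add(smi, reward)

# Identity augmentation stands in for SMILES randomization (an assumption).
batches = buffer.sample_augmented(augment=lambda s: s, rounds=3)
```

With three rounds, the two retained molecules produce six (string, reward) training pairs from the original oracle calls. A real agent update would compute a loss on each batch, which is outside the scope of this sketch.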