
Gotta be SAFE: A New Framework for Molecular Design (2310.10773v2)

Published 16 Oct 2023 in cs.LG and q-bio.BM

Abstract: Traditional molecular string representations, such as SMILES, often pose challenges for AI-driven molecular design due to their non-sequential depiction of molecular substructures. To address this issue, we introduce Sequential Attachment-based Fragment Embedding (SAFE), a novel line notation for chemical structures. SAFE reimagines SMILES strings as an unordered sequence of interconnected fragment blocks while maintaining compatibility with existing SMILES parsers. It streamlines complex generative tasks, including scaffold decoration, fragment linking, polymer generation, and scaffold hopping, while facilitating autoregressive generation for fragment-constrained design, thereby eliminating the need for intricate decoding or graph-based models. We demonstrate the effectiveness of SAFE by training an 87-million-parameter GPT2-like model on a dataset containing 1.1 billion SAFE representations. Through targeted experimentation, we show that our SAFE-GPT model exhibits versatile and robust optimization performance. SAFE opens up new avenues for the rapid exploration of chemical space under various constraints, promising breakthroughs in AI-driven molecular design.
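
The idea of "interconnected fragment blocks" inside a string that stays SMILES-compatible can be illustrated with a small sketch. This is not the authors' implementation. It assumes that fragments are separated by `.` and that a ring-closure label left unpaired inside one fragment is closed in another fragment, forming the bond between them. The function names (`ring_labels`, `dangling_labels`, `fragment_links`) and the example string for ethylcyclohexane are hypothetical, written by hand for illustration.

```python
import re
from collections import Counter, defaultdict

# Matches bracket atoms (skipped below), two-digit ring labels (%nn),
# or single-digit ring labels.
_TOKEN = re.compile(r"\[[^\]]*\]|%\d{2}|\d")


def ring_labels(fragment):
    """Return the ring-closure labels of a SMILES fragment as integers.

    Digits inside bracket atoms (isotopes, H counts, charges) are ignored.
    """
    labels = []
    for tok in _TOKEN.findall(fragment):
        if tok.startswith("["):
            continue
        labels.append(int(tok.lstrip("%")))
    return labels


def dangling_labels(fragment):
    """Labels that occur an odd number of times, i.e. are not closed locally."""
    counts = Counter(ring_labels(fragment))
    return {label for label, n in counts.items() if n % 2 == 1}


def fragment_links(safe_like):
    """Split a dot-separated string and pair fragments by shared open labels.

    Simplification (assumption): each inter-fragment label is used for
    exactly one bond; label reuse across several bonds is not handled.
    """
    fragments = safe_like.split(".")
    owners = defaultdict(list)
    for idx, frag in enumerate(fragments):
        for label in sorted(dangling_labels(frag)):
            owners[label].append(idx)
    links = sorted(
        (tuple(idxs), label) for label, idxs in owners.items() if len(idxs) == 2
    )
    return fragments, links


if __name__ == "__main__":
    # Hand-written example: an ethyl block and a cyclohexane block joined by label 3.
    print(fragment_links("CC3.C13CCCCC1"))
    # Swapping the blocks describes the same connectivity.
    print(fragment_links("C13CCCCC1.CC3"))
```

In both calls the single link is reported between fragment 0 and fragment 1 through label 3, which matches the abstract's description of the fragment blocks as an unordered sequence: the connectivity is carried by the shared labels, not by the position of each block.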

Authors (5)
  1. Emmanuel Noutahi (11 papers)
  2. Cristian Gabellini (4 papers)
  3. Michael Craig (6 papers)
  4. Jonathan S. C. Lim (1 paper)
  5. Prudencio Tossou (11 papers)
