Leap: molecular synthesisability scoring with intermediates (2403.13005v2)
Abstract: Assessing whether a molecule can be synthesised is a primary task in drug discovery. It enables computational chemists to filter for viable compounds or to bias molecular generative models. The notion of synthesisability is dynamic: it evolves with the availability of key compounds. A common approach in drug discovery is to explore the chemical space surrounding synthetically accessible intermediates, which improves the synthesisability of the derived molecules because the key intermediates are already available. Existing synthesisability scoring methods, such as SAScore, SCScore and RAScore, cannot condition on intermediates dynamically. Our approach, Leap, is a GPT-2 model trained on the depth, or longest linear path, of predicted synthesis routes; it allows information on the availability of key intermediates to be included at inference time. We show that Leap surpasses all other scoring methods by at least 5% on AUC score when identifying synthesisable molecules, and can successfully adapt predicted scores when presented with a relevant intermediate compound.
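To make the training target concrete, the sketch below computes the depth (longest linear path) of a toy synthesis route and shows how that depth drops once an intermediate is treated as available. It is only an illustration of the quantity described in the abstract, not the authors' implementation: the `Node` class, the `route_depth` function, and the placeholder compound labels are all assumptions introduced here, and Leap itself is a GPT-2 model that predicts this kind of depth rather than computing it from an explicit route.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical route-tree node: a compound and the precursors used to make it."""
    label: str  # placeholder compound label (a real route would carry SMILES)
    precursors: list[Node] = field(default_factory=list)


def route_depth(node: Node, available: frozenset[str] = frozenset()) -> int:
    """Count reaction steps on the longest linear path below `node`.

    A compound with no precursors, or one listed in `available`,
    is treated as a starting material and contributes zero steps.
    """
    if not node.precursors or node.label in available:
        return 0
    return 1 + max(route_depth(p, available) for p in node.precursors)


# Toy route: target <- (A, intermediate), intermediate <- (B, C), C <- (D, E)
c = Node("C", [Node("D"), Node("E")])
intermediate = Node("intermediate", [Node("B"), c])
target = Node("target", [Node("A"), intermediate])

depth_without = route_depth(target)
depth_with = route_depth(target, frozenset({"intermediate"}))
print(depth_without, depth_with)
```

In this toy route the target needs three sequential steps from purchasable materials, but only one step once the intermediate is on hand, which is the kind of shift a score conditioned on intermediates is meant to capture.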