
Leap: molecular synthesisability scoring with intermediates (2403.13005v2)

Published 14 Mar 2024 in q-bio.BM, cs.LG, and physics.chem-ph

Abstract: Assessing whether a molecule can be synthesised is a primary task in drug discovery. It enables computational chemists to filter for viable compounds or bias molecular generative models. The notion of synthesisability is dynamic as it evolves depending on the availability of key compounds. A common approach in drug discovery involves exploring the chemical space surrounding synthetically-accessible intermediates. This strategy improves the synthesisability of the derived molecules due to the availability of key intermediates. Existing synthesisability scoring methods such as SAScore, SCScore and RAScore, cannot condition on intermediates dynamically. Our approach, Leap, is a GPT-2 model trained on the depth, or longest linear path, of predicted synthesis routes that allows information on the availability of key intermediates to be included at inference time. We show that Leap surpasses all other scoring methods by at least 5% on AUC score when identifying synthesisable molecules, and can successfully adapt predicted scores when presented with a relevant intermediate compound.


Summary

  • The paper introduces Leap, a GPT-2–based model that dynamically integrates synthetic routes and intermediate availability to score molecular synthesisability.
  • The paper employs a novel string-based representation of synthetic trees to pre-train and fine-tune the model for accurate synthetic complexity prediction.
  • The paper demonstrates improved performance and adaptability over traditional methods, enhancing compound prioritization in drug discovery.

Novel Approach for Assessing Molecular Synthesisability with Leap: A GPT-2-Based Method

Introduction to Leap

In drug discovery, whether a molecule can be synthesised largely determines its viability as a drug candidate. Established synthesisability scoring methods such as SAScore, SCScore, and RAScore help computational chemists filter and prioritize compounds. However, none of them can condition on available intermediates at scoring time. Leap, a GPT-2 model trained on the depth of predicted synthesis routes, addresses this gap by allowing information on intermediate availability to be supplied at inference time.

Methodological Framework

Leap is a transformer-based model that, unlike earlier scores, works from synthetic trees: hierarchies that trace the synthesis from the target molecule back to available compounds. Training proceeds in two stages. The model first learns to predict the entire retrosynthesis route for a given compound. It is then fine-tuned to estimate the synthetic complexity of a molecule, both with and without intermediates supplied. As a result, Leap can adjust its synthesisability score when specific intermediates are available, which existing scoring methods cannot do.
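To make the notion of route depth concrete, the following is a minimal sketch of how the depth (longest linear path) of a synthetic tree could be computed. The `Node` class and `route_depth` function are hypothetical names introduced for illustration, and the three-molecule route is a toy example rather than data from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A molecule in a toy synthesis tree (hypothetical structure, not the paper's)."""
    smiles: str
    # Direct precursors; an empty list marks an available building block.
    precursors: List["Node"] = field(default_factory=list)


def route_depth(node: Node) -> int:
    """Return the longest linear path, counted in reaction steps, below `node`."""
    if not node.precursors:
        return 0  # a building block needs no synthesis steps
    return 1 + max(route_depth(p) for p in node.precursors)


# Toy route: nitrobenzene -> aniline, then aniline + acetyl chloride -> acetanilide.
aniline = Node("Nc1ccccc1", [Node("O=[N+]([O-])c1ccccc1")])
target = Node("CC(=O)Nc1ccccc1", [Node("CC(=O)Cl"), aniline])

print(route_depth(target))  # 2
```

The depth is driven by the longest branch only, so adding more purchasable reagents to a step does not change the value.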

Data Representation and Pre-Training

Leap relies on a string representation of synthetic routes that encodes both the sequence and the depth of the synthetic steps. This representation allows the model to be pre-trained on a large dataset of synthetic routes, built by combining and augmenting existing data to cover a diverse range of synthesis scenarios. Pre-training GPT-2 to predict these route strings gives Leap the basis for its later estimates of synthetic complexity.
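As a rough illustration of what a string encoding of a route might look like, the sketch below flattens a small route into one line of text, tagging each step with its depth. The format (`<depth>:<product><=<precursors>`, steps joined by ` | `) is an assumption made for this example and is not the paper's actual representation; `serialise_route`, `REACTIONS` and `TARGET` are hypothetical names.

```python
from typing import Dict, List

# Toy route, keyed by product SMILES; molecules absent from the keys are
# treated as available building blocks (an assumption for this sketch).
REACTIONS: Dict[str, List[str]] = {
    "CC(=O)Nc1ccccc1": ["CC(=O)Cl", "Nc1ccccc1"],
    "Nc1ccccc1": ["O=[N+]([O-])c1ccccc1"],
}
TARGET = "CC(=O)Nc1ccccc1"


def serialise_route(target: str, reactions: Dict[str, List[str]], depth: int = 0) -> List[str]:
    """Walk the route depth-first and emit one '<depth>:<product><=<precursors>' step per reaction."""
    precursors = reactions.get(target)
    if not precursors:
        return []  # building block: nothing to emit
    # '.' separates molecules, as in reaction SMILES.
    steps = [f"{depth}:{target}<={'.'.join(precursors)}"]
    for precursor in precursors:
        steps.extend(serialise_route(precursor, reactions, depth + 1))
    return steps


route_string = " | ".join(serialise_route(TARGET, REACTIONS))
print(route_string)
# 0:CC(=O)Nc1ccccc1<=CC(=O)Cl.Nc1ccccc1 | 1:Nc1ccccc1<=O=[N+]([O-])c1ccccc1
```

A flat string of this kind is what lets a language model such as GPT-2 be trained on routes with ordinary next-token prediction.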

Fine-Tuning for Synthetic Complexity Prediction

After pre-training, Leap is fine-tuned as a regression model that predicts the depth of the synthetic route for a target molecule, which serves as its measure of synthetic complexity. This stage calibrates the predictions so that they reflect how synthesisable a compound is in practical drug discovery settings.

Evaluation and Implications

In the reported experiments, Leap surpasses all other scoring methods by at least 5% on AUC score when identifying synthesisable molecules. It can also adapt its predicted score when presented with a relevant intermediate compound, a capability the earlier scores lack and one that could make compound prioritization in drug discovery more efficient.
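For readers unfamiliar with the metric, here is a minimal sketch of how an AUC can be computed for a depth-based score. It uses the pairwise (Mann-Whitney) formulation of ROC AUC; the labels and predicted depths are made-up toy values, not results from the paper, and `roc_auc` is a hypothetical helper name. Because a lower predicted depth means a more synthesisable molecule, the sketch assumes the negated depth is used as the ranking score.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 1 = synthesisable, 0 = not; depths are illustrative model outputs.
labels = [1, 1, 0, 0, 1]
predicted_depths = [1.2, 2.0, 4.5, 1.8, 3.0]

# Lower depth should rank higher, so negate before scoring (an assumption of this sketch).
auc = roc_auc(labels, [-d for d in predicted_depths])
print(round(auc, 3))  # 0.667
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.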

Insights from Leap's Performance

The evaluation yields several insights:

  • Generalization and Adaptability: Leap performs well on molecules close to its training data and also adapts to out-of-domain molecules, suggesting it could apply across a wide range of real-world drug discovery projects.
  • Dynamic Adjustment: Supplying intermediates at scoring time changes the predicted synthetic complexity, so the score reflects what is actually available to the chemist.
  • Future Developments: Leap could be integrated with generative models, and it opens avenues for exploring finer aspects of synthesis planning, such as the role of challenging intermediates.
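The dynamic adjustment described in the list above can be illustrated on the reference depth itself. The sketch below recomputes the depth of a toy route when an intermediate is declared available. It shows how the target value changes, not how Leap's learned model produces its prediction; `depth_given_available`, the `REACTIONS` dictionary and the molecule labels are hypothetical, and a real pipeline would use SMILES strings.

```python
from typing import Dict, FrozenSet, List


def depth_given_available(
    target: str,
    reactions: Dict[str, List[str]],
    available: FrozenSet[str] = frozenset(),
) -> int:
    """Longest linear path to `target`, treating `available` molecules as already in hand."""
    if target in available or target not in reactions:
        return 0  # purchasable building block or supplied intermediate
    return 1 + max(depth_given_available(p, reactions, available) for p in reactions[target])


# Toy three-step route; labels stand in for SMILES strings.
REACTIONS = {
    "target": ["intermediate_A", "block_1"],
    "intermediate_A": ["intermediate_B", "block_2"],
    "intermediate_B": ["block_3"],
}

baseline = depth_given_available("target", REACTIONS)
conditioned = depth_given_available("target", REACTIONS, frozenset({"intermediate_A"}))
print(baseline, conditioned)  # 3 1
```

Supplying `intermediate_A` removes the two steps needed to make it, which is the kind of score change a static method such as SAScore cannot express.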

Conclusion

Leap advances the computational assessment of molecular synthesisability. By incorporating intermediates into its scoring, it offers a more practical way to prioritize compounds in drug discovery than scores that ignore what is already available. The work gives computational chemists a new tool and sets the stage for closer integration of AI with synthetic chemistry, potentially shortening the path from concept to viable drug compounds.