Leap: molecular synthesisability scoring with intermediates

Published 14 Mar 2024 in q-bio.BM, cs.LG, and physics.chem-ph | (2403.13005v2)

Abstract: Assessing whether a molecule can be synthesised is a primary task in drug discovery. It enables computational chemists to filter for viable compounds or bias molecular generative models. The notion of synthesisability is dynamic as it evolves depending on the availability of key compounds. A common approach in drug discovery involves exploring the chemical space surrounding synthetically-accessible intermediates. This strategy improves the synthesisability of the derived molecules due to the availability of key intermediates. Existing synthesisability scoring methods such as SAScore, SCScore and RAScore, cannot condition on intermediates dynamically. Our approach, Leap, is a GPT-2 model trained on the depth, or longest linear path, of predicted synthesis routes that allows information on the availability of key intermediates to be included at inference time. We show that Leap surpasses all other scoring methods by at least 5% on AUC score when identifying synthesisable molecules, and can successfully adapt predicted scores when presented with a relevant intermediate compound.

Summary

The paper introduces Leap, a GPT-2–based model that dynamically integrates synthetic routes and intermediate availability to score molecular synthesisability.
The paper employs a novel string-based representation of synthetic trees to pre-train and fine-tune the model for accurate synthetic complexity prediction.
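The paper reports that Leap beats SAScore, SCScore and RAScore by at least 5% on AUC score when identifying synthesisable molecules, and that its scores adapt when a relevant intermediate is supplied, which helps compound prioritization in drug discovery.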

Novel Approach for Assessing Molecular Synthesisability with Leap: A GPT-2-Based Method

Introduction to Leap

In drug discovery, whether a molecule can be synthesised is a key factor in deciding whether it is worth pursuing. Established synthesisability scoring methods, such as SAScore, SCScore, and RAScore, help computational chemists filter and rank candidate compounds. However, these methods cannot take into account which intermediates are available at the time a molecule is scored. Leap addresses this gap: it is a GPT-2 model trained on the depth, or longest linear path, of predicted synthesis routes, and it accepts information about available intermediates at inference time.

Methodological Framework

Leap uses a transformer and, unlike earlier scoring methods, works from synthetic trees: hierarchies that describe the synthesis from the target molecule down to available compounds. Training has two stages. The model first learns to predict the entire retrosynthesis route for a given compound. It is then fine-tuned to estimate the synthetic complexity of a molecule, both with and without intermediates supplied. Because supplied intermediates are part of the input, Leap can change its synthesisability score when a specific intermediate is available, which the existing scoring methods named above cannot do.
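To make the idea of scoring with and without intermediates concrete, the sketch below computes the depth (longest linear path) of a toy synthetic tree, first on its own and then with one intermediate treated as already available. This is an illustration of the quantity being predicted, not the paper's code: the `RouteNode` class, the `route_depth` function, and the placeholder molecule labels are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import AbstractSet, List


@dataclass
class RouteNode:
    """One molecule in a synthetic tree; `precursors` are the molecules used to make it."""
    smiles: str
    precursors: List["RouteNode"] = field(default_factory=list)


def route_depth(node: RouteNode, available: AbstractSet[str] = frozenset()) -> int:
    """Longest linear path, in reaction steps, from `node` down to starting materials.

    A molecule with no precursors, or one listed in `available`, is treated
    as a starting material and contributes a depth of 0.
    """
    if not node.precursors or node.smiles in available:
        return 0
    return 1 + max(route_depth(child, available) for child in node.precursors)


# Hypothetical route: TARGET <- INT_B <- INT_A <- building blocks.
# The labels are placeholders, not real SMILES strings.
int_a = RouteNode("INT_A", [RouteNode("BB_1"), RouteNode("BB_2")])
int_b = RouteNode("INT_B", [int_a, RouteNode("BB_3")])
target = RouteNode("TARGET", [int_b])

depth_without = route_depth(target)             # no intermediates supplied
depth_with = route_depth(target, {"INT_B"})     # INT_B is already in hand

print(depth_without, depth_with)  # prints "3 1"
```

In this toy case, declaring `INT_B` available shortens the route from three steps to one, which is the kind of change in score that a model conditioned on intermediates is meant to capture.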

Data Representation and Pre-Training

Leap relies on a new data representation that converts a synthetic route into a structured string encoding both the sequence and the depth of the synthetic steps. This representation is used to pre-train the model on a large dataset of synthetic routes, built by combining and augmenting existing data so that it covers a diverse range of synthesis scenarios. Training GPT-2 to predict these route strings gives Leap a base understanding of synthetic complexity before fine-tuning.
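A string encoding of a synthetic tree can be sketched as follows. The exact format used by Leap is not given in this summary, so the `<FROM>`, `<SEP>`, and `<END>` tokens, the `serialize` function, and the `depth_from_string` helper are hypothetical choices made only to show how a flat string can preserve both the order and the depth of the steps.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RouteNode:
    """One molecule in a synthetic tree; `precursors` are the molecules used to make it."""
    smiles: str
    precursors: List["RouteNode"] = field(default_factory=list)


def serialize(node: RouteNode) -> str:
    """Flatten a tree depth-first into a space-separated token string.

    Hypothetical format: `product <FROM> precursor <SEP> precursor <END>`,
    applied recursively. Starting materials appear as bare labels.
    """
    if not node.precursors:
        return node.smiles
    inner = " <SEP> ".join(serialize(child) for child in node.precursors)
    return f"{node.smiles} <FROM> {inner} <END>"


def depth_from_string(encoded: str) -> int:
    """Recover the route depth as the maximum nesting level of <FROM> ... <END>."""
    level = deepest = 0
    for token in encoded.split():
        if token == "<FROM>":
            level += 1
            deepest = max(deepest, level)
        elif token == "<END>":
            level -= 1
    return deepest


# Placeholder labels stand in for real SMILES strings.
int_a = RouteNode("INT_A", [RouteNode("BB_1"), RouteNode("BB_2")])
int_b = RouteNode("INT_B", [int_a, RouteNode("BB_3")])
target = RouteNode("TARGET", [int_b])

encoded = serialize(target)
print(encoded)
print(depth_from_string(encoded))
```

Special tokens are used as delimiters here because real SMILES strings already contain characters such as parentheses and dots, so reusing those as separators would be ambiguous.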

Fine-Tuning for Synthetic Complexity Prediction

After pre-training, Leap is fine-tuned as a regression model that predicts the depth of the synthetic route for a target molecule, and the depth serves as the measure of synthetic complexity. This stage turns the route knowledge learned in pre-training into a single score that can be used to rank compounds in practical drug discovery settings.

Evaluation and Implications

In the reported experiments, Leap surpasses all other scoring methods by at least 5% on AUC score when identifying synthesisable molecules. It can also adapt its predicted scores when presented with a relevant intermediate compound, which the other scoring methods cannot do, and which matters for projects that explore the chemical space around accessible intermediates.
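For readers unfamiliar with the metric, the sketch below shows how an AUC score can be computed for a depth-based scorer. It is a minimal pure-Python version of ROC AUC, not the paper's evaluation code. The `roc_auc` function, the toy depths, and the toy labels are made up for illustration, and the sketch assumes that a lower predicted depth means a more synthesisable molecule, so depths are negated before ranking.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    `labels` holds 1 for synthesisable and 0 for not; a higher score should
    mean "more likely synthesisable". Ties count as half a correct pair.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy values, invented for this example only.
predicted_depths = [1.2, 5.0, 4.5, 3.1, 6.0]
is_synthesisable = [1, 1, 0, 1, 0]

# Assumption: shallower predicted routes indicate easier synthesis,
# so the negated depth is used as the ranking score.
scores = [-d for d in predicted_depths]
auc = roc_auc(is_synthesisable, scores)
print(round(auc, 3))  # prints 0.833
```

Five of the six positive/negative pairs are ordered correctly in this toy set, giving an AUC of 5/6.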

Insights from Leap's Performance

The robust evaluation process reveals several key insights:

Generalization and Adaptability: Leap performs well on molecules close to its training data and also adapts to out-of-domain molecules, which suggests it can be applied across a range of real-world drug discovery projects.
Dynamic Adjustment: Supplying intermediates at scoring time changes the predicted synthetic complexity, so a molecule that is reachable from an available intermediate is scored as easier to make than it would be from scratch.
Future Developments: The foundation laid by Leap paves the way for its integration with generative models and opens avenues for exploring more nuanced elements of synthesis planning, such as the role of challenging intermediates.

Conclusion

Leap advances the computational assessment of molecular synthesisability by taking available intermediates into account when it scores a molecule. This gives computational chemists a more practical way to prioritize compounds in drug discovery, and it provides a basis for future work that combines synthesisability scoring with generative models and synthesis planning.
