Chimera: Accurate retrosynthesis prediction by ensembling models with diverse inductive biases (2412.05269v1)

Published 6 Dec 2024 in cs.LG, cs.AI, and q-bio.QM

Abstract: Planning and conducting chemical syntheses remains a major bottleneck in the discovery of functional small molecules, and prevents fully leveraging generative AI for molecular inverse design. While early work has shown that ML-based retrosynthesis models can predict reasonable routes, their low accuracy for less frequent, yet important reactions has been pointed out. As multi-step search algorithms are limited to reactions suggested by the underlying model, the applicability of those tools is inherently constrained by the accuracy of retrosynthesis prediction. Inspired by how chemists use different strategies to ideate reactions, we propose Chimera: a framework for building highly accurate reaction models that combine predictions from diverse sources with complementary inductive biases using a learning-based ensembling strategy. We instantiate the framework with two newly developed models, which already by themselves achieve state of the art in their categories. Through experiments across several orders of magnitude in data scale and time-splits, we show Chimera outperforms all major models by a large margin, owing both to the good individual performance of its constituents, but also to the scalability of our ensembling strategy. Moreover, we find that PhD-level organic chemists prefer predictions from Chimera over baselines in terms of quality. Finally, we transfer the largest-scale checkpoint to an internal dataset from a major pharmaceutical company, showing robust generalization under distribution shift. With the new dimension that our framework unlocks, we anticipate further acceleration in the development of even more accurate models.

Summary

The paper introduces Chimera, an ensemble framework that significantly improves retrosynthesis prediction accuracy by combining diverse models with different inductive biases.
Experimental results show Chimera sets new benchmarks in top-k prediction accuracy across multiple datasets, including robust performance on novel reaction types.
Chimera's ability to generalize to unseen reactions enhances its practical utility for accelerating drug discovery and materials science applications.

Chimera: Enhancing Retrosynthesis Prediction through Model Ensembling

The paper "Chimera: Accurate Retrosynthesis Prediction by Ensembling Models with Diverse Inductive Biases" tackles the long-standing problem of retrosynthesis planning, that is, working backward from a target molecule to the reactants that could produce it. The task is central to the development of pharmaceuticals and materials. The work introduces Chimera, a meta-framework that uses ensemble learning to improve the accuracy of retrosynthesis prediction by combining the strengths of models with different inductive biases.

Framework and Methodology

Chimera's core innovation is its ensembling approach, which aggregates the predictions of diverse models that each achieve state-of-the-art performance in their own category. The ensemble consists of two newly developed models: NeuralLoc, a molecule-editing model, and R-SMILES 2, a de-novo generative model. NeuralLoc uses a graph neural network (GNN) to classify reaction templates, followed by a novel localization mechanism that determines where in the molecule the chosen template should be applied. R-SMILES 2 builds on the transformer architecture, with design adjustments such as grouped query attention and RMS normalization that improve efficiency and scalability.
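The two transformer adjustments named above are general techniques, and a minimal sketch can show what each one does. The code below is not taken from R-SMILES 2: the function names, the epsilon value, and the head counts are assumptions chosen for illustration, and plain Python lists stand in for tensors.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Divide x by its root mean square, then apply a per-feature gain.

    Unlike layer normalization, no mean is subtracted.
    eps (an assumed value) guards against division by zero.
    """
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * g for v, g in zip(x, gain)]


def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Grouped query attention: consecutive query heads share one key/value head."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


# Hypothetical sizes: 8 query heads sharing 2 key/value heads.
normalized = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
head_map = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
```

Sharing key/value heads across a group of query heads shrinks the key/value cache at inference time, and RMS normalization drops the mean-centering step of layer normalization. Both are common choices when efficiency matters.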

In the ensemble, the reactant candidates proposed by the constituent models are merged and reordered by a learning-to-rank strategy that optimizes the final ordering. This lets the ensemble draw on the diversity of its constituent models, which in turn yields high top-k prediction accuracy.
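One way to picture such a merging step is a rank-based scoring rule, sketched below. This is an illustration under stated assumptions and not the paper's exact method: it assumes each model contributes a weight that depends only on the rank at which it proposed a candidate, that candidates are compared as strings already canonicalized and deduplicated, and that the weights (hand-picked here) would in practice be learned. The names `ensemble_rank`, `model_outputs`, and `rank_weights` are hypothetical.

```python
from collections import defaultdict


def ensemble_rank(model_outputs, rank_weights):
    """Merge best-first candidate lists from several models into one ranking.

    model_outputs: {model_name: [candidate, ...]}, each list ordered best-first.
    rank_weights:  {model_name: [weight for rank 0, weight for rank 1, ...]}.
    Candidates beyond the last weighted rank contribute nothing.
    """
    scores = defaultdict(float)
    for model, candidates in model_outputs.items():
        weights = rank_weights[model]
        for rank, candidate in enumerate(candidates):
            if rank < len(weights):
                scores[candidate] += weights[rank]
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Placeholder candidate names stand in for reactant sets.
model_outputs = {
    "editing_model": ["reactants_A", "reactants_B", "reactants_C"],
    "de_novo_model": ["reactants_B", "reactants_D", "reactants_A"],
}
rank_weights = {
    "editing_model": [1.0, 0.6, 0.3],  # hand-picked, not learned
    "de_novo_model": [0.9, 0.5, 0.2],
}
merged = ensemble_rank(model_outputs, rank_weights)
```

In this toy case `reactants_B` rises to the top because both models rank it highly, even though neither input list is identical to the merged ordering.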

Experimental Evaluation

The ensemble was evaluated on multiple datasets, including public benchmarks (USPTO-50K and USPTO-FULL) and a proprietary large-scale dataset, Pistachio. The results show that Chimera is more accurate than existing models, particularly for top-k predictions, where it consistently set new performance benchmarks. Notably, Chimera outperformed all existing methods for top-10 predictions on both datasets, highlighting its robustness in handling both common and rare reaction types. A notable aspect of the evaluation is Chimera's performance on reactions with limited precedent in the training data, which have traditionally been a major bottleneck for ML-based synthesis planners.
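For readers unfamiliar with the metric, top-k accuracy is commonly computed as the fraction of test targets whose recorded reactants appear among the model's first k predictions. The sketch below follows that common definition. The function name, the dictionary layout, and the placeholder strings are assumptions for illustration, and matching is done by exact string equality, which presumes both sides were canonicalized beforehand.

```python
def top_k_accuracy(predictions, ground_truth, k):
    """Fraction of targets whose true reactants appear in the first k predictions.

    predictions:  {target: [candidate, ...]}, each list ordered best-first.
    ground_truth: {target: true_reactants}.
    Targets with no predictions count as misses.
    """
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    hits = sum(
        1
        for target, truth in ground_truth.items()
        if truth in predictions.get(target, [])[:k]
    )
    return hits / len(ground_truth)


# Toy data with placeholder names, not real molecules.
predictions = {
    "target_1": ["a", "b", "c"],
    "target_2": ["x", "y", "z"],
    "target_3": ["p", "q"],
}
ground_truth = {"target_1": "a", "target_2": "z", "target_3": "r"}

top1 = top_k_accuracy(predictions, ground_truth, k=1)
top3 = top_k_accuracy(predictions, ground_truth, k=3)
```

Here only `target_1` is correct at rank 1, `target_2` is recovered once three candidates are allowed, and `target_3` is never recovered, so accuracy grows with k but stays below 1.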

Implications and Future Directions

Chimera’s introduction of model ensembling in the retrosynthesis domain addresses the challenge of balancing accuracy and reaction diversity, providing a pathway towards more precise chemical synthesis predictions. The ability of Chimera to generalize to unseen reaction classes also enhances its utility in real-world settings, as demonstrated by its performance on unseen data from a major pharmaceutical company without additional fine-tuning.

This approach lays the groundwork for further exploration of ensemble-based methods in chemistry and similar domains. Future work can build on Chimera’s framework by adding more models to the ensemble or by refining the component models to capture more nuanced reaction behavior. The sustained performance improvements observed suggest potential for accelerating drug discovery and materials science applications.

In conclusion, Chimera represents a notable advance in computational retrosynthesis, providing a robust framework that combines the strengths of diverse model architectures. The paper's findings advance the field's understanding of ensemble predictions in chemical synthesis and illustrate how machine learning strategies can help address complex scientific challenges.
