Retrosynthetic reaction prediction using neural sequence-to-sequence models (1706.01643v1)

Published 6 Jun 2017 in cs.LG, q-bio.QM, and stat.ML

Abstract: We describe a fully data driven model that learns to perform a retrosynthetic reaction prediction task, which is treated as a sequence-to-sequence mapping problem. The end-to-end trained model has an encoder-decoder architecture that consists of two recurrent neural networks, which has previously shown great success in solving other sequence-to-sequence prediction tasks such as machine translation. The model is trained on 50,000 experimental reaction examples from the United States patent literature, which span 10 broad reaction types that are commonly used by medicinal chemists. We find that our model performs comparably with a rule-based expert system baseline model, and also overcomes certain limitations associated with rule-based expert systems and with any machine learning approach that contains a rule-based expert system component. Our model provides an important first step towards solving the challenging problem of computational retrosynthetic analysis.

Authors (10)
  1. Bowen Liu (63 papers)
  2. Bharath Ramsundar (30 papers)
  3. Prasad Kawthekar (2 papers)
  4. Jade Shi (2 papers)
  5. Joseph Gomes (10 papers)
  6. Quang Luu Nguyen (1 paper)
  7. Stephen Ho (1 paper)
  8. Jack Sloane (1 paper)
  9. Paul Wender (1 paper)
  10. Vijay Pande (13 papers)
Citations (388)

Summary

Retrosynthetic Reaction Prediction Using Neural Sequence-to-Sequence Models

The paper presents a novel approach to retrosynthetic reaction prediction that leverages neural sequence-to-sequence (seq2seq) models and does not depend on rule-based systems. The research is situated within computational retrosynthetic analysis, a crucial area of synthetic chemistry that helps chemists design viable synthetic pathways to target molecules. Traditional methodologies, such as rule-based expert systems, have inherent limitations in generalizing beyond their knowledge bases, while approaches grounded in physical chemistry principles are computationally expensive. The proposed approach aims to address these limitations.

Methodology and Model Architecture

The work employs seq2seq models, an established framework frequently used in tasks such as machine translation, to map an input sequence representing the product molecule to an output sequence representing its reactants. The end-to-end learning framework is built from Long Short-Term Memory (LSTM) cells with a bidirectional encoder, and an attention mechanism is incorporated to align the source and target sequences.
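As a concrete illustration, the following is a minimal sketch of such an encoder-decoder in PyTorch, assuming a bidirectional LSTM encoder, a unidirectional LSTM decoder, and Luong-style dot-product attention. The framework choice, hyperparameters, and class name are illustrative assumptions and do not reproduce the authors' implementation.

```python
# A minimal sketch (not the authors' implementation) of the encoder-decoder
# architecture described above: a bidirectional LSTM encoder, an LSTM decoder,
# and Luong-style dot-product attention. All hyperparameters are illustrative.
import torch
import torch.nn as nn


class Seq2SeqWithAttention(nn.Module):
    def __init__(self, vocab_size=64, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Bidirectional LSTM encoder over the tokenized product SMILES.
        self.encoder = nn.LSTM(embed_dim, hidden_dim,
                               bidirectional=True, batch_first=True)
        # Unidirectional LSTM decoder that emits reactant SMILES tokens.
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, 2 * hidden_dim)   # score projection
        self.concat = nn.Linear(3 * hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src, tgt):
        enc_out, _ = self.encoder(self.embed(src))            # (B, S, 2H)
        dec_out, _ = self.decoder(self.embed(tgt))             # (B, T, H)
        # Attention: each decoder state attends over all encoder states.
        scores = torch.bmm(self.attn(dec_out), enc_out.transpose(1, 2))
        weights = torch.softmax(scores, dim=-1)                 # (B, T, S)
        context = torch.bmm(weights, enc_out)                   # (B, T, 2H)
        combined = torch.tanh(
            self.concat(torch.cat([dec_out, context], dim=-1)))
        return self.out(combined)                               # (B, T, vocab)


# Usage with random token ids (teacher forcing during training):
model = Seq2SeqWithAttention()
src = torch.randint(0, 64, (4, 60))   # batch of tokenized product SMILES
tgt = torch.randint(0, 64, (4, 70))   # shifted reactant SMILES tokens
logits = model(src, tgt)              # (4, 70, 64), fed to a cross-entropy loss
```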

A substantial dataset derived from the United States patent literature, encompassing approximately 50,000 reactions grouped into 10 broad reaction types, forms the training basis for the model. Each reaction is represented in Simplified Molecular Input Line Entry System (SMILES) notation, so that both products and reactants can be handled as plain character sequences suitable for sequence-to-sequence learning; a tokenization sketch is given below.
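The summary does not specify the exact preprocessing, so the snippet below uses a regex-based SMILES tokenizer that is common in sequence-model work on reactions; treat the pattern and the helper name as illustrative assumptions rather than the paper's exact scheme.

```python
# Illustrative SMILES tokenizer (an assumption, not necessarily the paper's
# preprocessing): splits a SMILES string into bracket atoms, two-letter
# elements, ring-closure digits, and single-character symbols.
import re

SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens for the seq2seq model."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    assert "".join(tokens) == smiles, f"could not fully tokenize: {smiles}"
    return tokens


# Aspirin as an example product string:
print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1',
#  'C', '(', '=', 'O', ')', 'O']
```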

Strong Numerical Results and Evaluation

The model's performance is assessed against a rule-based expert system that serves as the baseline. The seq2seq model performs comparably, and its top-50 accuracy exceeds the baseline model's maximum accuracy of 69.8%, showing that it can generalize beyond the limitations of hand-coded rules. The paper provides a detailed breakdown of accuracy across the reaction classes, identifying areas where the seq2seq model outperforms the baseline, particularly in classes involving protections and deprotections, which typically involve complex leaving groups that standard reaction rules do not capture well.
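The top-k figures rest on an exact-match criterion over ranked candidate reactant sets. The sketch below shows one plausible way to compute such a metric; canonicalizing with RDKit and the function names are assumptions for illustration, not the paper's evaluation code.

```python
# Hedged sketch of a top-k exact-match evaluation (an illustration, not the
# paper's evaluation code). A test case counts as correct if any of the k
# highest-ranked candidate reactant strings matches the recorded reactants
# after canonicalization with RDKit.
from rdkit import Chem


def canonicalize(smiles):
    """Return the canonical SMILES, or None if the string cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def top_k_accuracy(ranked_candidates, ground_truths, k=50):
    """ranked_candidates: one ranked list of candidate SMILES per test case
    (e.g. from beam search); ground_truths: the reference reactant SMILES."""
    hits = 0
    for candidates, truth in zip(ranked_candidates, ground_truths):
        truth_canon = canonicalize(truth)
        top_k = {canonicalize(c) for c in candidates[:k]}
        if truth_canon is not None and truth_canon in top_k:
            hits += 1
    return hits / len(ground_truths)
```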

However, limitations are noted in categories such as heteroatom alkylation and acylation, where the model struggles with the prediction variability inherent to these reactions. These insights highlight areas for potential refinement in further research.

Implications and Future Directions

The implications of this research are significant both for practical applications in synthetic chemistry and for further theoretical exploration of AI models for chemical reactions. Practically, models that predict synthetic routes without extensive rule-based systems can significantly accelerate drug discovery and the synthesis of novel materials. Theoretically, the work underscores the potential of neural architectures to learn complex, non-linear scientific principles directly from data.

Future work could enhance this seq2seq model's performance by exploring architectural variants, adjusting preprocessing mechanisms, or implementing novel one-shot learning techniques to adapt and predict reactions even within classes with sparse reaction data. Moreover, integrating external molecular knowledge with neural architectures might mitigate some inaccuracies by providing chemical insights that model learning alone may not capture.

Conclusion

This paper represents a meaningful advance in the application of machine learning to chemical synthesis, suggesting that seq2seq models hold significant promise as successors to traditional rule-based systems. While further work is needed to improve accuracy and broaden applicability, particularly for complex reactions, the approach is poised to contribute significantly to computational chemistry and cheminformatics. Such developments could also democratize synthetic planning, extending these tools from expert chemists to broader audiences less familiar with intricate synthetic design principles.