Accurate and efficient structure elucidation from routine one-dimensional NMR spectra using multitask machine learning

Published 15 Aug 2024 in physics.chem-ph and cs.LG | (2408.08284v1)

Abstract: Rapid determination of molecular structures can greatly accelerate workflows across many chemical disciplines. However, elucidating structure using only one-dimensional (1D) NMR spectra, the most readily accessible data, remains an extremely challenging problem because of the combinatorial explosion of the number of possible molecules as the number of constituent atoms is increased. Here, we introduce a multitask machine learning framework that predicts the molecular structure (formula and connectivity) of an unknown compound solely based on its 1D 1H and/or 13C NMR spectra. First, we show how a transformer architecture can be constructed to efficiently solve the task, traditionally performed by chemists, of assembling large numbers of molecular fragments into molecular structures. Integrating this capability with a convolutional neural network (CNN), we build an end-to-end model for predicting structure from spectra that is fast and accurate. We demonstrate the effectiveness of this framework on molecules with up to 19 heavy (non-hydrogen) atoms, a size for which there are trillions of possible structures. Without relying on any prior chemical knowledge such as the molecular formula, we show that our approach predicts the exact molecule 69.6% of the time within the first 15 predictions, reducing the search space by up to 11 orders of magnitude.


Summary

The paper introduces a transformer-based model that recovers the correct molecular connectivity from substructure data within its first 15 predictions 93.2% of the time.
It integrates CNNs with the pretrained transformer to predict the exact structure from spectra within the first 15 predictions 69.6% of the time, reducing the search space by up to 11 orders of magnitude.
The framework operates directly from raw 1H and 13C NMR spectra, offering a scalable tool for automated chemical analysis.

Accurate and Efficient Structure Elucidation from Routine One-Dimensional NMR Spectra Using Multitask Machine Learning

The paper presents a multitask ML framework that determines molecular structures and substructures from one-dimensional (1D) Nuclear Magnetic Resonance (NMR) spectra, specifically 1H and 13C NMR. The framework combines a transformer with convolutional neural networks (CNNs) to perform accurate and rapid structure elucidation, addressing a pivotal challenge in chemical research.

The primary contribution is a transformer-based ML model that predicts the molecular structure (formula and connectivity) directly from raw 1D NMR spectra. Notably, the model operates without relying on any prior chemical knowledge such as molecular formulas or fragments, a significant departure from traditional methods.

Methodology

The research consists of two interconnected tasks: substructure-to-structure prediction and spectrum-to-structure prediction. Initially, the authors pretrain a transformer model on a substructure-to-structure task to predict molecular connectivity from substructure data. This model achieves an accuracy of 93.2% within the first 15 predictions for molecules with up to 19 heavy atoms.
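The "within the first 15 predictions" metric is a top-k accuracy. The following is a minimal sketch of how such a score could be computed, assuming each molecule is represented by a canonical string (for example a canonical SMILES) so that exact string equality means "same molecule". The function name `top_k_accuracy` and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def top_k_accuracy(ranked_predictions: Sequence[Sequence[str]],
                   targets: Sequence[str],
                   k: int = 15) -> float:
    """Fraction of targets found among the first k ranked candidates."""
    if len(ranked_predictions) != len(targets):
        raise ValueError("need one ranked candidate list per target")
    if not targets:
        raise ValueError("no examples to score")
    hits = sum(
        target in candidates[:k]
        for candidates, target in zip(ranked_predictions, targets)
    )
    return hits / len(targets)


# Toy data: hypothetical candidate strings, not results from the paper.
ranked = [
    ["CCO", "COC"],            # target is ranked first: hit
    ["CC=O", "C=CO"],          # target is absent: miss
    ["c1ccccc1", "C1CCCCC1"],  # target is ranked second: hit only if k >= 2
]
truth = ["CCO", "CC(=O)O", "C1CCCCC1"]

top15 = top_k_accuracy(ranked, truth, k=15)
top1 = top_k_accuracy(ranked, truth, k=1)
print(f"top-15: {top15:.3f}, top-1: {top1:.3f}")
```

In practice the candidate and target strings would need to be canonicalized by a cheminformatics toolkit before comparison; that step is omitted here.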

The pretrained transformer is then integrated into a multitask model in which CNNs process the spectral data. This end-to-end model takes raw 1D 1H and 13C NMR spectra as input and simultaneously predicts the presence of substructures and the full molecular structure. The multitask framework reaches a structure prediction accuracy of 69.6% within the first 15 predictions, reducing the chemical search space by up to 11 orders of magnitude for systems with up to 19 heavy atoms.
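The size of the search-space reduction can be estimated with simple arithmetic: if N structures are possible and the model returns a shortlist of k candidates, the reduction is log10(N / k) orders of magnitude. The sketch below assumes illustrative candidate counts on the scale of "trillions", as the abstract describes for 19 heavy atoms; the exact counts used by the authors are not given here, and the function name is hypothetical.

```python
import math


def orders_of_magnitude_reduction(n_candidates: float, shortlist_size: int) -> float:
    """log10 of the factor by which a shortlist shrinks the candidate pool."""
    if n_candidates <= 0 or shortlist_size <= 0:
        raise ValueError("both arguments must be positive")
    return math.log10(n_candidates / shortlist_size)


# Assumed candidate counts (illustrative only, not figures from the paper).
for n_candidates in (1e12, 2e12):
    reduction = orders_of_magnitude_reduction(n_candidates, shortlist_size=15)
    print(f"{n_candidates:.0e} candidates -> 15: {reduction:.2f} orders of magnitude")
```

Under these assumptions a pool of a few trillion structures narrowed to 15 candidates corresponds to roughly 11 orders of magnitude, consistent with the "up to 11" figure quoted in the abstract.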

Results and Performance Analysis

The study reports detailed performance metrics for its ML models. When trained on combined 1H and 13C NMR spectra, the multitask model was more accurate than when trained on either spectrum alone. Specifically:

Combined Spectra (with pretrained transformer): 69.6% accuracy
1H NMR Spectra Only: 59.6% accuracy
13C NMR Spectra Only: 22.0% accuracy

Moreover, pretraining the transformer significantly improved the multitask model's performance: accuracy dropped to 53.3% when the transformer was initialized with random weights instead.
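The gaps between these configurations can be tabulated directly from the reported figures. The snippet below is a small bookkeeping sketch; the dictionary keys are descriptive labels chosen here, and only the four percentages come from the text above.

```python
# Reported accuracies (percent) for each configuration, as quoted above.
accuracy = {
    "1H + 13C, pretrained transformer": 69.6,
    "1H only": 59.6,
    "13C only": 22.0,
    "1H + 13C, randomly initialized transformer": 53.3,
}

reference = "1H + 13C, pretrained transformer"

# Percentage-point gap of each configuration relative to the best one.
gaps = {
    name: round(accuracy[reference] - value, 1)
    for name, value in accuracy.items()
    if name != reference
}

for name, gap in gaps.items():
    print(f"{name}: {gap} points below the combined, pretrained model")
```

Read this way, dropping the 1H spectrum costs far more than dropping the 13C spectrum, and removing pretraining costs more than removing the 13C input.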

On the substructure prediction task, the multitask model's confident predictions were highly reliable, with a true positive rate of 96.3% for substructures predicted with high probability (>0.9) and a true negative rate of 99.8% for low probability predictions (<0.1). These findings indicate that the framework's confident substructure predictions can be used to constrain the structure search.
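One way to read these two rates is as the reliability of the model's confident predictions: among substructures assigned a probability above 0.9, the fraction that are truly present, and among those assigned a probability below 0.1, the fraction that are truly absent. The sketch below computes both quantities under that reading, which is an assumption about how the metric is defined; the function name and the toy data are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def confident_prediction_rates(
    probabilities: Sequence[float],
    labels: Sequence[int],
    high: float = 0.9,
    low: float = 0.1,
) -> Tuple[Optional[float], Optional[float]]:
    """Return (share of p > high that are present, share of p < low that are absent).

    labels are 1 if the substructure is truly present and 0 otherwise.
    A rate is None when no prediction falls in the corresponding range.
    """
    if len(probabilities) != len(labels):
        raise ValueError("probabilities and labels must have the same length")
    confident_present = [y for p, y in zip(probabilities, labels) if p > high]
    confident_absent = [y for p, y in zip(probabilities, labels) if p < low]
    present_rate = (
        sum(confident_present) / len(confident_present) if confident_present else None
    )
    absent_rate = (
        sum(1 - y for y in confident_absent) / len(confident_absent)
        if confident_absent
        else None
    )
    return present_rate, absent_rate


# Toy data: hypothetical probabilities and ground-truth labels.
probs = [0.95, 0.97, 0.92, 0.50, 0.05, 0.02, 0.08, 0.99]
truth = [1, 1, 0, 1, 0, 0, 1, 1]

present_rate, absent_rate = confident_prediction_rates(probs, truth)
print(f"present when p > 0.9: {present_rate:.2f}")
print(f"absent when p < 0.1: {absent_rate:.2f}")
```

Predictions with probabilities between the two thresholds are deliberately ignored, which is why the two rates describe only the confident subset.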

Implications and Future Work

The research has both theoretical and practical implications. On a theoretical level, the approach shows that transformers can take on substructure and structure elucidation tasks, a role traditionally filled by chemists and heuristic algorithms. Practically, the tool offers a scalable approach to the combinatorial problem of molecular structure elucidation, which could be particularly beneficial for large-scale chemical databases and automated synthetic workflows.

Future developments could extend this framework to larger molecules and more diverse chemical species by expanding the set of substructures and incorporating additional elements and stereochemistry predictions. Enhancing the model’s training dataset with experimental NMR spectra from varied conditions could further elevate its robustness and accuracy.

In summary, this paper presents a machine learning framework for molecular structure elucidation from routine NMR spectra, marking a significant step forward in the integration of AI into chemical research. The research offers a promising pathway for the development of rapid, accurate, and automated chemical analysis tools, potentially transforming workflows in chemical discovery and education.
