SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules (1703.07076v2)

Published 21 Mar 2017 in cs.LG

Abstract: Simplified Molecular Input Line Entry System (SMILES) is a single line text representation of a unique molecule. One molecule can however have multiple SMILES strings, which is a reason that canonical SMILES have been defined, which ensures a one to one correspondence between SMILES string and molecule. Here the fact that multiple SMILES represent the same molecule is explored as a technique for data augmentation of a molecular QSAR dataset modeled by a long short term memory (LSTM) cell based neural network. The augmented dataset was 130 times bigger than the original. The network trained with the augmented dataset shows better performance on a test set when compared to a model built with only one canonical SMILES string per molecule. The correlation coefficient R2 on the test set was improved from 0.56 to 0.66 when using SMILES enumeration, and the root mean square error (RMS) likewise fell from 0.62 to 0.55. The technique also works in the prediction phase. By taking the average per molecule of the predictions for the enumerated SMILES a further improvement to a correlation coefficient of 0.68 and a RMS of 0.52 was found.

Summary

The paper proposes using SMILES enumeration, which represents a molecule in multiple ways, as a data augmentation method to significantly expand molecular datasets for training neural networks.
This data augmentation approach substantially improves model performance in QSAR prediction tasks, increasing R^2 from 0.56 to 0.66 and decreasing RMS from 0.62 to 0.55 on the test set.
The findings provide a practical strategy to overcome limited labeled data in cheminformatics, enabling the training of more robust and generalizable QSAR models for applications like drug discovery.

SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules

The paper proposes a data augmentation procedure based on the Simplified Molecular Input Line Entry System (SMILES) format. In QSAR (Quantitative Structure-Activity Relationship) modeling, large labeled datasets are scarce, which limits the practical use of deep learning. The paper explores how SMILES enumeration can augment the training data for neural networks and improve predictive performance.

Summary of Research Methodology and Findings

The core contribution is to exploit the fact that a single molecule can be written as many different valid SMILES strings. Enumerating these strings expands a molecular QSAR dataset by a factor of approximately 130. The augmented dataset is then used to train a neural network based on long short-term memory (LSTM) cells, which are designed for sequential data.
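The enumeration idea can be illustrated with a toy sketch in Python. It is not the paper's implementation: it handles only acyclic molecules with single bonds, described by a hypothetical atom list and bond list, and the function names `random_smiles` and `enumerate_smiles` are assumptions made for illustration. Rings, bond orders, aromaticity, and stereochemistry would need a cheminformatics toolkit.

```python
import random


def random_smiles(atoms, bonds, rng):
    """Write one randomized SMILES string for an acyclic, single-bonded molecule.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) index pairs; the bond graph must be a tree
    rng:   a random.Random instance controlling start atom and branch order
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    def walk(atom, parent):
        # Visit the neighbors (except the one we came from) in random order.
        children = [n for n in neighbors[atom] if n != parent]
        rng.shuffle(children)
        text = atoms[atom]
        # Every child except the last is written as a parenthesized branch.
        for child in children[:-1]:
            text += "(" + walk(child, atom) + ")"
        if children:
            text += walk(children[-1], atom)
        return text

    # A random start atom changes where the string begins.
    start = rng.randrange(len(atoms))
    return walk(start, None)


def enumerate_smiles(atoms, bonds, attempts=100, seed=0):
    """Collect the distinct strings found in a fixed number of random draws."""
    rng = random.Random(seed)
    return sorted({random_smiles(atoms, bonds, rng) for _ in range(attempts)})


# Isopropanol heavy atoms: a central carbon bonded to two carbons and one oxygen.
atoms = ["C", "C", "O", "C"]
bonds = [(0, 1), (1, 2), (1, 3)]
variants = enumerate_smiles(atoms, bonds)
for smiles in variants:
    print(smiles)
```

Every printed string describes the same molecule, so in an augmented QSAR dataset each one would be paired with the same activity label.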

The methodology involves building two QSAR models: one trained only on canonical SMILES, which give a one-to-one correspondence between a molecule and its SMILES string, and another trained on the augmented dataset of enumerated SMILES strings. The canonical model, having seen only one string per molecule, struggles to generalize and predicts poorly when given alternative SMILES strings, which highlights the importance of data diversity.

The paper quantitatively measures the impact of SMILES enumeration on model performance. The results show a clear improvement when using the augmented dataset, with the correlation coefficient $R^2$ on the test set increasing from 0.56 to 0.66, and the root mean square error (RMS) decreasing from 0.62 to 0.55. Additional enhancement was observed by averaging predictions across different SMILES representations, achieving an $R^2$ of 0.68 and an RMS of 0.52.
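The prediction-time averaging described above can be sketched in plain Python. The numbers below are made up for illustration and are not the paper's data. The helper names are hypothetical, and `pearson_r2` assumes that "correlation coefficient $R^2$" means the squared Pearson correlation; the coefficient of determination is another common reading.

```python
from collections import defaultdict
from math import sqrt


def mean_per_molecule(mol_ids, predictions):
    """Average the predictions of all enumerated SMILES of each molecule."""
    grouped = defaultdict(list)
    for mol_id, pred in zip(mol_ids, predictions):
        grouped[mol_id].append(pred)
    return {mol_id: sum(vals) / len(vals) for mol_id, vals in grouped.items()}


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson_r2(y_true, y_pred):
    """Squared Pearson correlation (assumed meaning of R^2 here)."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov * cov / (var_t * var_p)


# Hypothetical labels and model outputs: two enumerated SMILES per molecule.
truth = {"A": 1.0, "B": 2.0, "C": 3.0}
mol_ids = ["A", "A", "B", "B", "C", "C"]
preds = [0.8, 1.2, 2.5, 1.9, 2.6, 3.0]

# Score every SMILES string on its own.
per_smiles_truth = [truth[m] for m in mol_ids]
rmse_single = rmse(per_smiles_truth, preds)

# Score the per-molecule average of the enumerated predictions.
averaged = mean_per_molecule(mol_ids, preds)
molecules = sorted(truth)
rmse_avg = rmse([truth[m] for m in molecules], [averaged[m] for m in molecules])
r2_avg = pearson_r2([truth[m] for m in molecules], [averaged[m] for m in molecules])

print(rmse_single, rmse_avg, r2_avg)
```

In this toy case the averaged predictions have a lower error than the individual ones because the errors of different SMILES strings for the same molecule partly cancel.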

Implications and Future Directions

The implications of these findings are both practical and theoretical. Practically, the approach adds a new data augmentation strategy to cheminformatics, allowing researchers to train more complex neural network models on smaller labeled datasets with less risk of overfitting. The data diversity introduced by SMILES enumeration contributes to more robust and generalizable QSAR models, giving researchers a tool for working with limited data.

Theoretically, the paper raises questions about the role of input diversity in neural network training and its effect on learning molecular representations. The findings suggest future research directions, such as combining SMILES enumeration with other data augmentation techniques to further refine QSAR predictions, or exploring its impact in other machine learning contexts, including transfer learning and unsupervised training.

In conclusion, SMILES enumeration emerges as a compelling avenue for augmenting molecular datasets for neural network applications in QSAR modeling. This methodology not only improves prediction accuracy but also suggests broader applicability across different areas where representing molecular diversity is crucial. As neural networks continue to evolve, such data-driven strategies will likely play a vital role in advancing computational chemistry and drug discovery efforts.
