Application of generative autoencoder in de novo molecular design (1711.07839v1)

Published 21 Nov 2017 in cs.LG and stat.ML

Abstract: A major challenge in computational chemistry is the generation of novel molecular structures with desirable pharmacological and physicochemical properties. In this work, we investigate the potential use of autoencoders, a deep learning methodology, for de novo molecular design. Various generative autoencoders were used to map molecular structures into a continuous latent space and vice versa, and their performance as structure generators was assessed. Our results show that the latent space preserves the chemical similarity principle and thus can be used for the generation of analogue structures. Furthermore, the latent space created by the autoencoders was searched systematically to generate novel compounds with predicted activity against dopamine receptor type 2, and compounds similar to known active compounds not included in the training set were identified.

Authors (5)

Thomas Blaschke
Marcus Olivecrona
Ola Engkvist
Jürgen Bajorath
Hongming Chen

Summary

Application of Generative Autoencoder in De Novo Molecular Design

The paper "Application of Generative Autoencoder in De Novo Molecular Design" presents a study of generative autoencoders (AEs) for the design of novel molecular structures. The researchers target a significant challenge in computational chemistry: the generation of novel molecular entities with desirable pharmacological and physicochemical properties. The paper assesses how well autoencoders, a class of deep learning (DL) models, map molecular structures into a continuous latent space and back, which in turn allows novel compounds to be generated from points in that space.

Methodological Approaches

The research compares several neural network architectures, namely autoencoders (AEs), variational autoencoders (VAEs), and adversarial autoencoders (AAEs), in different configurations to determine their efficacy in molecular design:

Autoencoders (AEs): Neural networks for unsupervised feature extraction, comprising an encoder that compresses the input into a lower-dimensional representation and a decoder that reconstructs the original input from it. Training minimizes the information lost in reconstruction.
Variational Autoencoders (VAEs): VAEs add a probabilistic element to the AE by pushing the encoded latent vectors toward a Gaussian distribution. This regularizes the encoder and discourages it from simply memorizing an explicit mapping for each training example.
Adversarial Autoencoders (AAEs): AAEs add a discriminator trained to distinguish encoded molecules from random samples drawn from a specified prior distribution. Because the prior on the latent vectors can be chosen freely, this allows greater flexibility than the Gaussian assumption in VAEs.
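
The Gaussian regularization described for VAEs can be made concrete with a small sketch. It assumes the common VAE setup of a diagonal Gaussian encoder and a standard normal prior; the summary does not spell out the loss used in the paper, and the function names and the example values of `mu` and `log_var` below are hypothetical.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


rng = random.Random(0)
mu = [0.5, -1.0]       # hypothetical encoder mean for one molecule
log_var = [0.0, -2.0]  # hypothetical encoder log-variance for the same molecule

z = sample_latent(mu, log_var, rng)       # latent vector passed to the decoder
kl = kl_to_standard_normal(mu, log_var)   # penalty added to the reconstruction loss
```

The KL term is zero only when the encoder outputs exactly the prior (`mu = 0`, `log_var = 0`), so minimizing it together with the reconstruction loss pulls the encoded molecules toward the Gaussian prior. An AAE replaces this closed-form penalty with a discriminator, which is why its prior does not need to be Gaussian.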

Key Results

1. Reconstruction Accuracy: The paper finds that models trained with teacher forcing generate significantly more syntactically valid SMILES (a text-based line notation for chemical structures) than models trained without it. Notably, the Uniform AAE architecture achieves the highest percentage of valid SMILES, suggesting that it handles the syntactic rules of SMILES effectively during reconstruction.

2. Latent Space and Chemical Similarity: Results demonstrate that AEs preserve chemical similarity in latent space, so that molecular analogues lie close to one another. This was shown using Celecoxib as a template: the generated structures closely resembled it.

3. Target-activity Guided Generation: A Bayesian optimization strategy was applied to the latent space to guide the generation of molecules with predicted activity against dopamine receptor type 2 (DRD2). This demonstrates the utility of AEs in inverse QSAR settings, where desired biological properties guide molecular design.
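
The teacher forcing mentioned under Reconstruction Accuracy can be illustrated with a minimal sketch. It assumes character-level SMILES tokens and uses a fixed lookup table as a hypothetical stand-in for a trained decoder; the token names `<go>` and `<eos>` and both function names are illustrative and not taken from the paper.

```python
GO, EOS = "<go>", "<eos>"


def teacher_forced_pairs(tokens):
    """Training with teacher forcing: every step is fed the true previous token."""
    inputs = [GO] + tokens
    targets = tokens + [EOS]
    return inputs, targets


def free_running_decode(predict_next, max_len=20):
    """Generation without teacher forcing: every step is fed the model's own last output."""
    out, prev = [], GO
    for _ in range(max_len):
        prev = predict_next(prev)
        if prev == EOS:
            break
        out.append(prev)
    return out


tokens = list("CCO")  # ethanol, one character per token (assumed tokenization)
inputs, targets = teacher_forced_pairs(tokens)

# Hypothetical stand-in for a trained decoder: maps the previous token to the next one.
# A real decoder would also condition on the latent vector and its hidden state.
toy_model = {GO: "C", "C": "O", "O": EOS}
generated = free_running_decode(toy_model.get)
```

With teacher forcing the decoder never sees its own mistakes during training, whereas in free-running generation an early error is fed back into every later step. That difference is one reason the two training regimes can yield different rates of valid SMILES.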

Implications and Future Directions

The use of AEs in drug discovery, as shown in this paper, holds promise for addressing inverse-QSAR problems, which traditionally require mapping optimized descriptor values back to molecular structures. By learning a non-linear mapping from molecular structures to a latent space together with a decoder that maps latent points back to structures, these methods avoid that explicit descriptor back-mapping step.

Future work could explore other AAE architectures with priors beyond the Gaussian and Uniform latent space distributions. This could allow the generation of more chemically diverse compounds while preserving predicted biological activity. Moreover, integrating additional activity prediction models could improve the design of potent, selective drug candidates. The approach described in this paper could thus serve as a foundation for molecular design systems in the pharmaceutical and chemical industries.
