Automatic chemical design using a data-driven continuous representation of molecules (1610.02415v3)

Published 7 Oct 2016 in cs.LG and physics.chem-ph

Abstract: We report a method to convert discrete representations of molecules to and from a multidimensional continuous representation. This model allows us to generate new molecules for efficient exploration and optimization through open-ended spaces of chemical compounds. A deep neural network was trained on hundreds of thousands of existing chemical structures to construct three coupled functions: an encoder, a decoder and a predictor. The encoder converts the discrete representation of a molecule into a real-valued continuous vector, and the decoder converts these continuous vectors back to discrete molecular representations. The predictor estimates chemical properties from the latent continuous vector representation of the molecule. Continuous representations allow us to automatically generate novel chemical structures by performing simple operations in the latent space, such as decoding random vectors, perturbing known chemical structures, or interpolating between molecules. Continuous representations also allow the use of powerful gradient-based optimization to efficiently guide the search for optimized functional compounds. We demonstrate our method in the domain of drug-like molecules and also in the set of molecules with fewer than nine heavy atoms.

Authors (10)

Jennifer N. Wei (6 papers)
David Duvenaud (65 papers)
José Miguel Hernández-Lobato (151 papers)
Dennis Sheberla (1 paper)
Jorge Aguilera-Iparraguirre (2 papers)
Timothy D. Hirzel (1 paper)
Ryan P. Adams (74 papers)
Alán Aspuru-Guzik (227 papers)
Rafael Gómez-Bombarelli (34 papers)
Benjamín Sánchez-Lengeling (1 paper)

Citations (2,733)

Summary

Overview of "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules"

Abstract and Introduction

The paper "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules" addresses the challenge of searching large, open-ended spaces of chemical compounds. The authors propose converting discrete molecular representations to and from a continuous multidimensional representation, which makes exploration and optimization of chemical space more tractable. A deep neural network trained on hundreds of thousands of existing chemical structures learns this representation and enables automatic generation of novel chemical structures. A property predictor operating on the same representation guides the search toward optimized functional compounds.

Methodology

The authors use a deep neural network with three coupled components: an encoder, a decoder, and a predictor. The encoder converts discrete molecular representations into continuous vectors, the decoder reverses this process, and the predictor estimates chemical properties from the continuous vectors. With this framework, new molecules can be generated by decoding random vectors, perturbing known structures, or interpolating between molecules.
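The three latent-space operations named above can be illustrated with a minimal sketch. It assumes latent vectors are plain lists of floats and that random vectors are drawn from a standard normal distribution. The function names are hypothetical, the interpolation is a simple straight line, and a trained decoder (not shown) would still be needed to turn each vector back into a molecule.

```python
import random


def sample_latent(dim, rng):
    """Draw a random latent vector (assumed standard normal prior)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def perturb(z, scale, rng):
    """Add Gaussian noise of the given scale to a known molecule's latent vector."""
    return [zi + rng.gauss(0.0, scale) for zi in z]


def interpolate(z_a, z_b, steps):
    """Return `steps` evenly spaced points on the straight line from z_a to z_b."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    if len(z_a) != len(z_b):
        raise ValueError("latent vectors must have the same dimension")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # t runs from 0.0 to 1.0 inclusive
        path.append([(1 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return path
```

Each point returned by `interpolate` would be passed to the decoder to obtain a candidate molecule lying between the two endpoints.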

The sequential approach is detailed as follows:

Encoding and Decoding: Utilizing a SMILES representation, molecules are encoded into continuous vectors. The decoder then converts these vectors back to discrete molecular forms.
Property Prediction: By incorporating a predictor into the network, the model estimates the properties of the molecules from their latent continuous representations.
Optimization and Generation: Leveraging gradient-based optimization in the continuous space facilitates an efficient search for molecules with desirable properties.
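As an illustration of the first step, the sketch below shows one common way to turn a SMILES string into a fixed-size numeric input: character-level one-hot encoding with padding. The vocabulary construction, the space padding character, and the function names are assumptions for this example, not details taken from the summary. Treating each character as a token is also a simplification, since two-letter elements such as Cl are split. The `one_hot_decode` helper only inverts the encoding and is not the learned decoder.

```python
def build_vocab(smiles_list, pad=" "):
    """Map every character seen in the data (plus the pad symbol) to an index."""
    chars = sorted({c for s in smiles_list for c in s} | {pad})
    return {c: i for i, c in enumerate(chars)}


def one_hot_encode(smiles, vocab, max_len, pad=" "):
    """Return a max_len x len(vocab) matrix with a single 1 per row."""
    if len(smiles) > max_len:
        raise ValueError("SMILES string is longer than max_len")
    padded = smiles.ljust(max_len, pad)
    return [[1 if vocab[c] == j else 0 for j in range(len(vocab))]
            for c in padded]


def one_hot_decode(matrix, vocab, pad=" "):
    """Invert one_hot_encode by taking the argmax of each row."""
    inverse = {i: c for c, i in vocab.items()}
    chars = [inverse[max(range(len(row)), key=row.__getitem__)]
             for row in matrix]
    return "".join(chars).rstrip(pad)
```

For example, `one_hot_encode("CCO", vocab, 10)` yields a 10-row matrix in which the last seven rows encode padding.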

Experimental Results

The paper presents experimental results on two datasets: ZINC (drug-like molecules) and QM9 (molecules with fewer than nine heavy atoms). The authors analyze the reconstruction fidelity and structural validity of the generated molecules, reporting a high percentage of valid outputs. The latent space was also shown to retain correlations with relevant chemical properties, so that moving through the latent space corresponds to a meaningful exploration of chemical space.

The results show that the model can generate chemically sound, novel molecules whose statistics reflect those of the training data, which addresses the concern that decoded outputs would be largely invalid. For instance, in assessments of logP (water-octanol partition coefficient) and SAS (synthetic accessibility score), molecules from the continuous model aligned closely with the original chemical space, and the model predicted these properties effectively.

Advantages of Continuous Representations

The continuous molecular representation method introduced several notable advantages:

Automated Generation: The approach eschews the need for hand-specified mutation rules, allowing the automatic generation of new compounds.
Gradient-Based Search: The differentiable nature of the model combined with gradient-based methods facilitates significant improvements in chemical space exploration efficiency.
Implicit Large Library Creation: Large chemical databases can be used to learn the representation even when most compounds carry no property labels, which reduces the constraints imposed by label scarcity.
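The gradient-based search advantage can be made concrete with a small sketch. It assumes a toy, hand-written property function with a known optimum in place of the learned predictor, so that the gradient is available in closed form. The names `toy_property`, `toy_property_grad`, and `gradient_ascent`, along with the step size and step count, are illustrative choices only.

```python
def toy_property(z, z_star):
    """Stand-in predictor: equals 0 at z_star and is lower everywhere else."""
    return -sum((zi - si) ** 2 for zi, si in zip(z, z_star))


def toy_property_grad(z, z_star):
    """Analytic gradient of toy_property with respect to z."""
    return [-2.0 * (zi - si) for zi, si in zip(z, z_star)]


def gradient_ascent(z0, grad_fn, step_size=0.1, n_steps=50):
    """Follow the gradient uphill from z0 and return every visited point."""
    z = list(z0)
    path = [z]
    for _ in range(n_steps):
        g = grad_fn(z)
        z = [zi + step_size * gi for zi, gi in zip(z, g)]
        path.append(z)
    return path
```

In the full method, each point along the path would be decoded back to a molecule, and the gradient would come from a differentiable learned model rather than from a closed-form expression.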

Implications and Future Research Directions

This approach represents a significant shift in computational molecular design. Its practical implications span several domains, including drug discovery and materials science, where it could reduce the cost and time of experimental compound synthesis and evaluation.

Future research can build on this framework to enhance prediction accuracy and model robustness. For instance, transitioning to graph-based autoencoders could bypass limitations associated with string-based representations. Moreover, enhancements in generating valid compounds only, via architectural innovations or smarter constraints, are areas ripe for exploration. Integrating adversarial networks or explicitly defined grammars for molecular validation represents another promising direction, potentially augmenting the model's capacity for producing chemically and synthetically viable compounds.

Conclusion

The paper proposes a framework for chemical design that uses continuous representations to navigate chemical space more effectively. By coupling a generative model with gradient-based optimization and property prediction, the authors present a versatile tool for molecular discovery. The reported results and the scalability of the approach point to its potential impact on computational chemistry and related domains. Future advances could extend its applicability and bring fully autonomous chemical design systems closer.
