
Junction Tree Variational Autoencoder for Molecular Graph Generation (1802.04364v4)

Published 12 Feb 2018 in cs.LG, cs.NE, and stat.ML

Abstract: We seek to automate the design of molecules based on specific chemical properties. In computational terms, this task involves continuous embedding and generation of molecular graphs. Our primary contribution is the direct realization of molecular graphs, a task previously approached by generating linear SMILES strings instead of graphs. Our junction tree variational autoencoder generates molecular graphs in two phases, by first generating a tree-structured scaffold over chemical substructures, and then combining them into a molecule with a graph message passing network. This approach allows us to incrementally expand molecules while maintaining chemical validity at every step. We evaluate our model on multiple tasks ranging from molecular generation to optimization. Across these tasks, our model outperforms previous state-of-the-art baselines by a significant margin.

Junction Tree Variational Autoencoder for Molecular Graph Generation

In the field of automated drug discovery, designing molecules that satisfy specific chemical properties is a task of paramount importance. The paper "Junction Tree Variational Autoencoder for Molecular Graph Generation" by Wengong Jin, Regina Barzilay, and Tommi Jaakkola aims to generate molecular graphs directly, preserving chemical validity at every step of generation. To this end, the authors introduce the Junction Tree Variational Autoencoder (JT-VAE), which departs from prior methods that generated linear SMILES strings.

Introduction and Motivation

The core challenge addressed in this paper is automating the design and optimization of drug-like molecules, traditionally a manual and time-consuming process. From a computational perspective, the problem is broken down into two tasks:

  1. Encoding molecules into continuous representations suitable for property prediction and optimization.
  2. Decoding the optimized continuous representation back into a valid molecular graph.

Existing SMILES-based approaches, such as those by Gómez-Bombarelli et al. (2016) and Kusner et al. (2017), are limited by the inherently discontinuous nature of the SMILES representation and the difficulty of ensuring chemical validity throughout generation. The proposed JT-VAE adopts a graph-centric approach, offering a more natural framework for capturing molecular similarity and ensuring validity.

Methodology

The JT-VAE model generates molecular graphs in two distinct phases:

  1. Junction Tree Generation: Molecules are first represented as junction trees, which are tree-structured scaffolds over chemical substructures. These substructures are valid chemical components extracted from the training set.
  2. Graph Assembly: The substructures are then combined into a full molecule using a graph message passing network.

Junction Tree Representation

A molecule is decomposed into subgraphs using a tree decomposition approach tailored for molecules. Specifically, the decomposition yields a junction tree where each node corresponds to a chemical substructure. This representation is effective because it ensures intermediate structures during generation are chemically feasible.
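The decomposition can be sketched on a toy example: each ring and each bond not belonging to a ring becomes one cluster, and clusters that share an atom become neighbours in the junction tree. (When shared atoms create cycles among clusters, the algorithm keeps a spanning tree; that step is omitted here.) The following is a minimal pure-Python illustration with a hand-coded toy molecule rather than a real chemistry toolkit:

```python
from itertools import combinations

# Toy molecular graph: atoms 0-4, a three-membered ring (0, 1, 2)
# with a two-bond chain attached at atom 2.
ring_clusters = [frozenset({0, 1, 2})]                # rings (e.g. from SSSR)
chain_bonds = [frozenset({2, 3}), frozenset({3, 4})]  # bonds outside any ring

# JT-VAE-style clusters: each ring and each non-ring bond is one tree node.
clusters = ring_clusters + chain_bonds

# Connect two clusters in the junction tree iff they share an atom.
tree_edges = [
    (i, j)
    for i, j in combinations(range(len(clusters)), 2)
    if clusters[i] & clusters[j]
]
print(tree_edges)  # [(0, 1), (1, 2)]: ring - bond(2,3) - bond(3,4)
```

Because every cluster is itself a valid chemical fragment, any partial assembly of the tree corresponds to a chemically feasible intermediate.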

Encoding and Decoding

The JT-VAE extends the traditional VAE framework by introducing two-part latent representations:

  • The tree encoder $q(z_T \mid T)$ captures the coarse arrangement of the substructures.
  • The graph encoder $q(z_G \mid G)$ captures the fine-grained connectivity within the molecular graph.

Once encoded, the latent representations are decoded through:

  • A tree decoder $p(T \mid z_T)$ that reconstructs the junction tree from its latent embedding.
  • A graph decoder $p(G \mid T, z_G)$ that predicts the connectivity between substructures to realize the complete molecule.

The decoding process ensures that the chemical structure remains valid by assembling molecules piece by piece rather than atom by atom.
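The two-part latent code follows the standard VAE reparameterization trick, with one code for the tree and one for the graph. The dimensions and encoder outputs below are illustrative placeholders, not the paper's trained networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Hypothetical encoder outputs for one molecule: a 28-dim tree code
# and a 28-dim graph code (placeholder means and log-variances).
mu_tree, log_var_tree = np.zeros(28), np.zeros(28)
mu_graph, log_var_graph = np.zeros(28), np.zeros(28)

z_tree = reparameterize(mu_tree, log_var_tree, rng)
z_graph = reparameterize(mu_graph, log_var_graph, rng)

# The decoder first maps z_tree to a junction tree, then uses z_graph
# to score candidate attachments between neighbouring clusters.
z = np.concatenate([z_tree, z_graph])
print(z.shape)  # (56,)
```

Splitting the latent code this way lets the model commit to a scaffold before resolving the exact atom-level connectivity.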

Experimental Evaluation

JT-VAE is evaluated on several tasks, demonstrating its superiority over existing methods.

Reconstruction and Validity

The JT-VAE achieves high reconstruction accuracy while ensuring 100% chemical validity of generated molecules, outperforming SMILES-based and atom-by-atom graph generation methods. The prior validity of molecules sampled from the latent space was also 100%.

Bayesian Optimization

JT-VAE is tested on the task of discovering novel molecules with optimized properties, e.g., the octanol-water partition coefficient (logP) penalized by synthetic accessibility and the presence of long cycles. The model significantly outperforms SMILES-based methods, finding molecules with much better property scores.
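The idea of optimizing a property in the continuous latent space can be illustrated with a toy loop. The paper trains a sparse Gaussian process on latent codes and maximizes expected improvement; the sketch below replaces both the objective (the real penalized logP is computed on the decoded molecule) and the Bayesian-optimization machinery with simple stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)

def penalized_logp_surrogate(z):
    """Stand-in for the real objective (logP minus synthetic-accessibility
    and long-cycle penalties on the decoded molecule); a smooth toy
    function so the loop below is runnable."""
    return float(-np.sum((z - 0.5) ** 2))

# Random-restart hill climb in the 56-dim latent space: perturb the best
# code found so far and keep candidates that score higher.
z_best = np.zeros(56)
best = penalized_logp_surrogate(z_best)
for _ in range(200):
    z_cand = z_best + 0.1 * rng.standard_normal(56)
    score = penalized_logp_surrogate(z_cand)
    if score > best:
        z_best, best = z_cand, score
print(best)
```

The key point is that the search happens entirely in the continuous space; each candidate code is decoded back into a guaranteed-valid molecule for scoring.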

Constrained Molecule Optimization

In a more realistic setup, where the goal is to modify existing molecules to improve specific properties while maintaining structural similarity, JT-VAE shows promising results. The model smoothly adjusts the molecular structures within the latent space to yield improved chemical properties.
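The constrained setup can be sketched the same way: accept a latent-space move only if the candidate both improves the property and stays above a similarity threshold to the starting molecule. The property and similarity functions below are stand-ins (the paper decodes each candidate and measures Tanimoto similarity on molecular fingerprints):

```python
import numpy as np

rng = np.random.default_rng(2)

def property_surrogate(z):
    """Stand-in for the property score of the decoded molecule."""
    return float(np.sum(z))

def similarity_surrogate(z, z_ref):
    """Stand-in for Tanimoto similarity to the starting molecule;
    here a distance-based proxy in (0, 1]."""
    return float(np.exp(-np.linalg.norm(z - z_ref)))

z_start = np.zeros(56)
delta = 0.4  # similarity threshold for the constrained search
z = z_start.copy()
for _ in range(100):
    z_cand = z + 0.05 * rng.standard_normal(56)
    if (similarity_surrogate(z_cand, z_start) >= delta
            and property_surrogate(z_cand) > property_surrogate(z)):
        z = z_cand
print(similarity_surrogate(z, z_start) >= delta)  # True: constraint maintained
```

By construction the loop never accepts a move that violates the similarity constraint, mirroring the requirement that the optimized molecule stay close to the original.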

Implications and Future Work

The JT-VAE presents several implications for both practical applications in drug discovery and theoretical development in graph-based generative models:

  1. Chemical Validity: By generating molecular structures in a coarse-to-fine manner and using valid substructures, JT-VAE ensures chemical feasibility at every generation step.
  2. Optimized Molecule Discovery: The JT-VAE's ability to support efficient and effective optimization of molecular properties can significantly speed up the drug discovery process.
  3. Latent Space Representations: The smooth latent space transformations facilitated by JT-VAE can enhance subsequent optimization and property prediction tasks.

Conclusion

The novel JT-VAE model for molecular graph generation addresses the critical limitations of SMILES-based approaches, offering a robust framework for designing valid and optimized molecular structures. Future work could extend the applicability of JT-VAE to other graph domains and explore enhancements for even larger and more complex molecular datasets.

Authors (3)
  1. Wengong Jin (25 papers)
  2. Regina Barzilay (106 papers)
  3. Tommi Jaakkola (115 papers)
Citations (1,253)