Grammar Variational Autoencoder
The paper "Grammar Variational Autoencoder," authored by Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato, introduces an innovative model to address the challenges of generating valid discrete data by leveraging context-free grammars (CFG). The proposed Grammar Variational Autoencoder (GVAE) aims to ensure that outputs are valid, overcoming substantial limitations of state-of-the-art methods in generating arithmetic expressions and molecular structures.
Introduction and Background
Deep generative models have excelled at representing continuous data, enabling remarkable achievements in areas such as music, image processing, and artwork generation. Extending these models to discrete data such as arithmetic expressions, symbolic expressions, and molecular strings is considerably harder, primarily because text-based representations are brittle: a single wrong character can render the entire output invalid. The paper's key observation is that such data can be described by a context-free grammar, so every valid example corresponds to a parse tree. The proposed GVAE therefore encodes from and decodes to these parse trees, represented as sequences of production rules, so that syntactic validity follows directly from the generation process.
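To make this concrete, here is a minimal sketch of a toy context-free grammar for arithmetic expressions, written with Python's nltk library. The grammar below is illustrative only; the grammar actually used in the paper includes additional operators and functions.

```python
# Toy context-free grammar for arithmetic expressions (illustrative only;
# the paper's grammar is larger, with more operators and functions).
import nltk

grammar = nltk.CFG.fromstring("""
S -> S '+' T | S '*' T | T
T -> '(' S ')' | 'x' | '1' | '2' | '3'
""")

# Every string in the language corresponds to a parse tree, i.e. to a
# sequence of production-rule applications starting from the start symbol S.
for prod in grammar.productions():
    print(prod)
```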
Methods
The GVAE encodes data by transforming each discrete object into the sequence of production rules, defined by a CFG, that generates it, making the syntactic structure explicit in the representation the model sees. Concretely, the parse tree is flattened into a sequence of one-hot vectors, one per applied production rule, which a deep convolutional neural network then maps into a continuous latent space.
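A minimal sketch of the encoder's front end, assuming the toy grammar above; the function names, the maximum sequence length, and the zero padding are illustrative choices (the paper uses a dedicated padding rule), and the convolutional network that consumes the one-hot matrix is omitted.

```python
import numpy as np
import nltk

# Reuses the toy `grammar` defined in the sketch above (illustrative only).
parser = nltk.ChartParser(grammar)

def production_sequence(tokens):
    """Parse a token list and return the production rules of its parse tree,
    in pre-order (top-down, left-to-right) traversal order."""
    tree = next(parser.parse(tokens))
    return tree.productions()

def one_hot(prod_seq, all_prods, max_len):
    """Map a production sequence to a (max_len, num_rules) one-hot matrix;
    unused trailing rows are left as zeros here for simplicity."""
    index = {p: i for i, p in enumerate(all_prods)}
    X = np.zeros((max_len, len(all_prods)), dtype=np.float32)
    for t, p in enumerate(prod_seq):
        X[t, index[p]] = 1.0
    return X

seq = production_sequence(['x', '+', '(', '1', '*', '2', ')'])
X = one_hot(seq, grammar.productions(), max_len=15)
print(X.shape)  # (15, 8): the matrix a CNN encoder would map to the latent space
```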
For decoding, an RNN maps a continuous latent vector to a sequence of unnormalized log-probability vectors over production rules. A stack-driven parsing mechanism then masks these logits so that, at each step, only rules whose left-hand side matches the non-terminal currently on top of the stack can be selected, preserving syntactic correctness throughout generation.
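The decoding constraint can be sketched as follows, assuming the toy grammar above and an RNN that has already produced a matrix of logits (one row per time step, one column per production rule); the function is a simplified stand-in for the paper's masked decoder.

```python
import numpy as np
import nltk

def grammar_constrained_decode(logits, grammar):
    """Convert a (max_len, num_rules) matrix of unnormalized logits into a
    sequence of production rules derivable from the grammar's start symbol."""
    prods = grammar.productions()
    stack = [grammar.start()]            # LIFO stack of non-terminals still to expand
    chosen = []
    for t in range(logits.shape[0]):
        if not stack:
            break                        # derivation finished before max_len
        lhs = stack.pop()
        # Mask out every rule whose left-hand side is not the current non-terminal.
        mask = np.array([p.lhs() == lhs for p in prods])
        masked = np.where(mask, logits[t], -np.inf)
        probs = np.exp(masked - masked.max())
        probs /= probs.sum()
        rule = prods[np.random.choice(len(prods), p=probs)]
        chosen.append(rule)
        # Push right-hand-side non-terminals in reverse so the leftmost is expanded next.
        for sym in reversed(rule.rhs()):
            if isinstance(sym, nltk.grammar.Nonterminal):
                stack.append(sym)
    return chosen
```

Whatever logits the RNN proposes, the mask ensures that only productions consistent with the top of the stack can be sampled, which is where the syntactic guarantee comes from.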
Illustrative Example
As an illustration, encoding parses a SMILES string into a sequence of production rules, one-hot encodes them, and maps the result to a latent vector with the CNN encoder. Decoding reconstructs a syntactically valid string by sampling production rules constrained by the CFG, with a last-in-first-out stack tracking which non-terminals remain to be expanded.
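A toy version of the same pipeline for molecules might look like the following; the grammar here is a heavily simplified, hypothetical stand-in for the paper's full SMILES grammar, which also covers rings, branches, bonds, and charges.

```python
import nltk

# Hypothetical, heavily simplified SMILES-like grammar (illustrative only).
smiles_like = nltk.CFG.fromstring("""
smiles -> chain
chain  -> atom | atom chain
atom   -> 'C' | 'N' | 'O' | 'F'
""")

# Encode the SMILES string "CCO" (ethanol) as its production-rule sequence;
# these rules are what would be one-hot encoded and fed to the CNN encoder.
tree = next(nltk.ChartParser(smiles_like).parse(list("CCO")))
for prod in tree.productions():
    print(prod)
```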
Experiments
The GVAE model was tested on two discrete generation problems: generating arithmetic expressions and generating molecular structures (SMILES strings). Its latent space representations were smoother and better preserved similarity between neighboring points than those of character-based VAEs.
- Arithmetic Expressions: GVAE interpolated smoothly between expressions, generating valid intermediate outputs, and its smooth, consistent latent space supported effective optimization on a symbolic regression task.
- Molecules: Using a dataset drawn from the ZINC database, GVAE significantly outperformed character-based VAEs in generating valid molecular structures. Furthermore, the GVAE latent representation enabled efficient optimization of drug-relevant properties using Bayesian optimization, as sketched after this list.
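As a rough illustration of property optimization in the latent space (not the paper's exact procedure, which follows a Gaussian-process-based Bayesian optimization over latent codes), here is a simplified sketch using scikit-learn; `encode`, `decode`, and `score` are hypothetical placeholders for the trained GVAE encoder, the grammar-constrained decoder, and the drug-property objective.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(mu, sigma, best):
    """Standard expected-improvement acquisition for maximization."""
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best) / sigma
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

def optimize_in_latent_space(encode, decode, score, seed_molecules, n_iter=20):
    """Fit a GP from latent vectors to property scores, repeatedly pick the
    random latent probe with the highest expected improvement, decode and
    score it, and fold the result back into the training set."""
    Z = np.vstack([encode(m) for m in seed_molecules])   # latent points
    y = np.array([score(m) for m in seed_molecules])     # property values
    best_mol = seed_molecules[int(np.argmax(y))]
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(normalize_y=True).fit(Z, y)
        probes = np.random.randn(1000, Z.shape[1])        # random latent candidates
        mu, sigma = gp.predict(probes, return_std=True)
        z_next = probes[np.argmax(expected_improvement(mu, sigma, y.max()))]
        mol = decode(z_next)          # syntactically valid, but chemistry is not guaranteed
        y_next = score(mol)
        if y_next > y.max():
            best_mol = mol
        Z = np.vstack([Z, z_next])
        y = np.append(y, y_next)
    return best_mol
```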
Results
Quantitative results underline the robustness of GVAE:
- The fraction of valid sequences generated by GVAE was notably higher (approximately 99%) than that of character VAEs.
- Bayesian optimization performed on the GVAE latent space consistently found solutions with better scores.
- Both molecule reconstruction accuracy and the fraction of valid outputs obtained when sampling from the prior improved significantly under GVAE.
Together, these results show that GVAE sharply reduces the proportion of invalid outputs while promoting the generation of syntactically correct and semantically meaningful sequences.
Implications and Future Directions
The implications of this research span multiple facets:
- Practical Applications: In fields such as drug discovery, GVAE offers a robust way to design molecules with desirable properties efficiently.
- Theoretical Advancements: This model exemplifies how incorporating explicitly defined grammatical structures into deep generative models shifts the focus from learning syntactic rules to learning semantically meaningful representations.
Future research can explore extending GVAE applications to broader discrete data domains and enhancing the semantic correctness of generated sequences by incorporating additional domain-specific constraints. Moreover, integrating probabilistic CFGs or context-sensitive grammars could further bolster model performance in more complex data generation tasks.
This paper sets a new direction in bridging the gap between syntactic correctness and semantic relevance in discrete generative models, marking a significant stride toward more reliable and interpretable AI models for diverse data types.