- The paper proposes Neural Variational Inference and Learning (NVIL), a method that uses a feedforward inference network to draw exact samples from the variational posterior, making it practical to train directed belief networks.
- It introduces model-independent variance reduction techniques, namely centering the learning signal, input-dependent baselines, and variance normalization, to tame the variance of the naive gradient estimator.
- NVIL applies to both discrete and continuous latent variables and scales to large datasets, achieving strong results on MNIST and Reuters RCV1.
Neural Variational Inference and Learning in Belief Networks
The paper "Neural Variational Inference and Learning in Belief Networks," authored by Andriy Mnih and Karol Gregor, addresses the challenging problem of training highly expressive directed latent variable models such as sigmoid belief networks (SBNs). These models have traditionally been difficult to train on large datasets due to the intractability of exact inference and the inefficiency of existing approximate inference methods. The authors propose a new method, Neural Variational Inference and Learning (NVIL), which leverages a feedforward network for efficient exact sampling from the variational posterior. The model and inference network are trained jointly by maximizing a variational lower bound on the log-likelihood.
Key Contributions
- Inference Network Utilization: NVIL employs a feedforward network to perform variational inference, allowing exact samples to be drawn from the variational posterior in a single forward pass. This network is trained jointly with the model by optimizing a variational lower bound on the log-likelihood.
- Variance Reduction Techniques: To address the high variance of the naive gradient estimator for the inference network, the authors introduce several model-independent variance reduction techniques: centering the learning signal, input-dependent baselines, and variance normalization (a schematic implementation follows this list).
- Generalization Across Latent Variables: Unlike previous approaches limited to discrete or continuous latent variables, NVIL is capable of handling both types, as well as variational posteriors with complex dependency structures within layers.
- Scalability and Efficiency: NVIL does not require storing the latent variables for each observation, making it memory efficient and suitable for pure online learning settings. Each forward pass generates an independent exact sample, avoiding the mixing issues common in MCMC methods.
- Performance Evaluation: The authors demonstrate that NVIL outperforms the wake-sleep algorithm on various tasks, including achieving state-of-the-art results on the Reuters RCV1 dataset.
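To make the estimator and the three variance reduction techniques concrete, below is a minimal PyTorch sketch of one NVIL update for a single-layer model with binary latents. The architectures, layer sizes, and hyperparameters are illustrative placeholders, not the configuration used in the paper.

```python
# Minimal sketch of one NVIL update with all three variance reduction
# techniques (centring, input-dependent baseline, variance normalization).
# Architectures and hyperparameters are illustrative, not the paper's.
import torch
import torch.nn as nn

n_vis, n_hid = 784, 200

prior_logits = nn.Parameter(torch.zeros(n_hid))   # P(h): factorial Bernoulli prior
decoder = nn.Linear(n_hid, n_vis)                  # P(x | h): sigmoid likelihood
encoder = nn.Linear(n_vis, n_hid)                  # Q(h | x): inference network
baseline_net = nn.Sequential(                      # C_psi(x): input-dependent baseline
    nn.Linear(n_vis, 100), nn.Tanh(), nn.Linear(100, 1))

params = [prior_logits, *decoder.parameters(),
          *encoder.parameters(), *baseline_net.parameters()]
opt = torch.optim.SGD(params, lr=1e-3)

c, v, alpha = 0.0, 1.0, 0.8   # running mean / variance of the signal, decay rate

def nvil_step(x):
    global c, v
    # One forward pass through the inference network gives an exact sample h ~ Q(h | x).
    q = torch.distributions.Bernoulli(logits=encoder(x))
    h = q.sample()

    # Learning signal l(x, h) = log P(x, h) - log Q(h | x).
    log_ph = torch.distributions.Bernoulli(logits=prior_logits).log_prob(h).sum(-1)
    log_px_h = torch.distributions.Bernoulli(logits=decoder(h)).log_prob(x).sum(-1)
    log_q = q.log_prob(h).sum(-1)
    signal = log_ph + log_px_h - log_q

    # Centre with the running mean and the baseline, then divide by the running
    # standard deviation (only ever scaling the signal down, never up).
    b = baseline_net(x).squeeze(-1)
    centred = (signal - c - b).detach()
    scaled = centred / max(v ** 0.5, 1.0)

    # One surrogate loss whose gradients reproduce the three updates:
    #   model:     ascend  E[grad log P(x, h)]
    #   inference: ascend  E[scaled * grad log Q(h | x)]
    #   baseline:  descend E[(l - c - C_psi(x))^2]
    loss = -(log_ph + log_px_h) - scaled * log_q + (signal.detach() - c - b) ** 2
    opt.zero_grad()
    loss.mean().backward()
    opt.step()

    # Update the running estimates used for centring and normalization.
    resid = (signal - b).detach()
    c = alpha * c + (1 - alpha) * resid.mean().item()
    v = alpha * v + (1 - alpha) * resid.var().item()
    return signal.mean().item()   # Monte Carlo estimate of the bound

x = torch.bernoulli(torch.full((32, n_vis), 0.5))   # toy batch of binary "images"
print(nvil_step(x))
```

The surrogate loss is constructed so that a single backward pass yields the model gradient, the centred and normalized inference-network gradient, and the baseline regression gradient at once.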
Experimental Results
MNIST Dataset
The authors train single-layer and multi-layer SBNs on the MNIST dataset. Results show clear gains when NVIL combines all three proposed variance reduction techniques. For instance, NVIL-trained SBNs reach better (higher) variational lower bounds on the test log-likelihood than their wake-sleep-trained counterparts. Using networks with autoregressive connections within layers improves performance further.
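For reference, sigmoid belief networks are layered models of binary units in which each unit is Bernoulli with a mean given by a logistic function of the layer above (generic notation, not quoted from the paper):

```latex
P(x, h^{1}, \ldots, h^{L})
  = P(h^{L}) \prod_{l=1}^{L-1} P(h^{l} \mid h^{l+1}) \, P(x \mid h^{1}),
\qquad
P\big(h^{l}_{i} = 1 \mid h^{l+1}\big) = \sigma\big(W^{l}_{i} h^{l+1} + b^{l}_{i}\big),
```

with an analogous logistic form for P(x | h¹), where σ denotes the logistic sigmoid.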
Document Modelling
NVIL is applied to document modelling tasks on the 20 Newsgroups and Reuters RCV1 datasets. The models trained with NVIL achieve competitive, and in some cases superior, test set perplexity scores compared to previously established methods like LDA and Replicated Softmax. An fDARN model with 200 latent variables sets a new benchmark on the RCV1 dataset.
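Perplexity in this line of work is usually reported per word, with the intractable log-probability of each document replaced by its variational lower bound; a standard form (not quoted from the paper) is:

```latex
\text{perplexity}
  = \exp\!\left( -\frac{1}{D} \sum_{d=1}^{D} \frac{1}{N_d} \log p(x_d) \right),
```

where D is the number of test documents and N_d is the length of document d.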
Practical and Theoretical Implications
Practical
NVIL offers a flexible and scalable solution for training directed latent variable models, making it highly applicable to real-world datasets. Its ability to handle both online and offline learning settings, coupled with efficient memory usage, positions it as a versatile tool for various machine learning tasks, particularly those involving large-scale and complex data structures.
Theoretical
The introduction of NVIL advances the theoretical understanding of training directed graphical models using variational approaches. The variance reduction techniques are particularly noteworthy, providing a robust framework for addressing the high variance of gradient estimators in variational inference. This paves the way for further development of more expressive models, potentially incorporating sophisticated architectures with nonlinear dependencies.
Future Directions
Future research could explore several promising directions:
- Non-linear Model Architectures: Investigating more expressive architectural variants with non-linearities between layers of stochastic variables.
- Continuous Latent Variables: Extending NVIL to models with continuous latent variables, potentially combining it with reparameterization-based estimators such as those in Stochastic Gradient Variational Bayes (SGVB); a toy comparison of the two estimator styles follows this list.
- Conditional Models: Applying NVIL to conditional latent variable models, enhancing its efficacy in contexts requiring conditional distribution modeling given certain observations.
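As an illustration of the SGVB-style direction mentioned above, the snippet below contrasts the score-function gradient NVIL relies on with a reparameterized gradient for a single Gaussian latent; the objective `f` and the parameters are hypothetical placeholders.

```python
# Toy contrast between a score-function (NVIL-style) gradient and a
# reparameterized (SGVB-style) gradient for a single Gaussian latent variable.
import torch

mu = torch.tensor(0.5, requires_grad=True)
log_std = torch.tensor(0.0, requires_grad=True)

def f(z):                      # stand-in for log P(x, z) - log Q(z | x)
    return -(z - 2.0) ** 2

# Score-function estimator: f(z) * grad log Q(z); also works for discrete z.
q = torch.distributions.Normal(mu, log_std.exp())
z = q.sample()
(f(z).detach() * q.log_prob(z)).backward()
print("score-function grads:", mu.grad, log_std.grad)

mu.grad = None
log_std.grad = None

# Reparameterized estimator: z = mu + std * eps lets the gradient flow
# through z itself, typically giving lower variance for continuous z.
eps = torch.randn(())
z = mu + log_std.exp() * eps
f(z).backward()
print("reparameterized grads:", mu.grad, log_std.grad)
```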
Conclusion
The Neural Variational Inference and Learning (NVIL) method proposed by Mnih and Gregor represents a significant advance in training intractable directed latent variable models. By jointly optimizing the model and inference network against a single variational lower bound on the log-likelihood, and by applying effective variance reduction techniques, NVIL outperforms previous methods such as the wake-sleep algorithm. The empirical results on datasets such as MNIST and Reuters RCV1 underscore its potential for both theoretical exploration and practical application in machine learning.