Variational Inference for Monte Carlo Objectives: A Technical Investigation
This paper by Andriy Mnih and Danilo J. Rezende addresses a key challenge in training deep latent variable models: the high variance of gradient estimators when applied to discrete latent variables in a multi-sample setting. The authors introduce an unbiased, low-variance gradient estimator for importance-sampled objectives, contributing to the broader field of scalable variational inference.
Summary and Key Contributions
The work is motivated by the need for better training methods for models with latent variables, particularly those trained with a variational approach and a learned posterior approximation. Multi-sample objectives have previously been shown to yield better log-likelihood performance and to use model capacity more effectively than single-sample objectives, because they place tighter lower bounds on the marginal log-likelihood.
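To make the objective concrete, here is a minimal numpy sketch of the K-sample importance-weighted estimate these methods optimize (function and variable names are our own, not from the paper):

```python
import numpy as np
from scipy.special import logsumexp

def multi_sample_objective(log_p_xh, log_q_h):
    """Single draw of the K-sample bound estimate
    L_hat = log (1/K) * sum_i p(x, h_i) / q(h_i | x),
    computed from K samples h_i ~ q(h | x).

    log_p_xh: shape (K,), log p(x, h_i) under the generative model.
    log_q_h:  shape (K,), log q(h_i | x) under the proposal.
    E[L_hat] lower-bounds log p(x) and tightens as K grows.
    """
    log_w = log_p_xh - log_q_h            # log importance weights
    K = log_w.shape[0]
    return logsumexp(log_w) - np.log(K)   # numerically stable log-mean-exp
```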
Key Contributions:
- Extension to Discrete Latent Variables: The paper extends multi-sample methods to models with discrete latent variables. The central difficulty is that the reparameterization trick does not apply to discrete variables, forcing score-function (REINFORCE-style) gradient estimators whose high variance slows and destabilizes training.
- Development of VIMCO (Variational Inference for Monte Carlo Objectives): VIMCO is a novel gradient estimator with low-variance per-sample (local) learning signals, avoiding the inefficiency of estimators such as NVIL that assign every sample the same global signal. The estimator requires no additional learned parameters for variance reduction, which simplifies its integration into larger systems (see the sketch after this list).
- Comparison with Existing Methods: The paper compares the new estimator to existing methods, including NVIL and Reweighted Wake-Sleep (RWS). Empirical results suggest that VIMCO optimizes the multi-sample objective as well as or better than these alternatives while exhibiting lower variance in its learning signals.
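To make the per-sample signals concrete, here is a minimal numpy sketch of VIMCO's leave-one-out construction, using the geometric-mean replacement described in the paper (names and structure are our own):

```python
import numpy as np
from scipy.special import logsumexp

def vimco_learning_signals(log_w):
    """Per-sample VIMCO learning signals from K log importance weights.

    log_w: shape (K,), log_w[i] = log p(x, h_i) - log q(h_i | x).
    Returns (L_hat, signals); signals[i] is the scalar that multiplies
    grad log q(h_i | x) in the inference-network gradient.
    """
    K = log_w.shape[0]
    L_hat = logsumexp(log_w) - np.log(K)  # multi-sample objective estimate

    signals = np.empty(K)
    for i in range(K):
        rest = np.delete(log_w, i)
        # Replace sample i's log weight with the mean of the others,
        # i.e. the geometric mean of the other importance weights.
        replaced = np.concatenate([rest, [rest.mean()]])
        # Objective estimate with sample i "imputed away": a baseline
        # that depends only on the other K - 1 samples.
        L_hat_minus_i = logsumexp(replaced) - np.log(K)
        signals[i] = L_hat - L_hat_minus_i
    return L_hat, signals
```

Because the baseline for sample i never depends on h_i itself, subtracting it leaves the gradient estimator unbiased while removing most of the shared-signal variance.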
Detailed Examination of Gradient Estimators
The crux of the work is reformulating gradient estimation for the proposal distribution under Monte Carlo objectives. The authors give an analytical breakdown of why this estimation is difficult: conventional single-signal approaches suffer from high variance and provide no per-sample (local) learning signal to differentiate the contributions of individual samples.
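Concretely, writing $w_i = p(x, h_i)/q(h_i \mid x)$ and $\hat{L} = \log \frac{1}{K} \sum_i w_i$, the gradient with respect to the proposal parameters $\phi$ takes the score-function form (our reconstruction of a standard derivation, not a quotation from the paper):

$$
\nabla_\phi\, \mathbb{E}_{h^{1:K} \sim q}\big[\hat{L}\big]
= \mathbb{E}\Big[\sum_{i=1}^{K}\big(\hat{L} - \tilde{w}_i\big)\,\nabla_\phi \log q(h_i \mid x)\Big],
\qquad
\tilde{w}_i = \frac{w_i}{\sum_{j=1}^{K} w_j}.
$$

The problematic term is the shared scalar $\hat{L}$ multiplying every sample's score: it is the same noisy quantity for all K samples, which is precisely the variance the paper sets out to remove.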
Naive Gradient Estimator:
The analysis starts from a straightforward Monte Carlo (score-function) estimate of the gradient, in which every sample shares the same scalar learning signal. This estimate is then progressively improved: first with variance-reducing baselines similar to NVIL's, and then with local learning signals computed separately for each sample, as sketched below.
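A minimal sketch of this naive signal with an NVIL-style centering (the `baseline` argument stands in for a learned, input-dependent scalar; this is our illustration, not the paper's code):

```python
import numpy as np
from scipy.special import logsumexp

def naive_signals(log_w, baseline=0.0):
    """Naive estimator: one global learning signal shared by all K samples.

    log_w:    shape (K,), log importance weights as above.
    baseline: scalar control variate; in NVIL it comes from a separately
              trained network and must learn to track the objective.
    """
    K = log_w.shape[0]
    L_hat = logsumexp(log_w) - np.log(K)
    # Every sample's grad log q(h_i | x) is scaled by the same centered value.
    return np.full(K, L_hat - baseline)
```

Contrasting this with vimco_learning_signals above shows the key difference: the naive signal cannot tell a helpful sample from an unhelpful one within the same batch of K draws.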
Introduction of Control Variates:
The authors construct baselines for each sample from the remaining K − 1 samples, acting as control variates in the spirit of the importance sampling literature. Because each baseline is independent of the sample it is paired with, variance is reduced without biasing the gradient estimate, preserving training efficiency and robustness.
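The reason such baselines leave the estimator unbiased is the standard score-function identity: for any baseline $b$ that does not depend on the sample $h_i$,

$$
\mathbb{E}_{h_i \sim q(\cdot \mid x)}\big[\, b\, \nabla_\phi \log q(h_i \mid x)\,\big]
= b \sum_{h_i} q(h_i \mid x)\, \nabla_\phi \log q(h_i \mid x)
= b\, \nabla_\phi \sum_{h_i} q(h_i \mid x)
= b\, \nabla_\phi 1 = 0.
$$

VIMCO instantiates this with baselines built entirely from the other K − 1 samples, so the control variate comes for free, without learned parameters.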
Application to Structured Output Prediction
The paper goes beyond theoretical formulation and evaluates the estimator on structured output prediction, a task with high-dimensional, intricately structured output spaces (for example, predicting the bottom half of an MNIST digit from its top half). The authors report notable improvements in negative log-likelihood by applying their estimator to models with discrete latent structure, demonstrating the method's practicality in realistic settings.
Implications and Speculation on AI Development
This work enriches the toolbox available to researchers working with variational inference in latent variable models, particularly in settings involving discrete variables. Beyond the immediate implications for model performance and convergence, the techniques developed here could resolve persistent issues in training other complex generative models, such as those used in natural language processing or computer vision. The elimination of learned variance-reduction parameters also opens avenues for more straightforward integration with scalable systems, possibly ushering in new architectures that rely more broadly on Monte Carlo objective functions.
Conclusion
In conclusion, Mnih and Rezende's contribution through VIMCO offers a potent alternative to existing biased (e.g., RWS) and high-variance (e.g., NVIL) methods for optimizing multi-sample objectives. By addressing the critical bottleneck of gradient estimation variance, the research substantially advances the capabilities of deep generative models, especially within the demanding context of discrete latent structures. As AI continues to evolve, methods like VIMCO are set to play a crucial role in enabling more efficient and powerful model training paradigms.