- The paper proposes a novel cyclical SG-MCMC method that alternates between exploration and sampling phases to efficiently traverse multimodal posterior distributions.
- It provides rigorous non-asymptotic convergence guarantees, bounding both the bias of sample averages over test functions and the distance to the target posterior in Wasserstein distance, strengthening its theoretical foundation.
- Empirical evaluations on tasks such as CIFAR-10/100 and ImageNet show lower classification errors and improved uncertainty estimates compared to traditional SG-MCMC and stochastic optimization baselines.
Cyclical Stochastic Gradient MCMC for Bayesian Deep Learning
The paper introduces Cyclical Stochastic Gradient Markov Chain Monte Carlo (cSG-MCMC), a method aimed at enhancing the sampling efficiency and scalability of Bayesian inference in deep neural networks. Traditional MCMC methods have been largely underutilized in deep learning, primarily because of the computational challenges posed by the high-dimensional and multimodal posteriors over neural network weights. cSG-MCMC addresses this challenge with a cyclical stepsize schedule that alternates between an exploration stage and a sampling stage, facilitating effective traversal and characterization of multimodal distributions.
Methodology
The core innovation in cSG-MCMC is its cyclical stepsize schedule, which lets the sampler adjust its behavior to the varying local geometry of the parameter space. The stepsize restarts at a large value at the beginning of each cycle, encouraging exploration, and then decays within the cycle to refine sampling around the modes discovered. This cyclical schedule draws inspiration from cyclical and warm-restart learning-rate schedules in the optimization literature, yet here it is tailored to the stochastic gradient MCMC framework.
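As a minimal sketch, the paper's cosine-shaped cyclical schedule can be written as follows (function and variable names are illustrative, and iterations are indexed from zero here for simplicity):

```python
import math

def cyclical_stepsize(k, alpha_0, total_iters, num_cycles):
    """Cosine cyclical stepsize: restarts at alpha_0 at the start of each cycle
    and decays toward zero as the cycle progresses.
    k is the current iteration (0-indexed)."""
    iters_per_cycle = math.ceil(total_iters / num_cycles)
    frac = (k % iters_per_cycle) / iters_per_cycle   # position within the cycle, in [0, 1)
    return (alpha_0 / 2.0) * (math.cos(math.pi * frac) + 1.0)
```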
Two major stages are incorporated in each cycle of cSG-MCMC (a minimal sketch combining both stages follows the list):
- Exploration Stage: Initiates with larger stepsizes to facilitate the discovery of new modes, effectively acting as a warm restart for the sampler.
- Sampling Stage: Smaller stepsizes are used for fine-grained exploration within discovered modes, gathering samples for posterior approximation.
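The sketch below puts the two stages together for a cyclical SGLD-style sampler. The fixed exploration fraction `beta`, the use of plain SGD-like steps during exploration, and all names are illustrative assumptions rather than the paper's exact algorithm:

```python
import numpy as np

def csgld(grad_U, theta, alpha_0, total_iters, num_cycles, beta=0.8, rng=None):
    """Sketch of cyclical SGLD: each cycle begins with an exploration stage
    (optimization-like steps, no injected noise) and ends with a sampling stage
    (Langevin steps with Gaussian noise). grad_U returns a stochastic gradient
    of the negative log posterior; theta is a NumPy array of parameters."""
    rng = rng or np.random.default_rng(0)
    iters_per_cycle = int(np.ceil(total_iters / num_cycles))
    samples = []
    for k in range(total_iters):
        frac = (k % iters_per_cycle) / iters_per_cycle        # position in cycle
        alpha = 0.5 * alpha_0 * (np.cos(np.pi * frac) + 1.0)  # cyclical stepsize
        g = grad_U(theta)
        if frac < beta:
            # Exploration stage: large steps act as a warm restart to find new modes
            theta = theta - alpha * g
        else:
            # Sampling stage: SGLD step; collect samples around the current mode
            noise = np.sqrt(2.0 * alpha) * rng.standard_normal(theta.shape)
            theta = theta - alpha * g + noise
            samples.append(theta.copy())
    return samples
```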
Additionally, to combine samples from different cycles efficiently and mitigate potential biases introduced by the cyclical schedule, the paper proposes weighting schemes based on the system temperature, adapting harmonic-mean-style estimators to adjust the relative importance of each cycle.
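As a purely hypothetical illustration of the combination step (the paper's actual weighting scheme may differ in detail), predictions from the collected samples can be averaged with per-sample or per-cycle weights:

```python
import numpy as np

def weighted_bma(predict_fn, samples, weights=None):
    """Hypothetical sketch: Bayesian model averaging over collected samples,
    optionally reweighting samples (e.g., per cycle) so that no single cycle
    dominates. predict_fn maps a parameter sample to an (N, C) array of class
    probabilities; the result is the weighted average predictive distribution."""
    preds = np.stack([predict_fn(theta) for theta in samples])  # shape (S, N, C)
    if weights is None:
        weights = np.ones(len(samples))
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    return np.tensordot(weights, preds, axes=1)                 # shape (N, C)
```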
Theoretical Contributions
The paper provides a rigorous theoretical foundation for the proposed method by establishing non-asymptotic convergence rates for the cSG-MCMC algorithm, an analysis not previously available for cyclical stepsize schedules in the SG-MCMC literature. Specifically, it derives convergence guarantees for sample averages over test functions and characterizes convergence to the target distribution in terms of Wasserstein distance. These results indicate that the cyclical schedule can achieve faster convergence than traditional SG-MCMC schedules under practical computational constraints.
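For orientation, non-asymptotic SG-MCMC bounds of this kind typically take the following schematic form; this is the standard template for such analyses, not the paper's exact theorem statement:

```latex
% Schematic only: with stepsizes \alpha_k and total stepsize S_K = \sum_{k=1}^{K} \alpha_k,
% the stepsize-weighted sample average of a test function \phi,
%   \hat{\phi} = \frac{1}{S_K} \sum_{k=1}^{K} \alpha_k \, \phi(\theta_k),
% typically satisfies a bias bound of the form
\bigl| \, \mathbb{E}[\hat{\phi}] - \bar{\phi} \, \bigr|
  \;=\; O\!\left( \frac{1}{S_K} + \frac{\sum_{k=1}^{K} \alpha_k^{2}}{S_K} \right),
% while closeness of the law of the iterates, \mu_K, to the target posterior \pi
% is measured via the Wasserstein distance W_2(\mu_K, \pi).
```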
Empirical Evaluation
Empirical results reinforce the effectiveness of cSG-MCMC. On synthetic multimodal distributions, cSG-MCMC demonstrates superior mode coverage compared to standard stochastic gradient Langevin dynamics (SGLD), in both parallel-chain and single-chain settings. When applied to Bayesian deep learning for image classification on CIFAR-10, CIFAR-100, and ImageNet, the method consistently improves upon both stochastic optimization methods and traditional SG-MCMC in prediction accuracy and uncertainty estimation.
Notably, cSG-MCMC achieves lower classification errors than traditional SG-MCMC and Snapshot Ensembles, illustrating its advantage in leveraging diverse modes of the posterior for robust predictive performance. In uncertainty quantification tasks, cSG-MCMC provides better-calibrated and more diverse predictive distributions, which is important when models encounter out-of-distribution data.
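As an illustration of how such calibration is commonly quantified (the paper's exact evaluation protocol may differ), expected calibration error bins predictions by confidence and compares average confidence to accuracy within each bin:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Illustrative expected calibration error (ECE): partition predictions by
    confidence and average the |accuracy - confidence| gap across bins,
    weighted by bin size. probs has shape (N, C); labels has shape (N,)."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece
```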
Implications and Future Work
This research indicates a promising direction for integrating MCMC techniques in modern Bayesian deep learning, effectively bridging the gap between principled Bayesian inference and high-capacity deep models. The cyclical SG-MCMC offers a practical yet theoretically grounded tool for enhancing model generalization and uncertainty estimation in complex data settings, potentially impacting fields reliant on deep neural networks for decision-making under uncertainty.
Future investigations could explore integrating cyclical schedules with other MCMC variants, assess their applicability to a wider range of neural architectures, and refine the convergence analysis under weaker assumptions. Additionally, automated tuning strategies for the stepsize schedule, informed by online posterior diagnostics, could further broaden the applicability of this approach in real-world Bayesian workflows.