
Preconditioned Stochastic Gradient Langevin Dynamics for Deep Neural Networks (1512.07666v1)

Published 23 Dec 2015 in stat.ML

Abstract: Effective training of deep neural networks suffers from two main issues. The first is that the parameter spaces of these models exhibit pathological curvature. Recent methods address this problem by using adaptive preconditioning for Stochastic Gradient Descent (SGD). These methods improve convergence by adapting to the local geometry of parameter space. A second issue is overfitting, which is typically addressed by early stopping. However, recent work has demonstrated that Bayesian model averaging mitigates this problem. The posterior can be sampled by using Stochastic Gradient Langevin Dynamics (SGLD). However, the rapidly changing curvature renders default SGLD methods inefficient. Here, we propose combining adaptive preconditioners with SGLD. In support of this idea, we give theoretical properties on asymptotic convergence and predictive risk. We also provide empirical results for Logistic Regression, Feedforward Neural Nets, and Convolutional Neural Nets, demonstrating that our preconditioned SGLD method gives state-of-the-art performance on these models.

Citations (312)

Summary

  • The paper introduces pSGLD, integrating adaptive preconditioning into SGLD to balance parameter curvature and enhance training efficiency.
  • Theoretical analysis proves asymptotic convergence and bounds on predictive risk, ensuring robust performance in complex neural architectures.
  • Experimental results demonstrate that pSGLD outperforms standard SGLD, SGD, and RMSprop in convergence speed and test accuracy across various models.

Summary of "Preconditioned Stochastic Gradient Langevin Dynamics for Deep Neural Networks"

The paper proposes an enhancement to Stochastic Gradient Langevin Dynamics (SGLD), a scalable posterior sampling method for deep neural networks (DNNs) whose samples enable Bayesian model averaging and thereby mitigate the overfitting common in large-scale deep learning. The authors address the inefficiency of standard SGLD caused by rapidly changing curvature in the parameter space, a common problem in DNN training.
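For reference, the standard SGLD update that the paper builds on combines a stochastic-gradient step on the log posterior with injected Gaussian noise; the notation below is the conventional one rather than the paper's own:

```latex
\Delta\theta_t = \frac{\varepsilon_t}{2}\left(\nabla_\theta \log p(\theta_t)
  + \frac{N}{n}\sum_{i=1}^{n} \nabla_\theta \log p(x_{t_i}\mid\theta_t)\right)
  + \eta_t, \qquad \eta_t \sim \mathcal{N}(0,\ \varepsilon_t I),
```

where N is the dataset size, n the minibatch size, and \varepsilon_t the stepsize. pSGLD modifies this update with a diagonal preconditioner, as sketched after the contributions list below.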

Key Contributions

  1. Preconditioned SGLD (pSGLD): The paper introduces Preconditioned Stochastic Gradient Langevin Dynamics (pSGLD), a novel integration of adaptive preconditioning techniques from optimization (notably similar to RMSprop) into SGLD. This integration balances the curvature across dimensions of parameter space, standardizing the effective stepsizes and improving sampling efficiency (a minimal update sketch follows this list).
  2. Numerical and Theoretical Results: The authors provide an in-depth theoretical analysis of pSGLD, proving asymptotic convergence properties and establishing bounds on predictive risk. They demonstrate that pSGLD maintains the benefits of SGLD, such as scaling to large datasets, but with enhanced mixing and convergence, particularly in the presence of high curvature variance.
  3. Experimental Validation: Empirical evaluations are conducted on logistic regression models, feedforward neural networks (FNNs), and convolutional neural networks (CNNs). The authors present convincing empirical evidence that pSGLD outperforms standard SGLD and other optimization techniques (e.g., SGD, RMSprop) in terms of both convergence speed and test-set performance across these models.
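As a concrete illustration of the first contribution, here is a minimal sketch of a single pSGLD step with an RMSprop-style diagonal preconditioner. It assumes a minibatch estimate of the gradient of the log posterior is available; the function and variable names are illustrative rather than taken from the paper, and the curvature correction term Γ(θ) in the full update is omitted here for brevity.

```python
import numpy as np

def psgld_step(theta, grad_log_post, v, step_size, alpha=0.99, lam=1e-5, rng=None):
    """One pSGLD update on a flat parameter vector (illustrative sketch).

    theta          : current parameters (1-D numpy array)
    grad_log_post  : minibatch estimate of the gradient of the log posterior
    v              : running average of squared gradients (same shape as theta)
    step_size      : stepsize epsilon_t
    alpha, lam     : RMSprop smoothing and damping constants
    """
    rng = np.random.default_rng() if rng is None else rng
    # RMSprop-style second-moment estimate of the stochastic gradient.
    v = alpha * v + (1.0 - alpha) * grad_log_post ** 2
    # Diagonal preconditioner: dimensions with large gradient variance take smaller steps.
    G = 1.0 / (lam + np.sqrt(v))
    # Preconditioned drift plus Gaussian noise with covariance step_size * G.
    drift = 0.5 * step_size * G * grad_log_post
    noise = rng.normal(size=theta.shape) * np.sqrt(step_size * G)
    return theta + drift + noise, v
```

Collecting the iterates after a burn-in period yields approximate posterior samples of the weights, which can then be used for model averaging as discussed below.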

Practical and Theoretical Implications

Theoretically, the paper adds to the existing literature by bridging adaptive geometry-aware learning methods from stochastic optimization with stochastic sampling algorithms in the form of SGLD. The use of preconditioning allows pSGLD to navigate the complex loss landscapes of DNNs more effectively, suggesting potential wider applicability in other machine learning models that suffer from similar curvature issues.

Practically, pSGLD gives practitioners a robust way to incorporate uncertainty into neural network predictions without resorting to more computationally expensive Bayesian inference techniques. The per-iteration overhead of the diagonal preconditioner is negligible, making the method attractive for the large-scale architectures used in modern deep learning tasks while maintaining competitive test accuracy.
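Once a set of posterior weight samples has been collected, predictive uncertainty is obtained by Monte Carlo averaging of the per-sample predictions. The sketch below assumes a user-supplied `predict_proba(theta, x)` function (a hypothetical name, not from the paper):

```python
import numpy as np

def posterior_predictive(theta_samples, x, predict_proba):
    """Bayesian model averaging: mean predictive distribution over posterior samples.

    theta_samples : list of parameter vectors collected by pSGLD after burn-in
    x             : input batch
    predict_proba : hypothetical function mapping (theta, x) -> class probabilities
    """
    probs = np.stack([predict_proba(theta, x) for theta in theta_samples])
    return probs.mean(axis=0)  # Monte Carlo estimate of p(y | x, data)
```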

Future Directions

Future research could explore extending pSGLD to other forms of deep neural architectures, including recurrent neural networks and autoencoders, where complex parameter landscapes are prevalent. Additionally, the impact of different forms of preconditioners and step-sizing schedules could be investigated to further enhance the robustness and efficiency of the method in various settings.

Overall, this work represents a significant step forward in efficiently training DNNs with Bayesian techniques, combining theoretical rigor with practical usability.