Variational Learning is Effective for Large Deep Networks
Abstract: We give extensive empirical evidence against the common belief that variational learning is ineffective for large neural networks. We show that an optimizer called Improved Variational Online Newton (IVON) consistently matches or outperforms Adam for training large networks such as GPT-2 and ResNets from scratch. IVON's computational cost is nearly identical to Adam's, but its predictive uncertainty is better. We show several new use cases of IVON: improving finetuning and model merging in LLMs, accurately predicting generalization error, and faithfully estimating sensitivity to data. We find overwhelming evidence that variational learning is effective.
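To make the setting concrete, below is a minimal sketch of what variational learning over network weights looks like in practice: a diagonal Gaussian posterior q(w) = N(mu, sigma^2) trained by sampling weights with the reparameterization trick and minimizing a per-example ELBO. This is a generic Bayes-by-backprop-style illustration, not IVON's natural-gradient update rule; the toy data, hyperparameters, and variable names are all illustrative assumptions.

```python
# Minimal sketch of mean-field variational learning by weight sampling.
# Illustrative only: this is NOT IVON's update rule, just the generic
# variational objective the abstract refers to.
import torch

torch.manual_seed(0)

# Toy regression data and a linear "network" with a diagonal Gaussian
# posterior over its weights: q(w) = N(mu, softplus(rho)^2).
X = torch.randn(256, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(256, 1)

mu = torch.zeros(10, 1, requires_grad=True)            # posterior mean
rho = torch.full((10, 1), -3.0, requires_grad=True)    # sigma = softplus(rho)

opt = torch.optim.Adam([mu, rho], lr=1e-2)
prior_prec = 1.0  # precision of the zero-mean Gaussian prior p(w)

for step in range(500):
    opt.zero_grad()
    sigma = torch.nn.functional.softplus(rho)
    # Reparameterization trick: sample w ~ q and backprop through the sample.
    w = mu + sigma * torch.randn_like(sigma)
    nll = torch.nn.functional.mse_loss(X @ w, y)
    # Closed-form KL(q || p) for diagonal Gaussians.
    kl = 0.5 * (prior_prec * (sigma**2 + mu**2) - 1
                - torch.log(prior_prec * sigma**2)).sum()
    loss = nll + kl / X.shape[0]  # negative ELBO, KL scaled per data point
    loss.backward()
    opt.step()

# Predictive uncertainty: average predictions over posterior samples.
with torch.no_grad():
    sigma = torch.nn.functional.softplus(rho)
    preds = torch.stack([X @ (mu + sigma * torch.randn_like(sigma))
                         for _ in range(32)])
    mean_pred, pred_std = preds.mean(0), preds.std(0)
```

The last block shows where the abstract's uncertainty claim comes from: because the posterior is a distribution rather than a point estimate, predictions can be averaged over weight samples, and their spread gives a per-input uncertainty estimate at essentially the cost of a few extra forward passes.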