Variational Learning is Effective for Large Deep Networks

Published 27 Feb 2024 in cs.LG, cs.AI, cs.CL, math.OC, and stat.ML | (2402.17641v2)

Abstract: We give extensive empirical evidence against the common belief that variational learning is ineffective for large neural networks. We show that an optimizer called Improved Variational Online Newton (IVON) consistently matches or outperforms Adam for training large networks such as GPT-2 and ResNets from scratch. IVON's computational costs are nearly identical to Adam but its predictive uncertainty is better. We show several new use cases of IVON where we improve finetuning and model merging in LLMs, accurately predict generalization error, and faithfully estimate sensitivity to data. We find overwhelming evidence that variational learning is effective.


Summary

  • The paper presents IVON, a novel variational learning optimizer that delivers state-of-the-art accuracy and improved uncertainty estimation in large deep networks.
  • Empirical evaluations on models like GPT-2 and ResNet-50 show that IVON reduces perplexity and improves calibration compared to AdamW, at nearly identical computational cost.
  • The work offers practical guidelines and a PyTorch implementation, enabling immediate application in tasks such as fine-tuning, model merging, and robust uncertainty prediction.

Essay: Variational Learning is Effective for Large Deep Networks

The paper "Variational Learning is Effective for Large Deep Networks" presents a rigorous investigation into the capabilities of variational learning techniques, specifically focusing on large-scale neural networks. This study challenges the prevailing skepticism surrounding the use of variational methods in large deep learning models, offering a comprehensive empirical analysis to support its claims.

Overview

The authors introduce the Improved Variational Online Newton (IVON) optimizer, which they compare directly to the widely adopted Adam optimizer. They demonstrate through extensive experiments that IVON matches or even surpasses Adam in training large neural networks like GPT-2 and ResNets. Notably, IVON achieves this with computational costs nearly identical to those of Adam, while also providing superior predictive uncertainty.
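
As background, the objective behind methods of this kind is the standard variational (ELBO-style) objective over a mean-field Gaussian posterior on the weights. The display below is a generic sketch under that assumption, not the paper's exact notation, which may temper or rescale the KL term.

```latex
% Generic mean-field variational objective (a sketch; the paper may use a
% tempered or rescaled variant). q = N(m, diag(sigma^2)) is a diagonal
% Gaussian over the weights \theta, p is the prior, and \ell_i is the loss
% on the i-th training example.
\min_{m,\;\sigma^2}\;
  \mathbb{E}_{\theta \sim \mathcal{N}(m,\,\mathrm{diag}(\sigma^2))}
    \Big[\textstyle\sum_{i=1}^{N} \ell_i(\theta)\Big]
  \;+\;
  \mathrm{KL}\!\big(\mathcal{N}(m,\,\mathrm{diag}(\sigma^2)) \,\big\|\, p(\theta)\big)
```

Roughly speaking, IVON optimizes such an objective with an Adam-like update in which the second-moment estimate also serves as the diagonal posterior precision, which is why its per-iteration cost stays close to Adam's.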

Key Contributions

  1. Introduction of IVON: The paper presents IVON, a novel optimizer adapted to tackle the challenges of large-scale variational learning. IVON is shown to provide state-of-the-art accuracy and uncertainty estimation while being computationally efficient.
  2. Comprehensive Empirical Evaluation: The authors support their claims with experiments across models and datasets. For instance, they train language models such as GPT-2 (773M parameters) from scratch, obtaining lower validation perplexity than AdamW, and experiments with ResNet-50 on ImageNet show that IVON achieves better calibration and accuracy than AdamW.
  3. Application to Downstream Tasks: The paper highlights several use cases of IVON, including fine-tuning and model merging in LLMs, predicting generalization errors, and assessing sensitivity to data variations.
  4. Improved Predictive Uncertainty: Because variational learning yields a posterior distribution over weights, IVON improves predictive uncertainty through posterior averaging, outperforming methods such as MC-dropout and SWAG.
  5. Practical Guidelines and Implementation: The authors provide comprehensive guidelines and a PyTorch implementation, making IVON usable as a drop-in replacement for existing optimizers like Adam (a hedged usage sketch follows this list).
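
To make the drop-in claim concrete, here is a hedged usage sketch in PyTorch: weights are sampled from the current Gaussian posterior during training, and test-time predictions are averaged over posterior samples. The package and class names (ivon, IVON), the ess hyperparameter, and the sampled_params context manager follow the authors' public implementation as we understand it, but they are assumptions here and should be checked against the released code.

```python
# Hedged sketch: training and prediction with an IVON-style variational
# optimizer in PyTorch. `ivon.IVON`, `ess`, and `sampled_params` are assumed
# names mirroring the authors' public implementation; verify against the
# released package before use.
import torch
import torch.nn.functional as F
import ivon  # assumed package name for the authors' implementation

model = torch.nn.Linear(784, 10)                      # stand-in model
optimizer = ivon.IVON(model.parameters(), lr=1e-2,
                      ess=50_000)                     # ess ~ training-set size

def train_step(x, y, mc_samples=1):
    # Sample weights from the current Gaussian posterior, backpropagate the
    # loss for each sample, then update the posterior mean and precision.
    for _ in range(mc_samples):
        with optimizer.sampled_params(train=True):
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def predict(x, mc_samples=10):
    # Posterior averaging at test time: average softmax outputs over several
    # weight samples to obtain better-calibrated predictive probabilities.
    probs = []
    for _ in range(mc_samples):
        with optimizer.sampled_params():
            probs.append(F.softmax(model(x), dim=-1))
    return torch.stack(probs).mean(dim=0)
```

A single weight sample per training step keeps the cost essentially at Adam's level, while uncertainty estimates come from averaging a handful of posterior samples at prediction time.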

Numerical Results

The numerical evidence is consistent, with IVON improving over AdamW on multiple metrics. For instance, when training GPT-2 on OpenWebText, IVON reaches a validation perplexity of 12.6 compared to AdamW's 13.0. On ImageNet classification, IVON surpasses AdamW in both accuracy and calibration, improving top-1 accuracy to 77.46%.

Implications and Future Prospects

The findings of this paper have significant implications for both theory and practice. From a theoretical standpoint, they affirm the viability of variational methods for Bayesian learning in large neural networks. Practically, IVON's ability to improve performance without incurring additional computational cost positions it as a compelling alternative to traditional optimizers.

The potential applications of IVON extend beyond LLMs and ImageNet-scale models. The methodology could be instrumental in complex tasks that demand reliable uncertainty estimation, such as domain adaptation, active learning, and robust speech recognition systems.

Future developments might explore richer posterior distributions and automated hyperparameter tuning to further amplify the strengths of variational learning frameworks. The framework's flexibility also allows researchers to incorporate novel probabilistic models, broadening its applicability.

Conclusion

In summary, the paper presents a well-founded argument for reconsidering variational learning for large deep networks. IVON emerges as a versatile and efficient tool that not only challenges prevailing beliefs but also opens new avenues for research and application in Bayesian deep learning methodologies. The implications of this work suggest a promising new direction for optimizing large-scale neural networks, paving the way for more reliable and robust AI systems.
