- The paper empirically demonstrates that FCNNs exhibit heavy-tailed weight distributions, while CNNs and ResNets show spatial correlations in their weights, informing better prior design.
- It reveals that heavy-tailed priors enhance classification in FCNNs and correlated Gaussian priors improve performance in CNNs and ResNets under cold posterior conditions.
- The study shows that selecting architecture-specific priors can mitigate or exacerbate the cold posterior effect, challenging conventional Bayesian neural network assumptions.
A Reassessment of Bayesian Neural Network Priors
The paper "Bayesian Neural Network Priors Revisited," presented at the International Conference on Learning Representations (ICLR), critiques the standard isotropic Gaussian priors used ubiquitously in Bayesian neural networks (BNNs). These priors are traditionally the default choice in a variety of BNN inference techniques, such as variational inference and Laplace's method. However, this paper questions their appropriateness and seeks to explore alternatives that could potentially enhance performance and mitigate the so-called cold posterior effect observed in neural networks.
The authors study the empirical properties of neural network weights trained with stochastic gradient descent (SGD) across several architectures: fully connected neural networks (FCNNs), convolutional neural networks (CNNs), and ResNets. These evaluations reveal structure that isotropic Gaussian priors cannot capture. Specifically, FCNN weights exhibit heavy-tailed distributions, while CNN and ResNet weights show spatial correlations. This observation motivates a departure from isotropic Gaussian priors: the authors propose heavy-tailed priors for FCNNs and correlated Gaussian priors for CNNs and ResNets.
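The kind of diagnostics involved can be illustrated with a small sketch: excess kurtosis as a measure of tail heaviness, and the correlation between neighbouring filter weights as a proxy for spatial structure. The synthetic tensors below are hypothetical stand-ins for SGD-trained weights, not data from the paper.

```python
import numpy as np
from scipy.stats import kurtosis

def tail_heaviness(weights):
    """Excess kurtosis of a flattened weight tensor.
    A Gaussian gives roughly 0; heavy-tailed distributions give clearly positive values."""
    return kurtosis(np.ravel(weights))  # Fisher definition: Gaussian -> 0

def horizontal_neighbour_correlation(conv_filters):
    """Correlation between horizontally adjacent weights in conv filters of shape
    (out_channels, in_channels, k, k) -- a simple proxy for spatial correlation."""
    left = conv_filters[..., :, :-1].ravel()
    right = conv_filters[..., :, 1:].ravel()
    return np.corrcoef(left, right)[0, 1]

# Illustration with synthetic stand-ins for trained weights (hypothetical data):
fcnn_w = np.random.standard_t(df=3, size=(256, 256)) * 0.05  # heavy-tailed sample
conv_w = np.random.randn(64, 3, 3, 3)                        # no spatial correlation
print(tail_heaviness(fcnn_w), horizontal_neighbour_correlation(conv_w))
```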
Core Contributions
- Empirical Analysis of Weight Distributions: The paper shows that FCNNs trained with SGD develop heavy-tailed weight distributions, while CNNs and ResNets encode significant spatial correlations in their weights. This matters because it ties empirical observations directly to prior design, a connection that is lost when isotropic Gaussian priors are adopted by default.
- Performance Implications: Through extensive experiments on benchmark datasets such as MNIST, FashionMNIST, and CIFAR-10, the paper demonstrates that heavy-tailed priors significantly improve classification performance in FCNNs over Gaussian priors. Conversely, CNNs and ResNets see a performance boost with correlated Gaussian priors, especially under cold posterior conditions, though this can strengthen the cold posterior effect in some models, such as ResNets on CIFAR-10 (a minimal sketch of both prior families follows this list).
- Cold Posterior Effect: Choosing better-suited priors can either mitigate the cold posterior phenomenon, as with heavy-tailed priors in FCNNs, or exacerbate it, as correlated priors in ResNets illustrate. This provides compelling evidence that the effect is tied, at least in part, to prior specification, challenging previous assumptions.
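The following sketch illustrates the two alternative prior families referenced above: an independent Student-t prior over FCNN weights and a correlated Gaussian prior over convolutional filters. The hyperparameters (degrees of freedom, prior scale, correlation strength) and the equal-correlation covariance structure are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np
from scipy.stats import t as student_t, multivariate_normal

def heavy_tailed_log_prior(weights, df=3.0, scale=0.1):
    """Independent Student-t prior on each weight -- one heavy-tailed choice
    of the kind favoured for FCNNs (df and scale are illustrative)."""
    return np.sum(student_t.logpdf(np.ravel(weights), df=df, scale=scale))

def correlated_gaussian_log_prior(conv_filters, rho=0.5, sigma=0.1):
    """Correlated Gaussian prior over each k*k filter: zero mean, covariance
    sigma^2 * ((1 - rho) * I + rho * 11^T), i.e. equal correlation rho between
    all weights in a filter (an illustrative correlation structure)."""
    out_c, in_c, k, _ = conv_filters.shape
    d = k * k
    cov = sigma ** 2 * ((1.0 - rho) * np.eye(d) + rho * np.ones((d, d)))
    flat = conv_filters.reshape(out_c * in_c, d)
    return np.sum(multivariate_normal.logpdf(flat, mean=np.zeros(d), cov=cov))

# Example: evaluate both priors on synthetic weight tensors.
fcnn_w = np.random.randn(256, 256) * 0.05
conv_w = np.random.randn(64, 3, 3, 3) * 0.05
print(heavy_tailed_log_prior(fcnn_w))
print(correlated_gaussian_log_prior(conv_w))
```

In a Bayesian treatment, these log-prior terms simply replace the isotropic Gaussian log-prior in whatever inference scheme is used; the likelihood and the rest of the pipeline stay unchanged.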
Theoretical and Practical Implications
This paper calls into question a core assumption in Bayesian deep learning: that isotropic Gaussian priors are adequate. By investigating richer priors, the work broadens our understanding of how to select BNN priors that align more closely with the empirical distribution of weights in networks trained with SGD. This has implications for both predictive performance and uncertainty quantification, benefiting applications ranging from natural image recognition to safety-critical AI systems.
Moreover, the implications extend to the theoretical understanding of the cold posterior effect. The mixed results on when this effect is mitigated or exacerbated invite a deeper inquiry into the interactions between prior selection, data augmentation strategies, and likelihood specification.
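For reference, the tempered ("cold") posterior at the centre of this discussion is commonly written as follows; this formulation comes from the broader cold posterior literature rather than from the paper summary above:

```latex
p_T(\theta \mid \mathcal{D}) \;\propto\; \exp\!\left(-\tfrac{1}{T}\, U(\theta)\right),
\qquad
U(\theta) \;=\; -\log p(\mathcal{D} \mid \theta) \;-\; \log p(\theta),
```

where T = 1 recovers the standard Bayes posterior; the cold posterior effect refers to the empirical observation that temperatures T < 1 often improve predictive performance.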
Speculation on Future Developments
Given the findings, one could foresee the community addressing the BNN prior selection process with an increased focus on architectural and task-specific characteristics. Moreover, future research might delve into hybrid approaches that combine empirical Bayesian techniques with novel prior formulations tailored to particular domains or data regimes.
Overall, the paper advocates for a paradigm shift: rather than universally adopting isotropic Gaussian priors, the field should embrace data-informed alternatives that better mirror the empirical reality of network weights. This strategic reevaluation could lead to new methodologies in BNNs that afford better performance and robustness, fundamentally advancing the capabilities of Bayesian deep learning models.