Sparse Bayesian Neural Networks
- Sparse Bayesian Neural Networks are probabilistic models that combine deep architectures with shrinkage priors to enforce sparsity and enable principled uncertainty quantification.
- They achieve minimax-optimal posterior contraction rates by adapting automatically to the intrinsic smoothness and composite structure of functions in Besov spaces.
- These networks effectively exploit low-dimensional structures to mitigate the curse of dimensionality, ensuring robust performance in high-dimensional structured estimation tasks.
Sparse Bayesian Neural Networks are probabilistic neural models that combine expressive deep architectures with Bayesian inference and shrinkage mechanisms to enforce sparsity in their parameters. By assigning sparse-inducing priors to the weights and, in some cases, to the network’s architectural configuration, these networks allow the posterior to concentrate on parsimonious representations—automatically discarding redundant parameters—while still delivering principled uncertainty quantification. The recent theoretical development in "Posterior Contraction for Sparse Neural Networks in Besov Spaces with Intrinsic Dimensionality" (2506.19144) establishes the minimax-optimality and adaptivity of sparse BNNs across a broad class of high-dimensional, composite, and possibly non-smooth function classes, and rigorously characterizes the role of intrinsic dimensionality in their learning rates.
1. Posterior Contraction Rates for Sparse Bayesian Neural Networks
The central focus is on the posterior contraction rate: the speed at which the Bayesian posterior, given the data, shrinks around the true function as the number of samples grows. For nonparametric regression problems of the form $Y_i = f_0(X_i) + \varepsilon_i$, where $f_0$ lies within an anisotropic or composite Besov function space, the contraction rate for sparse BNNs equipped with suitable priors is (up to logarithmic factors)

$$\epsilon_n \asymp n^{-\alpha^{\dagger}/(2\alpha^{\dagger}+1)},$$

where $\alpha^{\dagger}$ is the intrinsic smoothness parameter of the function space. In the case of hierarchical (composite) Besov spaces, characterized by layered compositions and possibly variable selection at each layer, the rate becomes

$$\epsilon_n \asymp \max_{1 \le h \le H} n^{-\alpha_h^{*}/(2\alpha_h^{*}+1)},$$

where $\alpha_h^{*}$ reflects the effective smoothness of layer $h$, so the overall rate is determined by the most complex (least smooth) part of the hierarchy.
These rates are minimax-optimal: they match information-theoretic lower bounds for function estimation over the given spaces. Formally, the posterior probability assigned to functions outside an $M_n\epsilon_n$-neighborhood of the truth vanishes as $n \to \infty$ for any diverging sequence $M_n \to \infty$, guaranteeing concentration around the true function at the optimal rate under appropriate priors.
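As a purely numerical illustration of the first display (a minimal sketch under the stated rate form, ignoring constants and logarithmic factors; the helper names are ours, not the paper's), the snippet below evaluates the anisotropic rate from a vector of coordinate-wise smoothness values, using the harmonic-mean definition of $\alpha^{\dagger}$ given in the next section.

```python
import numpy as np

def intrinsic_smoothness(alphas):
    """alpha_dagger = (sum_i 1/alpha_i)^(-1) for coordinate-wise smoothness alphas."""
    alphas = np.asarray(alphas, dtype=float)
    return 1.0 / np.sum(1.0 / alphas)

def contraction_rate(n, alphas):
    """epsilon_n ~ n^{-a/(2a+1)} with a = intrinsic smoothness (log factors omitted)."""
    a = intrinsic_smoothness(alphas)
    return n ** (-a / (2.0 * a + 1.0))

# Example: d = 3 with direction-specific smoothness (2, 4, 4)
# gives alpha_dagger = 1 / (1/2 + 1/4 + 1/4) = 1, hence epsilon_n ~ n^{-1/3}.
print(contraction_rate(n=10_000, alphas=[2.0, 4.0, 4.0]))  # ~ 0.046
```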
2. Besov Spaces, Function Composition, and Their Role in Sparse BNNs
Besov spaces ($B^{\boldsymbol{\alpha}}_{p,q}$), especially in their anisotropic formulation, generalize classical smoothness spaces by allowing direction-specific smoothness and are especially suited to modeling non-smooth, discontinuous, or localized features. The intrinsic smoothness parameter,

$$\alpha^{\dagger} = \Big( \sum_{i=1}^{d} \alpha_i^{-1} \Big)^{-1},$$

where $\alpha_i$ denotes the smoothness along coordinate $i$, quantifies the true difficulty of estimation.
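For intuition, specializing this definition to the isotropic case $\alpha_1 = \cdots = \alpha_d = s$ recovers the classical $d$-dimensional nonparametric rate (a routine check, spelled out here for convenience):

```latex
\alpha^{\dagger} = \Big( \sum_{i=1}^{d} \frac{1}{s} \Big)^{-1} = \frac{s}{d},
\qquad
n^{-\frac{\alpha^{\dagger}}{2\alpha^{\dagger}+1}}
  = n^{-\frac{s/d}{2s/d+1}}
  = n^{-\frac{s}{2s+d}}.
```

Conversely, when most coordinates are very smooth (their $\alpha_i^{-1}$ contributions are negligible), $\alpha^{\dagger}$ stays large and the rate behaves as if the problem were low-dimensional.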
The theory extends to composite/hierarchical Besov spaces, where the function can be decomposed as

$$f_0 = g_H \circ g_{H-1} \circ \cdots \circ g_1,$$

with each component $g_h$ possibly acting on only a subset of variables and variable selection performed at every compositional level. This structure naturally encompasses additive, multiplicative, and more general deep functions, reflecting the architectural expressiveness of deep neural networks.
Importantly, in these settings the contraction rate is governed by the complexity of the hardest part of the composite (the layer with the lowest effective smoothness relative to the variables it acts on), not by the ambient input dimensionality, leveraging the function's intrinsic, hierarchical structure.
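As a concrete special case (an illustration consistent with, but not quoted from, the paper), an additive model fits this composite template with two layers:

```latex
f_0(x) \;=\; \sum_{j=1}^{d} f_j(x_j) \;=\; g_2 \circ g_1(x),
\qquad
g_1(x) = \big( f_1(x_1), \ldots, f_d(x_d) \big),
\qquad
g_2(u) = \sum_{j=1}^{d} u_j .
```

Each inner component acts on a single coordinate, so if $f_j$ has smoothness $\alpha_j$ the rate is governed by the least smooth component, of order $n^{-\alpha_{\min}/(2\alpha_{\min}+1)}$ up to logarithmic factors, rather than by the ambient dimension $d$.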
3. Intrinsic Dimensionality and Mitigation of the Curse of Dimensionality
The contraction rates derived depend not on the full input dimension $d$, but on an intrinsic dimension determined by the effective smoothness and by the subsets of variables each layer acts on. Specifically, for an anisotropic Besov space the rate is driven by $\alpha^{\dagger}$, which corresponds to an intrinsic dimension $d^{\dagger} = \alpha_{\min}/\alpha^{\dagger}$ (with $\alpha_{\min} = \min_i \alpha_i$) in the sense that $n^{-\alpha^{\dagger}/(2\alpha^{\dagger}+1)} = n^{-\alpha_{\min}/(2\alpha_{\min}+d^{\dagger})}$, and for composite spaces the intrinsic dimension is defined recursively, layer by layer, through the composition and smoothness structure.
This result rigorously demonstrates that sparse BNNs equipped with appropriate priors adapt to the function's low-dimensional structure, circumventing the curse of dimensionality even when the ambient input space is very large. This theoretical insight provides a justification for the empirical success of sparse deep models in structured, high-dimensional tasks.
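To make the dimension dependence concrete, the toy calculation below (illustrative only; it assumes the function depends on just three of the ambient coordinates and reuses the rate exponent from Section 1) contrasts the intrinsic-structure exponent with the exponent a naive ambient-dimension rate would give.

```python
import numpy as np

def rate_exponent(alphas_active):
    """Exponent r in epsilon_n ~ n^{-r}, computed from the coordinates the function actually uses."""
    a = 1.0 / np.sum(1.0 / np.asarray(alphas_active, dtype=float))
    return a / (2.0 * a + 1.0)

d_ambient = 1000                   # ambient input dimension
alphas_active = [2.0, 2.0, 2.0]    # smoothness along the 3 relevant coordinates

r_intrinsic = rate_exponent(alphas_active)   # uses the intrinsic structure
r_ambient = 2.0 / (2 * 2.0 + d_ambient)      # classical s/(2s+d) with s = 2, d = 1000

print(f"intrinsic exponent: {r_intrinsic:.3f}")  # 0.286  (alpha_dagger = 2/3)
print(f"ambient-d exponent: {r_ambient:.4f}")    # 0.0020 (essentially no learning)
```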
4. Shrinkage and Spike-and-Slab Priors for Sparsity
Sparse Bayesian Neural Networks employ shrinkage-type priors to induce sparsity in the weights—and potentially in architecture. Two central classes are:
- Spike-and-slab priors: Mixtures of a point mass at zero and a diffuse "slab" (e.g., normal or uniform), promoting exact zeros for most coefficients:

  $$w_j \mid z_j \sim (1 - z_j)\,\delta_0 + z_j\,\pi_{\mathrm{slab}}, \qquad z_j \sim \mathrm{Bernoulli}(\lambda),$$

  where $z_j$ is a Bernoulli inclusion indicator with inclusion probability $\lambda$.
- Continuous shrinkage priors: Heavy-tailed densities sharply peaked at zero, such as sub-Weibull priors or continuous spike-and-slab mixtures, e.g.

  $$\pi(w_j) = \lambda\,\psi_{\sigma_1}(w_j) + (1-\lambda)\,\psi_{\sigma_0}(w_j), \qquad \sigma_0 \ll \sigma_1,$$

  where the mixing weight $\lambda$ and the scales $\sigma_0, \sigma_1$ are chosen to satisfy the required concentration and support conditions. A small sampling sketch contrasting the two families follows this list.
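The sketch below contrasts how the two prior families generate weights (a minimal illustration using the notation above; the mixing weight and scales are placeholder values, not the tuned choices required by the theory):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_spike_and_slab(size, lam=0.1, slab_sd=1.0):
    """Exact-zero spike-and-slab: w_j = z_j * slab draw, with z_j ~ Bernoulli(lam)."""
    z = rng.binomial(1, lam, size=size)           # inclusion indicators z_j
    slab = rng.normal(0.0, slab_sd, size=size)    # diffuse slab component
    return z * slab                               # most weights are exactly zero

def sample_continuous_shrinkage(size, lam=0.1, spike_sd=1e-3, slab_sd=1.0):
    """Continuous spike-and-slab: mixture of a narrow and a wide Gaussian."""
    z = rng.binomial(1, lam, size=size)
    sd = np.where(z == 1, slab_sd, spike_sd)      # near-zero (not exactly zero) shrinkage
    return rng.normal(0.0, sd)

w_ss = sample_spike_and_slab(10_000)
w_cs = sample_continuous_shrinkage(10_000)
print((w_ss == 0).mean())            # ≈ 0.9: exact zeros
print((np.abs(w_cs) < 1e-2).mean())  # ≈ 0.9: tiny but nonzero weights
```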
To enable rate adaptation, priors are also placed on architecture-related hyperparameters (e.g., network width, sparsity, depth), with decay rates shaped to guarantee sufficient prior mass near optimal configurations.
5. Rate Adaptation for Unknown Smoothness and Architecture
A key property of the Bayesian formulation established in this work is automatic rate adaptation: the BNN's posterior achieves the minimax contraction rate without requiring any prior knowledge of the true function's smoothness or compositional complexity. Rate adaptation is accomplished by introducing suitably decaying priors on the architectural components (e.g., number of nonzero nodes, network width, depth), such as priors with exponentially decaying tails of the form

$$\pi(s) \propto \exp\big(-\lambda\, s \log s\big)$$

on the sparsity level $s$, ensuring that the posterior contracts at the optimal rate regardless of the unknown parameter values.
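A minimal sketch of such a decaying architecture prior (assuming the illustrative form $\pi(s) \propto e^{-\lambda s \log s}$ shown above; the decay constant and truncation are placeholders, not the paper's exact specification):

```python
import numpy as np

def architecture_prior(max_size, decay=0.5):
    """Normalized prior pi(s) proportional to exp(-decay * s * log s), s = 1..max_size."""
    s = np.arange(1, max_size + 1, dtype=float)
    log_p = -decay * s * np.log(s)
    log_p -= log_p.max()            # stabilize before exponentiating
    p = np.exp(log_p)
    return p / p.sum()

prior = architecture_prior(max_size=200)
print(prior[:3])    # bulk of the mass on very small networks
print(prior[-3:])   # vanishingly small mass on large configurations
```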
This adaptivity is not limited to spike-and-slab priors; the work proves analogous adaptation for continuous-shrinkage priors satisfying the specified concentration conditions.
6. Implications for High-dimensional Structured Estimation
These results provide a modern theoretical justification for the practical effectiveness of sparse Bayesian neural networks:
- Robustness to high dimensionality: Learning rates depend on intrinsic structural complexity, not ambient dimension, provided the true function admits compositional or low-dimensional representations.
- Automatic variable and architecture selection: The posterior automatically discards irrelevant connections and adapts the model size to the underlying data complexity.
- Broad function class coverage: Results cover anisotropic Besov, additive, multiplicative, and general composite functions, matching the types of dependencies often targeted in applied science and engineering.
- Calibrated uncertainty quantification: Fully Bayesian treatment delivers not only point estimators but also credible sets reflecting data-informed posterior beliefs.
Table: Contrasting spike-and-slab and continuous shrinkage priors

| Feature | Sparse/Spike-and-Slab Priors | Continuous Shrinkage Priors |
|---|---|---|
| Exact zeros in parameters | Yes | No (but effective near-zero shrinkage) |
| Rate-optimal posterior contraction | Yes | Yes |
| Posterior adaptation to architecture | Yes (with suitable prior on configuration) | Yes |
| Efficient for composite/hierarchical functions | Yes | Yes |
The analytical results unify and extend earlier theory that lacked adaptation for depth or did not treat highly composite/hierarchical models, thereby underpinning the broad empirical success of sparse BNNs in high-dimensional structured learning problems.
7. Summary
Sparse Bayesian Neural Networks, when equipped with appropriately designed shrinkage or spike-and-slab priors over both network weights and architectures, achieve optimal posterior contraction rates over complex Besov-type function classes. These rates depend solely on the function’s intrinsic dimensionality rather than the ambient space, ensuring adaptivity and bypassing the curse of dimensionality. This result resolves an open theoretical question by demonstrating both minimax-optimality and automatic adaptation in fully Bayesian deep learning for a wide range of structured estimands. The theoretical foundation directly explains the practical efficiency and robustness of sparse BNNs in contemporary machine learning applications (2506.19144).