Rapid Bayesian Computation and Estimation for Neural Networks via Log-Concave Coupling (2411.17667v3)
Abstract: This paper studies a Bayesian estimation procedure for single-hidden-layer neural networks using $\ell_{1}$-controlled weights. We study the structure of the posterior density and provide a representation that makes it amenable to rapid sampling via Markov Chain Monte Carlo (MCMC) and to statistical risk guarantees. The neural network has $K$ neurons, internal weight dimension $d$, and fixed outer weights, giving $Kd$ parameters overall. With $N$ data observations, we use a gain parameter $\beta$ in the posterior density. The posterior is multimodal and not naturally suited to rapid mixing of direct MCMC algorithms. For a continuous uniform prior on the $\ell_{1}$ ball, we show that the posterior density can be written as a mixture density with suitably defined auxiliary random variables, where the mixture components are log-concave. Furthermore, when the number of model parameters $Kd$ is large enough that $Kd \geq C(\beta N)^{2}$, the mixing distribution of the auxiliary random variables is also log-concave. Thus, neuron parameters can be sampled from the posterior by sampling only log-concave densities. We refer to the mixture density as a log-concave coupling. For a discrete uniform prior restricted to a grid, we study the statistical risk (generalization error) of procedures based on the posterior. Using a gain of $\beta = C[(\log d)/N]^{1/4}$, we demonstrate that the squared error is of order $O([(\log d)/N]^{1/4})$. Using independent Gaussian data with a variance $\sigma^{2}$ that matches the inverse gain, $\beta = 1/\sigma^{2}$, we show that the expected Kullback divergence has a cube-root power $O([(\log d)/N]^{1/3})$. Future work aims to bridge the sampling ability of the continuous uniform prior with the risk control of the discrete uniform prior, resulting in a polynomial-time Bayesian training algorithm for neural networks with statistical risk control.
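To make the setup concrete, the following is a minimal Python sketch (not from the paper) of the single-hidden-layer model and a Gibbs-type posterior with gain $\beta$ and a uniform prior on the $\ell_{1}$ ball, as described in the abstract. The tanh activation, equal fixed outer weights, and squared-error data-fit term are illustrative assumptions not specified in the abstract; the sketch only evaluates the unnormalized log-posterior and does not implement the log-concave coupling or any sampler.

```python
# Minimal sketch, under assumed modeling choices (activation, outer weights, loss).
import numpy as np

def network(W, X, outer=None):
    """Single-hidden-layer network with K neurons and inner weights W (K x d).

    Each row of W is assumed to lie in the ell_1 ball; outer weights are fixed.
    """
    K, _ = W.shape
    if outer is None:
        outer = np.ones(K) / K          # fixed outer weights (assumed equal)
    return np.tanh(X @ W.T) @ outer     # tanh activation is an assumption

def log_posterior_unnormalized(W, X, y, beta):
    """Log posterior density up to an additive constant.

    Continuous uniform prior on the ell_1 ball: constant on the support, -inf outside.
    The gain beta multiplies an (assumed) squared-error data-fit term.
    """
    if np.any(np.sum(np.abs(W), axis=1) > 1.0):
        return -np.inf                  # outside the support of the prior
    resid = y - network(W, X)
    return -0.5 * beta * np.sum(resid ** 2)

# Example: N = 100 observations, input dimension d = 20, K = 50 neurons.
rng = np.random.default_rng(0)
N, d, K = 100, 20, 50
X = rng.standard_normal((N, d))
y = rng.standard_normal(N)
W = rng.standard_normal((K, d))
W /= np.maximum(1.0, np.sum(np.abs(W), axis=1, keepdims=True))  # scale rows into the ell_1 ball
print(log_posterior_unnormalized(W, X, y, beta=0.1))
```

As the abstract notes, this posterior is multimodal in $W$, which is why direct MCMC on it is not expected to mix rapidly; the paper's contribution is the auxiliary-variable mixture representation whose components (and, when $Kd \geq C(\beta N)^{2}$, whose mixing distribution) are log-concave.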