Probabilistic Backpropagation
- Probabilistic Backpropagation is a Bayesian inference algorithm that propagates means and variances through neural network layers for both regression and classification tasks.
- It leverages expectation-propagation-style moment matching to provide tractable approximate posterior inference and well-calibrated predictive uncertainty.
- Extensions to deep Gaussian processes and the use of stochastic EP allow the method to scale efficiently to large datasets while maintaining competitive performance.
Probabilistic Backpropagation (PBP) is a scalable Bayesian inference algorithm for neural networks and their hierarchical extensions, including deep Gaussian processes (DGPs). PBP leverages expectation propagation (EP)-style moment-matching to propagate uncertainty through nonlinear models, enabling both tractable approximate posterior inference and well-calibrated predictive uncertainty. The method is formulated for regression and classification tasks with flexible network architectures and activation functions, and is extensible to probabilistic kernel design and hierarchical Bayesian priors.
1. Bayesian Neural Network Model Formulation
PBP operates in the Bayesian neural network (BNN) setting, where the model is defined as follows. Let the dataset $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$ consist of input vectors $x_n$ and scalar outputs $y_n$ (regression) or binary labels (classification). The BNN is constructed with $L$ layers of hidden units, weight matrices $\mathcal{W} = \{W_l\}_{l=1}^{L}$, and (optionally) hierarchical priors or hyperpriors.
The standard regression case uses a Gaussian likelihood and independent zero-mean Gaussian priors on the weights:

$$p(y_n \mid \mathcal{W}, x_n, \gamma) = \mathcal{N}\!\left(y_n \mid f(x_n; \mathcal{W}), \gamma^{-1}\right), \qquad p(\mathcal{W} \mid \lambda) = \prod_{l=1}^{L}\prod_{i,j} \mathcal{N}\!\left(w_{ij,l} \mid 0, \lambda^{-1}\right),$$

with precision hyperpriors $p(\lambda) = \text{Gamma}(\lambda \mid \alpha_0^{\lambda}, \beta_0^{\lambda})$ and $p(\gamma) = \text{Gamma}(\gamma \mid \alpha_0^{\gamma}, \beta_0^{\gamma})$ (Hernández-Lobato et al., 2015).
For binary classification, the likelihood is logistic: $p(y_n = 1 \mid x_n, \mathcal{W}) = \sigma\!\left(f(x_n; \mathcal{W})\right)$, where $\sigma(z) = 1/(1 + e^{-z})$ and $f(x_n; \mathcal{W})$ denotes the network's pre-activation output (Olobatuyi, 2023).
Hierarchical Gaussian priors may be specified for each weight, with hyperpriors on precisions (e.g., rectified Gaussian or Gamma).
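Combining these components, the exact posterior that PBP targets factorizes over the likelihood, the weight prior, and the hyperpriors (in the notation introduced above):

$$p(\mathcal{W}, \gamma, \lambda \mid \mathcal{D}) \;\propto\; \prod_{n=1}^{N} \mathcal{N}\!\left(y_n \mid f(x_n; \mathcal{W}), \gamma^{-1}\right) \; \prod_{l,i,j} \mathcal{N}\!\left(w_{ij,l} \mid 0, \lambda^{-1}\right) \; \text{Gamma}(\lambda \mid \alpha_0^{\lambda}, \beta_0^{\lambda}) \, \text{Gamma}(\gamma \mid \alpha_0^{\gamma}, \beta_0^{\gamma}).$$

PBP approximates this posterior with a fully factorized family: an independent Gaussian per weight and Gamma factors for $\lambda$ and $\gamma$.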
2. Probabilistic Forward and Backward Passes
PBP replaces deterministic layer outputs with propagation of means and variances. For a given input, each layer computes:
- Linear transformation: For the pre-activations $z_l = W_l a_{l-1}$ (rescaled by the square root of the previous layer's dimension in the original formulation), means and variances are propagated analytically using the first two moments of the independent Gaussian inputs and weights.
- Nonlinear activations: For ReLU, each component is approximated by a truncated-Gaussian whose moments are analytic functions of the incoming mean and variance (Hernández-Lobato et al., 2015). For general activations, one-dimensional quadrature or Taylor expansions compute output means and variances (Olobatuyi, 2023).
- Bias units have fixed values and zero variance; outputs form the next layer's inputs.
This forward pass computes a Gaussian marginal for the network output, enabling closed-form approximations for predictive distributions.
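To make the forward pass concrete, the following minimal sketch (not the authors' reference code) propagates means and variances through one linear layer with factorized Gaussian weights and a ReLU activation, using the standard rectified-Gaussian moment formulas; PBP's exact expressions differ slightly in how the output approximation is formed, and the dimension-based rescaling is included here as an optional flag.

```python
import numpy as np
from scipy.stats import norm

def linear_moments(m_a, v_a, M_w, V_w, scale=True):
    """Mean/variance of z = W a for independent Gaussian inputs and weights.

    m_a, v_a : (D,) mean and variance of the incoming activations.
    M_w, V_w : (K, D) means and variances of the factorized Gaussian weights.
    """
    d = m_a.shape[0]
    m_z = M_w @ m_a
    # Var(w * a) = v_w*v_a + v_w*m_a^2 + m_w^2*v_a, summed over inputs.
    v_z = V_w @ (v_a + m_a**2) + (M_w**2) @ v_a
    if scale:  # rescale pre-activations by sqrt(input dimension), as in PBP
        m_z, v_z = m_z / np.sqrt(d), v_z / d
    return m_z, v_z

def relu_moments(m_z, v_z):
    """First two moments of max(0, z) for z ~ N(m_z, v_z) (rectified Gaussian)."""
    s = np.sqrt(v_z)
    alpha = m_z / s
    m_a = m_z * norm.cdf(alpha) + s * norm.pdf(alpha)
    e2 = (m_z**2 + v_z) * norm.cdf(alpha) + m_z * s * norm.pdf(alpha)
    return m_a, np.maximum(e2 - m_a**2, 1e-12)

# Example: propagate a 5-dimensional input through a layer of 3 hidden units.
rng = np.random.default_rng(0)
m_a, v_a = rng.normal(size=5), np.full(5, 0.1)
M_w, V_w = rng.normal(size=(3, 5)), np.full((3, 5), 0.5)
m_h, v_h = relu_moments(*linear_moments(m_a, v_a, M_w, V_w))
print(m_h, v_h)
```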
The backward pass differentiates the log normalization constant $\log Z$ (the log marginal probability of the observed target under the propagated Gaussian) with respect to the mean and variance parameters of every weight, using reverse-mode differentiation and the chain rule through the layer-wise moment formulas. These gradients drive the update of the approximate posterior moments for each weight.
3. EP-Style Moment Matching and Posterior Updates
PBP implements a local assumed-density filtering (ADF) or expectation propagation (EP) update as follows. For each data point $n$, remove its corresponding site $\tilde{f}_n$ from the current approximation $q$ to obtain the cavity distribution $q^{\setminus n}(w) \propto q(w)/\tilde{f}_n(w)$, then form the tilted distribution

$$\hat{p}_n(w) \;\propto\; p(y_n \mid x_n, w)\, q^{\setminus n}(w).$$

Moment matching projects $\hat{p}_n$ onto the Gaussian family by equating means and variances, yielding the updated posterior parameters.
The update equations for a generic Gaussian approximation $q(w) = \mathcal{N}(w \mid m, v)$ perturbed by an arbitrary likelihood factor $f(w)$ are (Minka, 2001):

$$m^{\text{new}} = m + v\,\frac{\partial \log Z}{\partial m}, \qquad v^{\text{new}} = v - v^2\left[\left(\frac{\partial \log Z}{\partial m}\right)^{2} - 2\,\frac{\partial \log Z}{\partial v}\right],$$

where $Z = \int f(w)\,\mathcal{N}(w \mid m, v)\,dw$ (Hernández-Lobato et al., 2015; Bui et al., 2015; Olobatuyi, 2023).
This scheme applies to data likelihoods, weight priors, and hyperpriors, making inference tractable and decentralized at the level of posterior factors. In variational expectation propagation (VEP), moment-matching is also used for factorized approximate posteriors over hierarchical weights and precisions (Olobatuyi, 2023).
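As a sanity check on the update equations above, the sketch below applies them to a single Gaussian factor with a Gaussian likelihood, where $\log Z$ is analytic; in this conjugate case the moment-matched update reproduces exact Bayesian conditioning. The function names and the noise-variance argument are illustrative, not taken from any reference implementation.

```python
import numpy as np

def pbp_update(m, v, dlogZ_dm, dlogZ_dv):
    """Minka-style moment-matching update of a Gaussian factor N(w | m, v)."""
    m_new = m + v * dlogZ_dm
    v_new = v - v**2 * (dlogZ_dm**2 - 2.0 * dlogZ_dv)
    return m_new, v_new

def gaussian_logZ_grads(y, m, v, noise_var):
    """Gradients of log Z = log N(y | m, v + noise_var) w.r.t. m and v."""
    s = v + noise_var
    dlogZ_dm = (y - m) / s
    dlogZ_dv = 0.5 * ((y - m)**2 / s**2 - 1.0 / s)
    return dlogZ_dm, dlogZ_dv

# Conjugate check: the update must equal the exact Gaussian posterior.
m, v, y, noise_var = 0.0, 2.0, 1.5, 0.5
m_new, v_new = pbp_update(m, v, *gaussian_logZ_grads(y, m, v, noise_var))
m_exact = m + v * (y - m) / (v + noise_var)
v_exact = v - v**2 / (v + noise_var)
assert np.allclose([m_new, v_new], [m_exact, v_exact])
print(m_new, v_new)
```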
4. Extensions to Deep Gaussian Processes and Stochastic EP
PBP is generalized to deep Gaussian processes (DGPs), where the neural network weights are replaced by the inducing outputs of a GP at each layer (Bui et al., 2015). Exact DGP inference is intractable because of the layer-wise marginalization over latent function values. Stochastic Expectation Propagation (SEP) replaces the $N$ per-data-point sites with a single "average" site $g$, so that the approximate posterior takes the form $q(\theta) \propto p(\theta)\, g(\theta)^{N}$, enabling efficient memory use by storing only the aggregate site's natural parameters (Bui et al., 2015).
SEP-PBP inference involves:
- Computing the cavity distribution by removing the global site.
- Building the tilted distribution using the marginal likelihood for each data point, sequentially moment-matched through hidden layers (probabilistic backpropagation).
- Moment matching and stochastic, damped updates to the global site parameters, using a small learning rate for stability.
- Hyperparameters (lengthscales, variances, inducing locations) are initialized heuristically and tuned by stochastic gradient ascent on the EP energy (approximate marginal likelihood).
The FITC sparse GP approximation is used in each layer, reducing the computational cost from $\mathcal{O}(N^3)$ to $\mathcal{O}(NM^2)$ with $M \ll N$ inducing points per layer, and SEP further reduces the memory overhead from one stored site per data point to a single average site per layer (Bui et al., 2015).
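The sketch below illustrates the SEP bookkeeping for a single one-dimensional Gaussian parameter in natural-parameter form; the moment-matching step itself is abstracted behind a user-supplied `tilted_moments` function, and all names are illustrative rather than drawn from Bui et al.'s code.

```python
import numpy as np

def to_natural(m, v):       # (mean, variance) -> (precision-mean, precision)
    return m / v, 1.0 / v

def to_moment(eta1, eta2):  # (precision-mean, precision) -> (mean, variance)
    return eta1 / eta2, 1.0 / eta2

def sep_step(prior_nat, site_nat, n_data, tilted_moments, lr=0.1):
    """One stochastic-EP update of the shared 'average' site g.

    q(theta) ∝ p(theta) g(theta)^N, so q's natural params are prior + N * site.
    """
    q_nat = tuple(p + n_data * s for p, s in zip(prior_nat, site_nat))
    # Cavity: remove one copy of the average site (stand-in for this point's site).
    cav_nat = tuple(q - s for q, s in zip(q_nat, site_nat))
    # Moment-match the tilted distribution formed with one likelihood term.
    m_new, v_new = tilted_moments(*to_moment(*cav_nat))
    match_nat = to_natural(m_new, v_new)
    # Implied site for this data point, then a damped update of the average site.
    site_hat = tuple(mt - c for mt, c in zip(match_nat, cav_nat))
    return tuple((1 - lr) * s + lr * sh for s, sh in zip(site_nat, site_hat))

# Toy usage: Gaussian likelihood y ~ N(theta, 0.5), so the tilted moments are exact.
y, noise_var, n_data = 1.2, 0.5, 100
tilted = lambda m, v: (m + v * (y - m) / (v + noise_var), v * noise_var / (v + noise_var))
site = (0.0, 1e-6)                       # start from a near-uniform average site
for _ in range(50):
    site = sep_step(to_natural(0.0, 1.0), site, n_data, tilted, lr=0.05)
print(to_moment(*(p + n_data * s for p, s in zip(to_natural(0.0, 1.0), site))))
```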
5. Algorithmic Implementation and Practical Considerations
Algorithmic summaries for PBP and SEP-PBP feature initialization of posterior means and variances to prior values or suitably broad (large-variance) settings, small random perturbations for symmetry breaking, and iteration over data points or mini-batches. For each data example, the following steps are performed (an end-to-end sketch follows below):
- Forward pass: Propagate means/variances through all layers.
- Compute the log normalizer $\log Z$ and its gradients with respect to the output-layer moments.
- Backward pass: Reverse-mode differentiation to obtain gradients with respect to weights and variances.
- Update moment parameters using EP formulas.
- Hyperparameters: Updated by moment-matching or stochastic gradient ascent.
Sites with negative variance after an update are reverted (EP safeguard). For DGPs, minibatch training and SGD or Adam optimizers further amortize cost across batches. Early stopping based on the EP energy or held-out performance can avoid divergence (Bui et al., 2015, Hernández-Lobato et al., 2015).
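Putting these steps together, here is a minimal sketch of assumed-density-filtering PBP on a Bayesian linear model (a one-layer "network" with identity activation), where $\log Z$ and its gradients are available in closed form; it includes the negative-variance safeguard mentioned above. All names are illustrative, and the multi-layer reverse-mode backward pass is replaced by the analytic gradients of this special case.

```python
import numpy as np

def adf_pbp_linear(X, y, noise_var=0.1, prior_var=1.0, epochs=10):
    """ADF/PBP sweeps for y ≈ w·x with a factorized Gaussian posterior over w."""
    n, d = X.shape
    m = np.zeros(d)                    # posterior means, initialized at the prior
    v = np.full(d, prior_var)          # posterior variances
    for _ in range(epochs):
        for x, t in zip(X, y):
            # Forward pass: Gaussian marginal of the output f = w·x.
            m_f = x @ m
            v_f = (x**2) @ v + noise_var         # includes observation noise
            # "Backward pass": analytic gradients of log Z = log N(t | m_f, v_f).
            dlogZ_dm = x * (t - m_f) / v_f
            dlogZ_dv = 0.5 * x**2 * ((t - m_f)**2 / v_f**2 - 1.0 / v_f)
            # EP/ADF moment-matching update (Minka-style).
            m_new = m + v * dlogZ_dm
            v_new = v - v**2 * (dlogZ_dm**2 - 2.0 * dlogZ_dv)
            # Safeguard: keep updates only where the new variance stays positive.
            ok = v_new > 0
            m[ok], v[ok] = m_new[ok], v_new[ok]
    return m, v

# Toy usage on synthetic data.
rng = np.random.default_rng(1)
w_true = rng.normal(size=3)
X = rng.normal(size=(200, 3))
y = X @ w_true + 0.3 * rng.normal(size=200)
m, v = adf_pbp_linear(X, y, noise_var=0.09)
print("true:", w_true, "\nposterior mean:", m, "\nposterior var:", v)
```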
Inducing points are initialized with $k$-means centers (first layer) or a mapping close to the identity (higher layers). Hyperparameters are initialized using the median heuristic or large lengthscales for conservative generalization.
6. Extensions and Variant Approaches
PBP is extensible to:
- Arbitrary differentiable nonlinear activations via local quadrature or Taylor bounds (Olobatuyi, 2023).
- Deep architectures with multiple hidden layers, via repeated probabilistic forward/backward passes and EP updates.
- Multi-class outputs via softmax likelihoods and local bounds, yielding Gaussian sites on class-score vectors.
VEP-PBP hybridizes EP with variational-bound updates for non-conjugate likelihoods (e.g., the logistic), introducing Jaakkola–Jordan variational bounds with an adaptive variational parameter per data point, solved jointly with the moment-matching updates (Olobatuyi, 2023).
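For reference, the standard Jaakkola–Jordan bound that underlies such variational treatments of the logistic likelihood is exponential-quadratic in the pre-activation $z$, with one variational parameter $\xi_n$ per data point:

$$\sigma(z) \;\ge\; \sigma(\xi)\,\exp\!\left\{ \frac{z - \xi}{2} - \lambda(\xi)\left(z^2 - \xi^2\right) \right\}, \qquad \lambda(\xi) = \frac{1}{4\xi}\tanh\!\left(\frac{\xi}{2}\right).$$

Because the bound is Gaussian in form as a function of $z$, multiplying it into a Gaussian cavity keeps the tilted distribution within the Gaussian family, which is what makes the moment-matching updates tractable for classification.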
7. Empirical Performance and Uncertainty Calibration
Empirical evaluation of PBP includes regression and binary classification on UCI and other benchmark datasets. In regression, PBP demonstrates:
- Predictive RMSE on par with or better than standard backpropagation (BP) and variational inference (VI), outperforming both on several UCI datasets, including Boston Housing (Hernández-Lobato et al., 2015).
- Competitive test log-likelihood and well-calibrated predictive variance, with uncertainty estimates superior to those from standard BP or VI.
- Remarkable scalability: 10–100× faster than BP or VI due to elimination of costly hyperparameter tuning; on very large datasets, runtime is only modestly higher than deterministic backpropagation (Hernández-Lobato et al., 2015).
- Active learning: PBP’s uncertainty estimates approach the benefits of Hamiltonian Monte Carlo and outperform Laplace approximation or batch EP (Hernández-Lobato et al., 2015).
For DGPs trained via SEP-PBP, two-layer architectures outperform single-layer GPs and one-dimensional-warping DGPs across multiple datasets; e.g., on Boston Housing, DGP(2,50) attains RMSE = 2.47 ± 0.49 and MLL = –2.12 ± 0.37, versus RMSE = 3.09 ± 0.63 and MLL = –2.26 ± 0.31 for GP(M=50) (see the table below; Bui et al., 2015).
Predictive variances are well-calibrated (tight but non-overconfident), and hierarchical input warping and expansion are discovered automatically. SEP-PBP DGPs avoid local optima via EP's moment-matching, outperform purely variational ELBO fitting methods, and maintain scalability (Bui et al., 2015).
Summary Table: Regression Performance (Boston Housing Example)
| Method | RMSE ± SE | MLL ± SE |
|---|---|---|
| GP (M=50) | 3.09 ± 0.63 | –2.26 ± 0.31 |
| DGP(1,50) | 2.85 ± 0.65 | –2.30 ± 0.53 |
| DGP(2,50) | 2.47 ± 0.49 | –2.12 ± 0.37 |
References
- Hernández-Lobato, J. M., and Adams, R. P., "Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks" (2015).
- Bui, T., et al., "Training Deep Gaussian Processes using Stochastic Expectation Propagation and Probabilistic Backpropagation" (2015).
- Li, Y., et al., "Stochastic Expectation Propagation" (2015).
- Minka, T. P., "A Family of Algorithms for Approximate Bayesian Inference" (2001).
- Olobatuyi, "Variational EP with Probabilistic Backpropagation for Bayesian Neural Networks" (2023).