Probabilistic Forward Pass (PFP)
- Probabilistic Forward Pass (PFP) is an analytic method that computes both predictive means and variances via closed-form moment matching, propagating uncertainty through a model in a single pass rather than by Monte Carlo sampling.
- In Bayesian neural networks it leverages Gaussian approximations of weights and activations, achieving significant speedups (up to 4,200×) compared to Monte Carlo sampling methods while maintaining competitive predictive accuracy.
- Stop-gradient corrections extend the idea to differentiable particle filtering, and custom hardware operators enable scalable, memory-efficient Bayesian inference on resource-constrained devices.
The Probabilistic Forward Pass (PFP) refers to an analytic method for propagating uncertainty through the computation graph of models with stochastic or uncertain parameters (e.g., Bayesian neural networks, particle filters) such that both predictive means and variances are computed in closed form at each layer, without recourse to expensive Monte Carlo sampling. PFP has been instantiated in various domains, including Bayesian deep learning via moment propagation, scalable variational inference for neural networks, and differentiable particle filtering, typically yielding scalable, memory-efficient, and hardware-compatible Bayesian inference with tractable uncertainty quantification (Hernández-Lobato et al., 2015, Klein et al., 28 Nov 2025, Ścibior et al., 2021).
1. Mathematical Formulation of Probabilistic Forward Pass
In Bayesian neural network contexts, each weight and activation in the network is represented as a univariate Gaussian random variable. At layer $l$, given inputs $a_j$ with elementwise means $m_j$ and variances $v_j$ and parameters $W_{ij} \sim \mathcal{N}(\mu_{W,ij}, \sigma_{W,ij}^2)$, $b_i \sim \mathcal{N}(\mu_{b,i}, \sigma_{b,i}^2)$ (all mutually independent), the output preactivations $z_i = \sum_j W_{ij} a_j + b_i$ and their variances are computed exactly using independence and the properties of Gaussian distributions.
For a fully connected layer:

$$
\mathbb{E}[z_i] = \sum_j \mu_{W,ij}\, m_j + \mu_{b,i},
\qquad
\mathrm{Var}[z_i] = \sum_j \left(\mu_{W,ij}^2\, v_j + \sigma_{W,ij}^2\, m_j^2 + \sigma_{W,ij}^2\, v_j\right) + \sigma_{b,i}^2.
$$
After the linear transformation, the nonlinearity (often a ReLU) is handled by analytically computing the mean and variance of its output under the input Gaussian, using moment matching. For a ReLU applied to $z \sim \mathcal{N}(m, v)$, these are given by:

$$
\mathbb{E}[\mathrm{ReLU}(z)] = m\,\Phi(\alpha) + \sqrt{v}\,\phi(\alpha),
\qquad
\mathrm{Var}[\mathrm{ReLU}(z)] = (m^2 + v)\,\Phi(\alpha) + m\sqrt{v}\,\phi(\alpha) - \mathbb{E}[\mathrm{ReLU}(z)]^2,
$$

with $\alpha = m/\sqrt{v}$, $\Phi$ the standard normal CDF, and $\phi$ its PDF (Klein et al., 28 Nov 2025, Hernández-Lobato et al., 2015). By repeating these steps through all layers, a Gaussian distribution is propagated for each layer’s output.
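The two steps above can be sketched in a few lines of NumPy. The function and variable names below (`pfp_dense`, `pfp_relu`, `w_mu`, `w_var`) are illustrative, not the paper's code; the formulas are the moment-matching rules just given, assuming elementwise-independent Gaussian weights and inputs:

```python
import numpy as np
from math import erf, sqrt, pi, exp

def std_normal_pdf(x):
    return exp(-0.5 * x * x) / sqrt(2 * pi)

def std_normal_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2)))

def pfp_dense(m_in, v_in, w_mu, w_var, b_mu, b_var):
    """Propagate means/variances through z = W a + b with independent
    Gaussian weights W ~ N(w_mu, w_var) and biases b ~ N(b_mu, b_var)."""
    m_out = w_mu @ m_in + b_mu
    # Var[W a] for independent W, a: mu_W^2 v + sigma_W^2 m^2 + sigma_W^2 v
    v_out = (w_mu**2) @ v_in + w_var @ (m_in**2) + w_var @ v_in + b_var
    return m_out, v_out

def pfp_relu(m, v):
    """Moment-matched mean/variance of ReLU(z) for z ~ N(m, v), elementwise."""
    m_out, v_out = np.empty_like(m), np.empty_like(v)
    for i, (mi, vi) in enumerate(zip(m, v)):
        s = sqrt(vi)
        a = mi / s
        mean = mi * std_normal_cdf(a) + s * std_normal_pdf(a)
        second = (mi**2 + vi) * std_normal_cdf(a) + mi * s * std_normal_pdf(a)
        m_out[i] = mean
        v_out[i] = max(second - mean**2, 0.0)  # guard tiny negative round-off
    return m_out, v_out
```

Stacking `pfp_dense` and `pfp_relu` layer by layer propagates a Gaussian belief through the whole network in one pass.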
In particle filter contexts, the forward pass (prediction and update) is left unchanged and all states and weights are propagated as in the original algorithm; the central innovation is a “magic-box” weight correction using a stop-gradient operator, ensuring gradients are unbiased when working within an automatic differentiation framework (Ścibior et al., 2021).
2. Probabilistic Forward Pass in Bayesian Neural Networks
In methods such as Probabilistic Backpropagation (PBP) and recent efficient deployments on embedded hardware, PFP is central to scalable training and inference in BNNs. PFP computes analytic Gaussian beliefs on each layer’s outputs through moment matching. During both training and inference, the network propagates not just point estimates but complete mean and variance vectors at each layer, culminating in a predictive distribution at the output. In PBP, this mechanism is leveraged for efficient assumed-density-filtering (ADF) updates for each weight, using derivatives of the log-partition function computed through the PFP (Hernández-Lobato et al., 2015).
Compared to classical stochastic variational inference (SVI), where uncertainty is estimated by sampling weights from the approximate posterior $q(w)$ and running many forward passes per input, PFP requires only a single pass to compute means and variances, yielding drastic speedups with comparable predictive calibration. For Dirty-MNIST, PFP-based BNNs achieve nearly identical accuracy and out-of-distribution (OOD) detection (e.g., AUROC up to 0.858 versus 0.812 for SVI with 30 samples), while being up to 4,200× faster for small batches (Klein et al., 28 Nov 2025).
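The trade-off can be illustrated on a single unit: a sampling-based estimate of the ReLU output moments (many draws per input, as in SVI) converges to the closed-form values that PFP computes in one pass. This is a toy sketch for intuition, not the paper's benchmark code:

```python
import numpy as np
from math import erf, sqrt, pi, exp

rng = np.random.default_rng(0)
m, v = 0.5, 2.0  # input Gaussian moments

# Closed-form PFP moments of ReLU(z), z ~ N(m, v): one pass
s = sqrt(v)
a = m / s
Phi = 0.5 * (1 + erf(a / sqrt(2)))
phi = exp(-0.5 * a * a) / sqrt(2 * pi)
pfp_mean = m * Phi + s * phi
pfp_var = (m * m + v) * Phi + m * s * phi - pfp_mean**2

# SVI-style Monte Carlo estimate: many sampled forward passes
z = np.maximum(rng.normal(m, s, size=100_000), 0.0)
mc_mean, mc_var = z.mean(), z.var()
```

The analytic and sampled moments agree to sampling error, but the PFP values cost a single evaluation instead of $10^5$.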
3. PFP in Differentiable Particle Filtering
The PFP formalism has been extended beyond neural networks, notably to make particle filters fully differentiable with respect to model parameters, even in the presence of discrete operations such as resampling (Ścibior et al., 2021). A naïve approach yields biased or high-variance gradients because AD paths run through the stochastic resampling step. PFP retains the unmodified particle filter forward pass and inserts a correction via stop-gradient (detach, written $\perp$) at key points in the weight calculation. The estimator for the log-marginal likelihood is

$$
\widehat{\log Z} = \sum_{t=1}^{T} \log\!\left(\frac{1}{N} \sum_{n=1}^{N} w_t^{(n)}\right),
$$

where the weights $w_t^{(n)}$ are computed so as to block gradient flow through the discrete resampling: each resampled particle's weight carries a factor of the form $w/\perp(w)$, equal to one in value but not in gradient, and all forward computations remain structurally unchanged. This approach yields correct estimators for gradients (via Fisher’s identity) and for higher-order quantities (Hessians via Louis’ identity). A marginal PF variant further reduces gradient variance, at $O(N^2)$ cost, by using a marginal resampling scheme (Ścibior et al., 2021).
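The effect of the $w/\perp(w)$ factor can be demonstrated with a minimal forward-mode AD class standing in for the framework's autodiff (the `Dual` class and `detach` helper below are illustrative, not the paper's implementation): the corrected factor has value 1, so the forward pass is numerically unchanged, yet it carries the score gradient $\partial_\theta \log w$.

```python
from dataclasses import dataclass

@dataclass
class Dual:
    """Minimal forward-mode AD number: value and derivative w.r.t. one parameter."""
    val: float
    grad: float = 0.0
    def __mul__(self, o):
        return Dual(self.val * o.val, self.grad * o.val + self.val * o.grad)
    def __truediv__(self, o):
        return Dual(self.val / o.val,
                    (self.grad * o.val - self.val * o.grad) / (o.val * o.val))

def detach(x):
    """Stop-gradient: same value, no derivative flows through."""
    return Dual(x.val, 0.0)

# A particle weight depending on a parameter theta: w = theta^2
theta = Dual(2.0, 1.0)   # seed d/dtheta = 1
w = theta * theta        # val 4.0, grad 4.0

# Stop-gradient correction applied around resampling
correction = w / detach(w)   # val 1.0, grad = w.grad / w.val = d(log w)/d(theta)
```

Multiplying downstream weights by this factor leaves every filter value untouched while injecting the score term into the gradient, which is how the estimator stays unbiased without modifying the forward pass.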
4. Implementation, Operator Design, and Hardware Considerations
Efficient deployment of PFP requires custom operator sets capable of propagating Gaussian moments through common layers and nonlinearities. Recent work extends deep learning compilers (Apache TVM) with operators such as PFP_dense, PFP_conv2d, and PFP_relu, each computing means and variances in joint passes. Optimization exploits manual schedule tuning (loop tiling, reordering, SIMD vectorization) and meta-learning-based auto-scheduling for ARM CPUs (Klein et al., 28 Nov 2025). Performance ablation shows joint operators and raw-moment representations provide up to 1.5× compute savings compared to separate implementations.
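One plausible reading of the raw-moment saving can be sketched in NumPy: representing each distribution by its mean $m$ and raw second moment $r = \mathbb{E}[x^2]$ lets the two weight-variance terms $\sigma_W^2 m^2 + \sigma_W^2 v = \sigma_W^2 r$ collapse into a single matrix product. The fused function below is an illustrative sketch of the idea behind a joint `PFP_dense` operator, not the TVM implementation:

```python
import numpy as np

def pfp_dense_joint(m_in, r_in, w_mu, w_var, b_mu, b_var):
    """Fused dense operator on raw moments (m, r), r = E[x^2].
    One matmul with w_var replaces the two needed in the (m, v)
    parameterization, since sigma^2 m^2 + sigma^2 v = sigma^2 r."""
    v_in = r_in - m_in**2                 # recover central variance once
    m_out = w_mu @ m_in + b_mu
    v_out = (w_mu**2) @ v_in + w_var @ r_in + b_var
    return m_out, v_out + m_out**2        # emit raw moments for the next layer
```

Joint computation of both moments in one operator also keeps the weight tensors in cache across the two accumulations, consistent with the reported savings over separate mean and variance kernels.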
In differentiable particle filtering, PFP integration into modern autodiff frameworks (PyTorch, TensorFlow) requires only the strategic application of detach (stop_gradient) to particle weights within the resampling logic. Memory costs remain linear in the number of particles and time steps ($O(NT)$), and empirical slowdowns relative to the unmodified PF are modest (Ścibior et al., 2021).
5. Assumptions, Trade-Offs, and Limitations
PFP enforces Gaussian (uni-modal, symmetric) approximations for both weights and activations at every layer. This is a stronger assumption than in SVI, which constrains only the weights and allows more complex (non-Gaussian, potentially multi-modal) activation distributions to arise via sampling. Consequently, PFP may underestimate epistemic uncertainty (mutual information), by up to 44% in synthetic multi-modal scenarios (Klein et al., 28 Nov 2025). A global scaling factor is introduced to calibrate the predicted variance at test time to match SVI estimates.
Practical adoption is justified when:
- Low-latency or small-batch inference is required on resource-constrained hardware (e.g., IoT, mobile CPUs).
- The network architecture and data regime support the validity of the Gaussian activation approximation (e.g., moderate depth, ReLU nonlinearity).
- Stochastic inference via SVI is computationally prohibitive, yet credible quantification of predictive uncertainty is required.
Full SVI or heavy-tailed approximations are still preferred where non-Gaussian effects (e.g., discrete gates, strong skewness) dominate or high-fidelity epistemic uncertainty is critical (Klein et al., 28 Nov 2025).
6. Summary of Results and Comparative Performance
Empirical results demonstrate that PFP achieves accuracy and OOD detection nearly identical to sampling-based SVI, with AUROC and overall classification metrics nearly matching for Dirty-MNIST across both MLP and LeNet-5 architectures. On ARM Cortex-A72 and related hardware, PFP achieves up to 4,200× speedup for batch size 1 and 100× or greater improvement for typical batch sizes (versus 30 samples via SVI). PFP models are typically 4–11× slower than deterministic NNs but vastly faster than Monte Carlo sampled Bayesian variants (Klein et al., 28 Nov 2025).
In differentiable particle filtering, PFP returns unbiased and low-variance score estimators via end-to-end AD with no changes to the core filter logic, and all additional computational or memory cost remains linear in particle and time dimensions (Ścibior et al., 2021).
References:
- "Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks" (Hernández-Lobato et al., 2015)
- "Accelerated Execution of Bayesian Neural Networks using a Single Probabilistic Forward Pass and Code Generation" (Klein et al., 28 Nov 2025)
- "Differentiable Particle Filtering without Modifying the Forward Pass" (Ścibior et al., 2021)