
BONG: Bayesian Online Natural Gradient

Updated 2 February 2026
  • BONG is a principled method for sequential Bayesian inference using online natural gradient steps equivalent to extended Kalman filtering.
  • It employs a variational Bayesian framework with single-step updates, enhancing uncertainty quantification and computational efficiency.
  • Practical approximations like diagonal and low-rank updates make BONG scalable for high-dimensional problems in Bayesian neural networks and vision-language models.

The Bayesian Online Natural Gradient (BONG) algorithm is a principled approach to sequential Bayesian inference that leverages the natural gradient—i.e., preconditioning by the Fisher Information Matrix (FIM)—to produce efficient, curvature-aware online updates for probabilistic models. BONG’s core insight is the theoretical equivalence between online natural gradient descent and extended Kalman filtering, enabling the integration of second-order information, rigorous uncertainty quantification, and scalable, deterministic approximations for high-dimensional problems such as Bayesian neural networks and fine-tuned vision-language models.

1. Theoretical Foundations and Equivalence to Kalman Filtering

BONG is grounded in the observation that natural gradient descent—a learning algorithm that scales parameter updates by the inverse Fisher information—admits an algebraic equivalence to the extended Kalman filter (EKF) in both the i.i.d. and recurrent settings. Specifically, estimating a parameter $\theta \in \mathbb{R}^n$ from a stream of observations via online natural gradient can be recast as applying an EKF to track a hidden state (the parameter vector) in light of new “pseudo-observations.” In this framework, the Fisher information accumulated from the data directly corresponds to the updated precision (inverse covariance) in the Kalman formalism (Ollivier, 2017, Abdi et al., 3 Nov 2025).

In the exponential family, the EKF’s information-filter step is

$$P_t^{-1} = P_{t-1}^{-1} + H_t^\top R_t^{-1} H_t = P_{t-1}^{-1} + F_t,$$

with $H_t$ the Jacobian of the output w.r.t. $\theta$ and $R_t$ the observation noise covariance. The parameter update takes the form

$$\theta_t = \theta_{t-1} - P_t \nabla_\theta \ell_t,$$

which matches the natural gradient step $\theta \leftarrow \theta - \eta F^{-1} \nabla \ell$ when $\eta$ and the other hyperparameters are aligned.
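This equivalence can be checked numerically for a single linear-Gaussian pseudo-observation: the natural-gradient step preconditioned by the updated covariance $P_t$ coincides with the standard Kalman-gain update. The following is a minimal sketch with illustrative dimensions and values, not code from any of the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 3, 2                      # parameter dim, observation dim

# Linear-Gaussian pseudo-observation: y = H theta + noise, noise cov R
H = rng.normal(size=(d, n))
R = 0.5 * np.eye(d)
theta_prev = rng.normal(size=n)
P_prev = np.eye(n)               # prior covariance
y = rng.normal(size=d)

# Information-filter precision update: P_t^{-1} = P_{t-1}^{-1} + H^T R^{-1} H
F_t = H.T @ np.linalg.inv(R) @ H          # per-step Fisher information
P_t = np.linalg.inv(np.linalg.inv(P_prev) + F_t)

# Gradient of the negative log-likelihood at theta_prev
grad = -H.T @ np.linalg.inv(R) @ (y - H @ theta_prev)

# Natural-gradient / EKF mean update: theta_t = theta_{t-1} - P_t * grad
theta_ng = theta_prev - P_t @ grad

# Standard Kalman-gain form of the same update
S = H @ P_prev @ H.T + R
K = P_prev @ H.T @ np.linalg.inv(S)
theta_kf = theta_prev + K @ (y - H @ theta_prev)

print(np.allclose(theta_ng, theta_kf))   # prints True: the two forms coincide
```

The agreement rests on the standard identity $K = P_t H^\top R^{-1}$ relating the Kalman gain to the posterior covariance.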

2. Variational Bayesian Formulation and Single-Step Update

In online variational Bayes, a mean-field or structured Gaussian variational posterior $q_\lambda(\theta)$ is recursively updated as new data arrive. Standard VB minimizes

$$-\mathbb{E}_{q_\lambda}[\log p(y_t|x_t,\theta)] + \mathrm{KL}(q_\lambda(\theta)\,\|\,\pi_{t|t-1}(\theta)),$$

where $\pi_{t|t-1}$ is the prior predictive. BONG simplifies this by performing a single natural gradient ascent step on the expected log-likelihood term, initialized at the prior parameter, and dropping the explicit KL regularizer. For an exponential family,

$$\lambda_t = \lambda_{t|t-1} + F(\lambda_{t|t-1})^{-1} \nabla_\lambda \mathbb{E}_{q_{t|t-1}}[\log p(y_t|x_t,\theta)],$$

and in the mean parameterization,

$$\mu_t = \mu_{t|t-1} + \nabla_\mu \mathbb{E}_{q_{t|t-1}}[\log p(y_t|x_t,\theta)].$$

BONG is thus a single-pass, one-step natural gradient mirror descent applied sequentially (Jones et al., 2024).
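The identity linking the two parameterizations is $F(\lambda)^{-1}\nabla_\lambda = \nabla_\mu$: because the Fisher matrix in natural coordinates is the Jacobian $\partial\mu/\partial\lambda$, the natural gradient in natural parameters equals the plain gradient in mean parameters. A numerical check for a one-dimensional Gaussian with a unit-variance Gaussian likelihood, where the mean-parameter gradient is available in closed form as $(y, -\tfrac12)$; finite differences stand in for the analytic Fisher, and all values are illustrative:

```python
import numpy as np

y = 1.3                            # one observation, likelihood y ~ N(theta, 1)

def to_mean(lam):
    """Natural params (lam1, lam2) -> mean params (E[theta], E[theta^2])."""
    m = -lam[0] / (2.0 * lam[1])
    v = -1.0 / (2.0 * lam[1])
    return np.array([m, m**2 + v])

def expected_loglik(lam):
    """E_q[log N(y | theta, 1)] for Gaussian q in natural parameters."""
    mu = to_mean(lam)
    return -0.5 * np.log(2 * np.pi) - 0.5 * (y**2 - 2 * y * mu[0] + mu[1])

def num_grad(f, x, eps=1e-6):
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

lam = np.array([0.5, -0.25])       # prior N(1, 2) in natural coordinates

# Fisher in natural params is the Jacobian d mu / d lam
F = np.column_stack([num_grad(lambda l: to_mean(l)[k], lam) for k in range(2)]).T

nat_grad = np.linalg.solve(F, num_grad(expected_loglik, lam))
print(nat_grad)                    # ≈ [ 1.3, -0.5 ], the mean-param gradient
```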

3. Exactness for Conjugate Exponential-Family Models

In the special case where the variational family and the likelihood form a conjugate exponential family, the BONG update precisely recovers the exact Bayesian posterior in one step. If

$$p(y_t|\theta) \propto \exp\big(s(y_t)^\top T(\theta) - A(T(\theta))\big),$$

with $q_{t|t-1}(\theta)$ also in the exponential family, then the update is

$$\eta_t = \eta_{t|t-1} + \nabla_\eta \mathbb{E}_{q_{t|t-1}}\big[s(y_t)^\top T(\theta) - A(T(\theta))\big] = \eta_{t|t-1} + [s(y_t); 1],$$

matching the canonical Bayesian filter.
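For instance, with a Bernoulli likelihood and a conjugate Beta($a$, $b$) prior, writing the prior in the coordinates $\eta = (a-1,\, a+b-2)$ induced by the $s^\top T - A$ parameterization, the additive step $\eta \leftarrow \eta + [s(y_t); 1]$ with $s(y) = y$ reproduces the exact conjugate posterior after a pass over the stream. A sketch with illustrative prior values:

```python
import numpy as np

rng = np.random.default_rng(1)
ys = rng.integers(0, 2, size=20)          # Bernoulli data stream

a0, b0 = 2.0, 2.0                         # Beta(a0, b0) prior
eta = np.array([a0 - 1.0, a0 + b0 - 2.0]) # natural coordinates (a-1, a+b-2)

for y in ys:
    eta += np.array([float(y), 1.0])      # BONG step: eta <- eta + [s(y); 1]

a_post = eta[0] + 1.0                     # back to Beta shape parameters
b_post = eta[1] - eta[0] + 1.0

a_exact = a0 + ys.sum()                   # exact conjugate Beta posterior
b_exact = b0 + len(ys) - ys.sum()
print(a_post == a_exact, b_post == b_exact)   # prints True True
```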

4. Practical Approximations: Gaussian, Diagonal, and Low-Rank BONG

For non-conjugate problems, notably neural network parameterizations, full-covariance computations are infeasible. BONG employs approximations:

  • Diagonal approximation: Store and update only the diagonal elements of the covariance or Fisher, leading to $\mathcal{O}(n)$ cost (Abdi et al., 3 Nov 2025, Mohan et al., 17 Nov 2025).
  • Diagonal plus low-rank (DLR): Represent the precision as $\Lambda + W W^\top$, where $\Lambda$ is diagonal and $W$ is $n \times r$ with $r \ll n$; the Fisher/precision updates and associated SVD projections allow memory and computation savings at a small cost in fidelity (Jones et al., 2024).
  • Monte Carlo and linearized EKF: Estimate the required moments $\mathbb{E}[\nabla \log p]$ and $\mathbb{E}[\nabla^2 \log p]$ via sampling or first-order Taylor approximations, respectively.

The BONG step for a Gaussian family is

$$m_t = m_{t|t-1} + \Sigma_{t|t-1}\, \mathbb{E}_{q_{t|t-1}}[\nabla_\theta \log p(y_t|x_t,\theta)],$$

$$\Sigma_t^{-1} = \Sigma_{t|t-1}^{-1} - \mathbb{E}_{q_{t|t-1}}[\nabla_\theta^2 \log p(y_t|x_t,\theta)].$$

For minibatch or Bayesian neural network training, these moments are computed efficiently via Monte Carlo or diagonal/DLR approximations (Abdi et al., 3 Nov 2025, Mohan et al., 17 Nov 2025, Jones et al., 2024).
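A minimal diagonal-Gaussian BONG step for Bernoulli-logistic observations, with both expectations estimated by Monte Carlo, might look as follows. This is a sketch under the two update equations above; `bong_diag_step`, the sample count, and the synthetic stream are illustrative, not from a reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bong_diag_step(m, s2, x, y, n_samples=64, rng=None):
    """One diagonal-Gaussian BONG step for a Bernoulli-logistic observation.

    m, s2 are the predictive mean and diagonal variances of q_{t|t-1};
    both expectations are Monte Carlo estimates under that Gaussian.
    """
    rng = rng if rng is not None else np.random.default_rng()
    thetas = m + np.sqrt(s2) * rng.standard_normal((n_samples, m.size))
    p = sigmoid(thetas @ x)                  # per-sample success probability
    g = np.mean(y - p) * x                   # E_q[grad_theta log p] (logistic score)
    h = -np.mean(p * (1.0 - p)) * x**2       # E_q[diagonal of hess log p]
    s2_new = 1.0 / (1.0 / s2 - h)            # precision update (note h <= 0)
    m_new = m + s2 * g                       # mean step preconditioned by Sigma_{t|t-1}
    return m_new, s2_new

# Online pass over a synthetic logistic-regression stream
rng = np.random.default_rng(0)
theta_true = np.array([1.0, -2.0, 0.5])
m, s2 = np.zeros(3), np.ones(3)
for _ in range(100):
    x = rng.normal(size=3)
    y = float(rng.random() < sigmoid(x @ theta_true))
    m, s2 = bong_diag_step(m, s2, x, y, rng=rng)
```

Because the logistic Hessian is non-positive, the per-coordinate precisions only grow, so the variances shrink monotonically as evidence accumulates.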

5. Iterative Natural Gradient Filtering in Nonlinear Systems

For highly nonlinear systems, the BONG approach can be extended to iterative, locally optimal natural gradient flows on the manifold of Gaussians. At each step, the update seeks stationary points of a variational objective combining expected loss and KL divergence to the prediction, leveraging the Fisher metrics of both mean and precision blocks. The resulting algorithm—termed NANO—proceeds by repeated natural-gradient steps on

$$J(\hat x, P) = \mathbb{E}_{\mathcal{N}(\hat x, P)}[-\log p(y|x)] + D_{\mathrm{KL}}\big(\mathcal{N}(\hat x, P) \,\|\, \mathcal{N}(\hat x_{t|t-1}, P_{t|t-1})\big),$$

with Fisher-based preconditioning on both mean and covariance (Cao et al., 2024). Specializing to the linear-Gaussian case, a single BONG step recovers the EKF.
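For a scalar nonlinear measurement this iterative scheme reduces to repeated Gauss-Newton-style refinements of the stationarity condition, which balances the data-fit gradient against the pull toward the prediction. The sketch below uses linearization at the current mean to estimate the required moments; the measurement model, values, and iteration budget are illustrative, not taken from the cited paper:

```python
import numpy as np

# Scalar nonlinear measurement y = sin(x) + noise with variance r;
# Gaussian prediction N(x0, p0).
h, dh = np.sin, np.cos
x0, p0, r = 0.6, 1.0, 0.05
y = 0.9

m, P = x0, p0
for _ in range(20):                      # iterate to a stationary point of J
    J = dh(m)                            # linearize the measurement at the mean
    grad = J * (y - h(m)) / r            # linearized E_q[grad log p(y|x)]
    hess = -J * J / r                    # Gauss-Newton estimate of E_q[hess log p]
    P = 1.0 / (1.0 / p0 - hess)          # precision: prediction plus information
    m = m - P * (-grad + (m - x0) / p0)  # balance data fit against KL to prediction
print(m, P)
```

At the fixed point the bracketed term vanishes, which is exactly the MAP stationarity condition of the quadratic-plus-measurement objective; a single pass through the loop with a linear $h$ reproduces the EKF update.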

6. Computational Complexity and Scalability

BONG is explicitly designed for computational tractability in high-dimensional inference:

  • Full-covariance Kalman/natural gradient is $\mathcal{O}(n^2)$ per step, impractical for $n \sim 10^6$.
  • Diagonalization reduces cost and storage to $\mathcal{O}(n)$; DLR maintains scalability for modest-rank updates.
  • Per-step complexity can match or marginally exceed first-order optimizers (SGD/Adam), making BONG suitable for online training of neural adapters or large language-vision models (Abdi et al., 3 Nov 2025, Mohan et al., 17 Nov 2025).

7. Uncertainty Quantification, Trust-Region Mechanisms, and Empirical Results

BONG’s maintenance of an updated approximate posterior $q(\theta) = \mathcal{N}(m, P)$ endows each parameter with a calibrated uncertainty. Applications exploit this for Bayesian trust-region regularization, notably scaling updates by a Mahalanobis-distance factor $\lambda = e^{-\alpha d_M}$ when the incoming data are out-of-distribution, resulting in enhanced OOD robustness. Experimental results on few-shot vision-language adaptation demonstrate consistent outperformance of, or parity with, first-order baselines in both in-distribution and OOD tasks, with marked improvements under severe domain shift (e.g., +9 pp OOD robustness under ImageNet-C corruptions) (Abdi et al., 3 Nov 2025). Analogous benefits for BNNs include improved calibration and accelerated convergence compared to variational inference with first-order optimizers (Mohan et al., 17 Nov 2025).
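The trust-region factor itself is cheap to compute from a diagonal posterior. The helper below is a hypothetical sketch of the $\lambda = e^{-\alpha d_M}$ rule; the function name, the choice of reference point, and $\alpha$ are illustrative assumptions, not taken from the cited papers:

```python
import numpy as np

def trust_region_scale(feat, m, p_diag, alpha=0.1):
    """Scale factor exp(-alpha * d_M), where d_M is the Mahalanobis distance
    of an incoming point to the current posterior N(m, diag(p_diag))."""
    d_m = np.sqrt(np.sum((feat - m) ** 2 / p_diag))
    return np.exp(-alpha * d_m)

m, p_diag = np.zeros(4), np.ones(4)
lam_in = trust_region_scale(0.1 * np.ones(4), m, p_diag)   # near the mean: ~1
lam_out = trust_region_scale(5.0 * np.ones(4), m, p_diag)  # far OOD: shrunk
print(lam_in, lam_out)
```

In-distribution inputs leave the update essentially untouched, while far-out-of-distribution inputs shrink it exponentially in their Mahalanobis distance.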

