
Variational Predictive Coding

Updated 5 January 2026
  • Predictive coding is a framework that uses variational Bayesian inference to minimize free energy and drive hierarchical error correction.
  • It employs local update rules, including Hebbian plasticity and Langevin sampling strategies, to maintain biological plausibility while quantifying uncertainty.
  • The approach unifies probabilistic inference with self-supervised learning, enhancing applications in speech, vision, and neuroscientific modeling.

Predictive Coding Under a Variational View

Predictive coding (PC) is a computational framework that models information processing as hierarchical inference under a generative model, in which predictions are continuously compared to sensory or subordinate inputs through precision-weighted error signaling. Cast in the variational perspective, predictive coding is revealed as a special instance of variational Bayesian inference, realized by minimizing the variational free energy (equivalently, maximizing the evidence lower bound, ELBO). This variational formulation provides a rigorous, unifying foundation for PC algorithms, bridging classical Bayesian inference, the information bottleneck principle, and deep generative architectures such as variational autoencoders (VAEs). The variational view is crucial for extending predictive coding to modern probabilistic machine learning tasks, quantifying uncertainty, and enabling biologically plausible local learning.

1. Variational Foundations of Predictive Coding

At the core of variational predictive coding lies the minimization of free energy for hierarchical latent-variable generative models. For a typical $L$-layer model with latent variables $z_0, x_1, \ldots, x_L$ (with $x_l \equiv z_l$) and parameters $\theta = \{W_l, \Sigma_l\}_{l=1}^L$, the joint density is

$$p(x_0, x_1, \ldots, x_L, \theta) = p(x_0) \prod_{l=1}^L p(x_l \mid x_{l-1}, W_l, \Sigma_l)\, p(W_l, \Sigma_l)$$

with each $p(x_l \mid x_{l-1}, W_l, \Sigma_l)$ Gaussian. Variational inference proceeds by introducing a factorized approximate posterior $q(x, \theta) = q(x)\, q(\theta)$ and minimizing the variational free energy (negative ELBO)

$$F[q] = \int q(x, \theta) \log \frac{q(x, \theta)}{p(y, x, \theta)}\, dx\, d\theta$$

which, substituting the factorizations $p(y, x, \theta) = p(y \mid x, \theta)\, p(x)\, p(\theta)$ and $q(x, \theta) = q(x)\, q(\theta)$, decomposes into the canonical ELBO form

$$F[q] = -\mathbb{E}_{q(x, \theta)}[\log p(y \mid x, \theta)] + \mathrm{KL}[q(x) \,\Vert\, p(x)] + \mathrm{KL}[q(\theta) \,\Vert\, p(\theta)]$$

(Tschantz et al., 31 Mar 2025, Millidge et al., 2021, Salvatori et al., 2023).
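To make the decomposition concrete, here is a minimal numerical sketch (my own toy example, not taken from the cited papers; the one-layer linear-Gaussian model, dimensions, and the fixed diagonal posterior are all illustrative assumptions) that estimates $F[q]$ by Monte Carlo plus a closed-form KL, and verifies that $-F[q]$ lower-bounds the log evidence:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-layer Gaussian model (illustrative, not from the cited papers):
#   prior      p(x)   = N(0, I)
#   likelihood p(y|x) = N(W x, sigma2 * I)
# Approximate posterior q(x) = N(mu, diag(s2)); the free energy is
#   F[q] = -E_q[log p(y|x)] + KL[q(x) || p(x)]
d_x, d_y = 4, 8
W = rng.normal(size=(d_y, d_x))
sigma2 = 0.5
y = rng.normal(size=d_y)

mu = rng.normal(size=d_x)        # arbitrary (suboptimal) posterior mean
s2 = np.full(d_x, 0.3)           # arbitrary posterior variances

def free_energy(mu, s2, n_samples=20000):
    # Monte Carlo estimate of the expected negative log-likelihood under q
    x = mu + np.sqrt(s2) * rng.normal(size=(n_samples, d_x))
    resid = y - x @ W.T
    nll = 0.5 * (resid ** 2).sum(axis=1) / sigma2 \
        + 0.5 * d_y * np.log(2 * np.pi * sigma2)
    # Closed-form KL[N(mu, diag(s2)) || N(0, I)]
    kl = 0.5 * (s2.sum() + mu @ mu - d_x - np.log(s2).sum())
    return nll.mean() + kl

print(free_energy(mu, s2))
```

Because $F[q] = -\mathrm{ELBO} \geq -\log p(y)$ for any $q$, the estimate above always exceeds the exact negative log evidence of the linear-Gaussian model, up to Monte Carlo noise.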

2. Standard Predictive Coding: MAP/ML Regime

Classical predictive coding algorithms adopt delta-approximate posteriors:

  • $q(x) = \delta(x - x^*)$ (MAP inference for latent states)
  • $q(\theta) = \delta(\theta - \theta^*)$ (maximum likelihood for parameters)

Minimizing the free energy in this regime yields local, neurally plausible update rules. Prediction errors $\epsilon_l \equiv x_l - W_l f(x_{l-1})$ drive inference dynamics via gradient descent, $x_l \leftarrow x_l - \alpha\, \partial E / \partial x_l$, where

$$\frac{\partial E}{\partial x_l} = \Sigma_l^{-1}(x_l - W_l f(x_{l-1})) - D_l W_{l+1}^T \Sigma_{l+1}^{-1}(x_{l+1} - W_{l+1} f(x_l))$$

with $D_l$ the diagonal matrix of activation-function derivatives $f'(x_l)$. Parameters are updated Hebbian-style, $\Delta W_l \propto \Sigma_l^{-1}(x_l - W_l f(x_{l-1}))\, f(x_{l-1})^T$. These quantities are strictly local: only pre- and post-synaptic activity and local errors are required (Tschantz et al., 31 Mar 2025, Millidge et al., 2021).
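These local dynamics can be sketched in a few lines of NumPy. The following is a hedged toy implementation of the MAP inference loop and one Hebbian weight update: the layer sizes, weight scale, step size, and unit precisions $\Sigma_l = I$ are illustrative assumptions of mine, not settings from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.tanh
df = lambda x: 1.0 - np.tanh(x) ** 2

# Toy 3-layer chain x0 -> x1 -> x2 with x2 clamped to data; unit precisions
# (Sigma_l = I). Layer sizes, weight scale, and step size are illustrative.
dims = [3, 5, 8]
W = [None] + [rng.normal(scale=0.3, size=(dims[l], dims[l - 1]))
              for l in (1, 2)]

def energy(x):
    # E = sum_l ||eps_l||^2 / 2 with eps_l = x_l - W_l f(x_{l-1})
    return sum(0.5 * np.sum((x[l] - W[l] @ f(x[l - 1])) ** 2) for l in (1, 2))

def infer(y, n_steps=300, alpha=0.1):
    """MAP inference: gradient descent on E over the hidden states."""
    x = [rng.normal(size=d) for d in dims]
    x[2] = y                                        # clamp observed layer
    for _ in range(n_steps):
        eps = [None] + [x[l] - W[l] @ f(x[l - 1]) for l in (1, 2)]
        # dE/dx_l = eps_l - diag(f'(x_l)) W_{l+1}^T eps_{l+1}: purely local
        x[0] -= alpha * (-df(x[0]) * (W[1].T @ eps[1]))
        x[1] -= alpha * (eps[1] - df(x[1]) * (W[2].T @ eps[2]))
    return x

y = rng.normal(size=dims[2])
x = infer(y)
# Hebbian learning step from converged states: Delta W_l ∝ eps_l f(x_{l-1})^T
eps2 = x[2] - W[2] @ f(x[1])
dW2 = np.outer(eps2, f(x[1]))
```

Note that each state update touches only the errors immediately above and below that layer, which is exactly the locality property the text emphasizes.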

3. Fully Variational/Bayesian Predictive Coding Extensions

Bayesian Predictive Coding (BPC) generalizes PC by retaining $q(x) = \delta(x - x^*)$ but promoting $q(\theta)$ to a full variational posterior, specifically a Matrix-Normal–Wishart distribution for each pair $(W_l, \Sigma_l)$:

$$q(W_l, \Sigma_l) = \mathcal{MN}(W_l \mid M_l, V_l, \Sigma_l^{-1})\, \mathcal{W}(\Sigma_l^{-1} \mid \Psi_l, \nu_l)$$

Thanks to conjugacy, closed-form Hebbian updates emerge for $M_l, V_l, \Psi_l, \nu_l$ by accumulating sufficient statistics over posterior samples $x^*$:

$$V_l^{-1} \leftarrow V_l^{(0)-1} + \sum_n f(x_{l-1}^{*n})\, f(x_{l-1}^{*n})^T$$

$$M_l \leftarrow V_l \left[M_l^{(0)} V_l^{(0)-1} + \sum_n f(x_{l-1}^{*n})\, (x_l^{*n})^T\right]$$

Crucially, this Bayesian extension preserves the locality and biological plausibility of PC while providing uncertainty quantification—aleatoric via propagation through q(θ)q(\theta), epistemic via posterior sampling (Tschantz et al., 31 Mar 2025).
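The conjugate updates for a single layer can be illustrated with a small sketch in standard Bayesian-linear-regression convention (I take $W_l$ as $d_{\mathrm{out}} \times d_{\mathrm{in}}$ and write the mean update with the transpose that makes the shapes consistent; dimensions, sample counts, and priors are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.tanh

# One-layer conjugate update sketch: given N posterior samples x*_{l-1}, x*_l
# (here synthesized from a known W_true plus noise), accumulate sufficient
# statistics to get the posterior over W_l. Shapes/conventions illustrative.
d_in, d_out, N = 4, 6, 500
W_true = rng.normal(size=(d_out, d_in))          # ground-truth weights
X_prev = rng.normal(size=(N, d_in))              # samples of x*_{l-1}
F_mat = f(X_prev)
X_cur = F_mat @ W_true.T + 0.05 * rng.normal(size=(N, d_out))  # x*_l

V0_inv = np.eye(d_in)                            # prior column precision
M0 = np.zeros((d_out, d_in))                     # prior mean of W_l

# V_l^{-1} <- V_0^{-1} + sum_n f(x_{l-1}) f(x_{l-1})^T
V_inv = V0_inv + F_mat.T @ F_mat
V = np.linalg.inv(V_inv)
# M_l <- [M_0 V_0^{-1} + sum_n x_l f(x_{l-1})^T] V_l
M = (M0 @ V0_inv + X_cur.T @ F_mat) @ V
```

Both updates use only products of pre-synaptic features and post-synaptic activities, i.e. the same local statistics as the Hebbian rule of Section 2; with enough samples, the posterior mean $M_l$ concentrates on the generating weights.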

4. The Predictive Information Bottleneck and Mutual Information View

Variational predictive coding is naturally interpreted under the predictive information bottleneck (PIB) framework. Here, one seeks an encoder $q(z \mid x)$ that compresses $X$ while maximizing predictive information about $Y$:

$$\mathcal{L}_{\mathrm{PIB}}[q] = I(Z; X) - \beta\, I(Z; Y)$$

which, for suitable variational decoders $p_\theta(y \mid z)$ and a tractable reference marginal $r(z)$, yields

$$\mathcal{L}_{\mathrm{var}}(q, p_\theta) = \mathbb{E}_{p(x, y)}\left[\mathbb{E}_{q(z \mid x)}[-\beta \log p_\theta(y \mid z)] + \mathrm{KL}(q(z \mid x) \,\Vert\, r(z))\right]$$

This unifies classical Bayesian inference ($\beta = 1$) and modern self-supervised objectives. The predictive coding loop of prediction, comparison, and error-driven update emerges as a message-passing implementation of this bound (Alemi, 2019, Meng et al., 2022).
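The variational bound above is straightforward to evaluate by Monte Carlo. Below is a minimal sketch for a linear-Gaussian encoder $q(z \mid x) = \mathcal{N}(Ax, s^2 I)$, a unit-variance Gaussian decoder $p_\theta(y \mid z) = \mathcal{N}(Bz, I)$, and a standard-normal reference $r(z)$; all names, shapes, and parameter values are illustrative assumptions of mine, not from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(2)

# Variational PIB objective: E[ -beta * log p(y|z) + KL(q(z|x) || r(z)) ]
# with reparameterized samples z = A x + s * noise. Shapes illustrative.
d_x, d_z, d_y, N = 6, 2, 3, 200
A = rng.normal(scale=0.3, size=(d_z, d_x))   # encoder weights
B = rng.normal(scale=0.3, size=(d_y, d_z))   # decoder weights
s2 = 0.1                                     # encoder variance
beta = 4.0                                   # bottleneck trade-off

X = rng.normal(size=(N, d_x))
Y = rng.normal(size=(N, d_y))

def pib_loss(A, B, X, Y):
    mu = X @ A.T                                        # encoder means
    z = mu + np.sqrt(s2) * rng.normal(size=mu.shape)    # reparameterized z
    resid = Y - z @ B.T
    # per-example -log p(y|z) for a unit-variance Gaussian decoder
    nll = 0.5 * np.sum(resid ** 2, axis=1) + 0.5 * d_y * np.log(2 * np.pi)
    # per-example closed-form KL[N(mu, s2 I) || N(0, I)]
    kl = 0.5 * (np.sum(mu ** 2, axis=1) + d_z * (s2 - 1 - np.log(s2)))
    return np.mean(beta * nll + kl)

print(pib_loss(A, B, X, Y))
```

In a training loop one would differentiate this loss with respect to $A$ and $B$ (e.g. via autodiff); here the point is only that both terms of the bound reduce to local, sample-based quantities.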

5. Algorithmic Advances: Structured Graphs, Sampling, and Curvature

  • Structured models: Divide-and-Conquer Predictive Coding (DCPC) extends PC to general graphical models, updating each latent coordinate by Langevin proposals drawn from its exact complete conditional and employing particle-based variational approximations. This respects inter-variable correlations and produces provably correct variational and maximum-likelihood updates with local computations (Sennesh et al., 2024).
  • Langevin sampling: Injection of Gaussian noise in predictive-coding inference recasts it as Langevin MCMC. This enables direct sampling from the latent posterior, tightening the ELBO and improving robustness. Encoder amortization and warm starts further accelerate mixing (Zahid et al., 2023).
  • Curvature correction: Standard PC omits the Hessian (entropy) term present in the Laplace variational Bayes approximation, which regularizes sharpness and prevents over-certainty. Monte Carlo-estimated ELBOs using curvature-sensitive sampling and block-diagonal Hessian approximations recover calibrated uncertainty and improve both likelihood and sample diversity (Zahid et al., 2023).
| Algorithm | Variational Approx. | Locality | Uncertainty Quantification | Reference |
| --- | --- | --- | --- | --- |
| Classic PC | MAP/ML, Dirac $q(\cdot)$ | Yes | No | (Tschantz et al., 31 Mar 2025) |
| Bayesian PC (BPC) | MAP $q(x)$, full $q(\theta)$ | Yes | Yes | (Tschantz et al., 31 Mar 2025) |
| DCPC | Particle $q(z)$ | Yes | Yes | (Sennesh et al., 2024) |
| Laplace-MC PC | Gaussian $q(z)$ w/ Hessian | Yes/Approx. | Yes (curvature-consistent) | (Zahid et al., 2023) |
| Langevin PC | Sampled $q(z)$ (Langevin) | Yes | Yes | (Zahid et al., 2023) |
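The Langevin recasting is easy to demonstrate on a one-dimensional conjugate toy problem (this example is mine, not from the cited papers): adding Gaussian noise of variance $2\eta$ to each gradient step on the PC energy turns deterministic descent into unadjusted Langevin dynamics, whose samples approximate the exact posterior rather than collapsing to the MAP point.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy conjugate model: prior x ~ N(0, 1), likelihood y | x ~ N(x, 1),
# so the exact posterior is N(y/2, 1/2). The PC energy is
#   E(x) = x^2/2 + (y - x)^2/2,  grad E = 2x - y.
y = 1.5
def grad_E(x):
    return 2.0 * x - y

eta = 0.01                  # step size
x = 0.0
samples = []
for t in range(60000):
    # Langevin step: gradient descent + N(0, 2*eta) noise
    x = x - eta * grad_E(x) + np.sqrt(2 * eta) * rng.normal()
    if t > 10000:           # discard burn-in
        samples.append(x)

samples = np.array(samples)
print(samples.mean(), samples.var())   # ≈ posterior mean y/2 and variance 1/2
```

Without the noise term the same iteration converges to the MAP estimate $y/2$ with zero spread; the injected noise is what recovers the posterior variance and thereby tightens the ELBO, as described above.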

6. Application Domains and Empirical Insights

  • Speech and visual SSL: The variational predictive coding framework underlies and unifies widely-used self-supervised learning objectives including HuBERT, APC, CPC, wav2vec, and BEST-RQ. Extensions such as entropy-maximizing soft assignments and Gumbel-Softmax sampling yield improved pretraining ELBOs and superior downstream performance in phone classification, F0 tracking, speaker recognition, and ASR, demonstrating the practical power of the variational formulation (Yeh et al., 31 Dec 2025).
  • Time-series and neuroscience: Variational predictive coding methods, such as CPIC, exploit mutual information bounds and stochastic encoders to robustly extract low-dimensional, maximally predictive representations from noisy high-dimensional dynamics, outperforming conventional deterministic methods especially under severe noise (Meng et al., 2022).
  • Recurrent and robotic models: Variational PC-RNNs employ meta-priors to interpolate between deterministic chaos and stochastic generation, with optimal generalization at intermediate settings. These frameworks enable realistic mental simulation and efficient planning with working memory and attention (Ahmadi et al., 2018, Jung et al., 2019, Ahmadi et al., 2017).

7. Theoretical Significance and Biological Plausibility

The variational view of predictive coding provides a formal equivalence between PC, variational inference, the information bottleneck, and Bayesian learning. It underpins both the neurobiological plausibility of error-driven local learning (as hypothesized in cortical columns) and the development of scalable, uncertainty-aware deep learning algorithms with local, Hebbian updates. These insights clarify the deep connection between cortical computation and contemporary machine learning objectives, and inform ongoing research into biologically motivated credit assignment, robust online learning, and self-supervised representation learning (Marino, 2020, Salvatori et al., 2023, Millidge et al., 2021).

In summary, predictive coding under the variational view serves as a mathematically rigorous, biologically plausible, and computationally powerful framework, unifying multiple paradigms in statistical inference, neural computation, and modern machine learning (Tschantz et al., 31 Mar 2025).
