
Differentiable Generalised Predictive Coding

Updated 6 January 2026
  • DGPC is a framework that generalizes classical predictive coding by integrating differentiable operations into arbitrary computation graphs and generative models.
  • It combines hierarchical Bayesian inference with local message passing and gradient-based updates to enable effective end-to-end learning.
  • DGPC achieves competitive performance with backpropagation while offering biologically plausible, local learning rules and iterative inference mechanisms.

Differentiable Generalised Predictive Coding (DGPC) refers to a class of learning and inference algorithms wherein predictive coding principles are generalized to arbitrary architectures, probability distributions, and optimization settings while retaining full differentiability for end-to-end training. DGPC unifies hierarchical Bayesian modeling, local message-passing, and distributed gradient descent, providing a biologically plausible yet practically competitive alternative to standard backpropagation in deep learning.

1. Theoretical Framework and Generative Model Foundations

DGPC generalizes the classical predictive coding (PC) paradigm, which minimizes hierarchically organized prediction errors via local message-passing and Hebbian-like plasticity, to arbitrary differentiable computational graphs and generative models. The fundamental structure involves a layered or graph-structured network of latent variables and observed data, governed by probabilistic generative models of the form $p(x_{0:L};\theta) = p(x_0)\prod_{l=1}^L p(x_l \mid x_{l-1};\theta_l)$, where $x_L$ is observed (data or target), $x_0$ may be a deterministic or learnable prior, and each conditional $p(x_l \mid x_{l-1};\theta_l)$ is an arbitrary tractable parametric family (e.g., Gaussian or categorical) (Pinchetti et al., 2022).

In the Gaussian case, the joint log-density is quadratic: $\log p(\mathbf{x},\mathbf{z}\mid\theta) = -\tfrac{1}{2}\|\mathbf{z}\|^2 - \tfrac{1}{2\sigma^2}\|\mathbf{x} - f_\theta(\mathbf{z})\|^2$, with $f_\theta$ a deep neural network parameterized by $\theta$ (Zahid et al., 2023). Generalizations to non-Gaussian likelihoods and priors, including categorical and mixture models, are also standard in DGPC (Pinchetti et al., 2022).
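As a minimal illustration (not from the cited papers), the Gaussian joint log-density above can be written down directly; here a fixed linear map stands in for the deep network $f_\theta$, and the shapes are purely hypothetical:

```python
import numpy as np

def gaussian_log_joint(x, z, f_theta, sigma=1.0):
    """log p(x, z | theta), up to additive normalisation constants:
    a standard-normal prior on z plus a Gaussian likelihood whose
    mean is f_theta(z) and whose variance is sigma^2."""
    prior_term = -0.5 * np.sum(z ** 2)
    likelihood_term = -0.5 / sigma ** 2 * np.sum((x - f_theta(z)) ** 2)
    return prior_term + likelihood_term

# Toy linear decoder standing in for a deep network f_theta.
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
decoder = lambda z: W @ z

z = np.array([0.5, -0.5])
x = decoder(z)                              # a perfectly predicted observation
print(gaussian_log_joint(x, z, decoder))   # only the prior term remains: -0.25
```

When the observation is predicted exactly, the likelihood term vanishes and only the quadratic prior penalty on $\mathbf{z}$ remains.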

Key architectural extensions include:

  • Arbitrarily deep hierarchies or directed acyclic computation graphs, permitting architectures such as CNNs, RNNs, LSTMs, transformers, and more (Millidge et al., 2020, Salvatori et al., 2021).
  • Modular, nodal structures facilitating the specification of hierarchical and dynamical predictions, as in the generalized Hierarchical Gaussian Filter (HGF) (Weber et al., 2023) and dynamical GPC (Ofner et al., 2021).

2. Inference Dynamics and Energy Functionals

Inference in DGPC is cast as minimization of an energy or free-energy functional $F$ associated with the generative model and parametrization: $F = \sum_{\ell} \tfrac{1}{2}\|e_\ell\|^2$ with $e_\ell = a_\ell - f_\ell(a_{\ell-1};\theta_\ell)$ for layered networks, or, more generally, a sum over squared prediction errors or local Kullback-Leibler divergences at each node or layer (Millidge et al., 2022, Pinchetti et al., 2022). For non-Gaussian architectures, the energy term is generalized to KL divergences: $F_{\mathrm{KL}}(\phi,\theta) = \sum_{l=0}^{L} D_{KL}[\mathcal{X}_l(\phi_l^{\mathcal{D}}) \,\|\, \widehat{\mathcal{X}}_l(f_l(\phi_{l-1}^{\mathcal{D}};\theta_l))]$, where $\phi_l^{\mathcal{D}}$ encodes variational parameters at layer $l$ (Pinchetti et al., 2022).

Latent activities are iteratively updated by gradient descent on $F$: $a_\ell^{(t+1)} = a_\ell^{(t)} - \eta\, \frac{\partial F}{\partial a_\ell}\big|_{a^{(t)}}$. This yields bidirectional message passing, with bottom-up error signals and top-down corrections, within general computation graphs (Millidge et al., 2022, Millidge et al., 2020).
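For a linear hierarchy this update has a simple closed form, since $\partial F/\partial a_\ell = e_\ell - W_{\ell+1}^\top e_{\ell+1}$. A minimal NumPy sketch (layer sizes, weights, and step size are illustrative, not taken from the cited papers):

```python
import numpy as np

def inference_step(a, W, eta=0.1):
    """One gradient step on F = sum_l 0.5 * ||e_l||^2 with linear
    predictions f_l(a_{l-1}) = W[l-1] @ a[l-1]; the top layer a[0]
    (prior) and the bottom layer a[-1] (observation) stay clamped."""
    errors = [a[l + 1] - W[l] @ a[l] for l in range(len(W))]
    new_a = [v.copy() for v in a]
    for l in range(1, len(a) - 1):                   # hidden layers only
        grad = errors[l - 1] - W[l].T @ errors[l]    # dF/da_l
        new_a[l] = a[l] - eta * grad
    return new_a

def free_energy(a, W):
    return sum(0.5 * np.sum((a[l + 1] - W[l] @ a[l]) ** 2)
               for l in range(len(W)))

rng = np.random.default_rng(0)
W = [rng.normal(size=(4, 2)) * 0.3, rng.normal(size=(3, 4)) * 0.3]
a = [np.ones(2), np.zeros(4), np.ones(3)]            # prior, hidden, observation
a_next = inference_step(a, W)
print(free_energy(a_next, W) < free_energy(a, W))    # True: the step lowers F
```

Each hidden layer's update uses only its own error and the error of the layer it predicts, which is the local message-passing structure described above.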

Stability and convergence in arbitrary graphs are maintained by ensuring appropriate step size choices, graph-level synchronization, and use of fixed-prediction assumptions within the Laplace mean-field variational framework (Millidge et al., 2020, Salvatori et al., 2021).

3. Learning Rules for Model Parameters

Once inference converges, DGPC applies local plasticity rules for synaptic weights by gradient descent on $F$: $\Delta\theta_\ell = -\alpha\, \frac{\partial F}{\partial \theta_\ell}$. For parameter-linear mappings (e.g., $\sigma(\theta_\ell u_{\ell-1})$), these updates reduce to local Hebbian rules (error times presynaptic activity), further enhancing biological plausibility (Millidge et al., 2022, Pinchetti et al., 2022). In general, the learning step alternates with inference as a generalized expectation-maximization scheme, with convergence guarantees under mild smoothness and step-size assumptions (Millidge et al., 2022).
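In the linear case the parameter gradient factorises into an outer product of the layer's error and its presynaptic activity. A toy NumPy sketch (shapes and learning rate are illustrative assumptions):

```python
import numpy as np

def hebbian_update(W, a, alpha=0.05):
    """Local weight update Delta W_l = alpha * e_l @ a_{l-1}^T, the
    negative gradient of F for linear predictions: each layer uses
    only its own prediction error and its presynaptic activity."""
    new_W = []
    for l in range(len(W)):
        e = a[l + 1] - W[l] @ a[l]               # local prediction error
        new_W.append(W[l] + alpha * np.outer(e, a[l]))
    return new_W

rng = np.random.default_rng(1)
W = [rng.normal(size=(3, 2)) * 0.2]
a = [np.ones(2), np.ones(3)]                     # fixed (inferred) activities
F = lambda W: 0.5 * np.sum((a[1] - W[0] @ a[0]) ** 2)
print(F(hebbian_update(W, a)) < F(W))            # True: the update lowers F
```

No global error signal or weight transport appears anywhere in the update, which is the locality property emphasised in the text.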

In probabilistic generative models, DGPC learning steps optimize tight evidence lower bounds (ELBOs). When incorporating sampling-based (Langevin) inference, as in "Sample as You Infer" (Zahid et al., 2023), optimizing with respect to the average of $\log p(x, z \mid \theta)$ over approximate posterior samples yields a variational EM-style learning procedure.

4. Extensions: Distributional Generalization, Dynamics, and Amortized Inference

DGPC frameworks allow distributional generalization beyond Gaussian priors and likelihoods. At each layer, prediction errors are replaced by local KL divergences or tractable cross-entropic divergences appropriate to the parametric family (e.g., softmax/categorical for attention layers in transformers) (Pinchetti et al., 2022).
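For example, a categorical layer replaces the Gaussian squared error with the KL divergence between its (softmax-parameterised) state and the top-down prediction. A small self-contained sketch with hypothetical logits:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def categorical_kl(p, q, eps=1e-12):
    """D_KL[Cat(p) || Cat(q)]: the per-layer energy term for a
    categorical layer, replacing the Gaussian squared error."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

state = softmax(np.array([2.0, 0.5, -1.0]))   # layer's variational distribution
pred = softmax(np.array([1.8, 0.6, -0.9]))    # top-down prediction
print(categorical_kl(state, state))           # 0.0: a perfect prediction costs nothing
print(categorical_kl(state, pred) >= 0.0)     # True: KL is non-negative
```

The energy is zero exactly when the top-down prediction matches the layer's state, mirroring the role of the squared prediction error in the Gaussian case.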

Dynamical DGPC introduces explicit modeling of time (or spatial) derivatives alongside hierarchical structure. Each layer can maintain internal states subject to both top-down (hierarchical) and temporal (dynamical) predictions, with learned timescales or sampling distances as additional learnable latent variables (Ofner et al., 2021):

$\varepsilon_h^{(l)} = [\Sigma_h^{(l)}]^{-1}\big(s^{(l)} - f^{(l)}(s^{(l+1)};\theta_h^{(l)})\big), \qquad \varepsilon_d^{(l)} = [\Sigma_d^{(l)}]^{-1}\big(s^{(l)} - g^{(l)}(s^{(l)}_{t-dt^{(l)}}, dt^{(l)};\theta_d^{(l)})\big)$

The total energy EE includes both terms and is minimized using automatic differentiation and standard optimizers.
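A schematic of how the two precision-weighted error terms combine into one layer's energy contribution; the diagonal precisions, toy dynamics, and shapes below are hypothetical illustrations, not the cited models:

```python
import numpy as np

def layer_energy(s, s_prev, top_down_pred, g_dyn, prec_h, prec_d):
    """Energy contribution of one layer in dynamical DGPC: a
    hierarchical term against the top-down prediction plus a
    dynamical term against the prediction from the layer's own
    past state, each weighted by a diagonal precision vector."""
    err_h = s - top_down_pred          # hierarchical prediction error
    err_d = s - g_dyn(s_prev)          # dynamical (temporal) prediction error
    return 0.5 * float(np.sum(prec_h * err_h ** 2) + np.sum(prec_d * err_d ** 2))

s = np.array([0.2, -0.1])
s_prev = np.array([0.1, -0.05])
g = lambda v: 0.9 * v                  # toy dynamics with a fixed decay
E = layer_energy(s, s_prev, np.zeros(2), g, np.ones(2), 2.0 * np.ones(2))
print(E >= 0.0)                        # True: a sum of weighted squares
```

Summing these contributions over layers (and minimising with autodiff) gives the total energy $E$ described above.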

For efficient inference, amortized encoders (neural networks $q_\phi(z \mid x)$) can initialize latent states close to high-probability regions. Encoder parameters are trained using forward/reverse KL or Jeffreys divergence objectives, trading off bias and variance in posterior approximation (Zahid et al., 2023).

Stochastic inference via Langevin dynamics (gradient-based updates with injected Gaussian noise) converts point estimation into sampling, providing robust ELBO maximization and improved mixing properties. Preconditioning, inspired by Riemann manifold Langevin and adaptive optimizers, further improves stability and insensitivity to step size (Zahid et al., 2023).
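A minimal unadjusted-Langevin sketch (the target distribution and step size are illustrative; the cited work adds preconditioning on top of this basic update):

```python
import numpy as np

def langevin_step(z, grad_log_p, step, rng):
    """Unadjusted Langevin update: a gradient ascent step on log p
    plus Gaussian noise scaled by sqrt(2 * step), turning point (MAP)
    estimation into approximate posterior sampling."""
    noise = rng.normal(size=z.shape)
    return z + step * grad_log_p(z) + np.sqrt(2.0 * step) * noise

# Sample a standard normal, for which grad log p(z) = -z.
rng = np.random.default_rng(0)
z = np.zeros(1)
samples = []
for t in range(25000):
    z = langevin_step(z, lambda v: -v, step=0.05, rng=rng)
    if t >= 5000:                      # discard burn-in
        samples.append(z[0])
print(float(np.var(samples)))          # approximately 1, the target variance
```

With the noise term removed, the same update is plain gradient ascent on $\log p$, recovering deterministic (MAP) inference as a special case.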

5. Algorithmic Implementation and Architectures

All core components of DGPC inference and learning—layer-wise error calculation, local state updates, and parameter gradient steps—are expressed as compositions of differentiable operations. This enables integration with automatic differentiation frameworks (PyTorch, TensorFlow) and application to deep architectures (Ofner et al., 2021, Ororbia et al., 2022).

Table: DGPC Algorithmic Phases

Phase | Operation | Update rule/objective
Inference (E-step) | Update latent activities to minimize energy | $a_\ell \leftarrow a_\ell - \eta\, \partial F/\partial a_\ell$
Learning (M-step) | Update parameters using inferred latent activities | $\theta_\ell \leftarrow \theta_\ell - \alpha\, \partial F/\partial \theta_\ell$
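The two phases above can be put together in a toy NumPy training step for a linear hierarchy (layer sizes, iteration counts, and step sizes are illustrative assumptions):

```python
import numpy as np

def free_energy(a, W):
    return sum(0.5 * np.sum((a[l + 1] - W[l] @ a[l]) ** 2)
               for l in range(len(W)))

def train_step(a, W, x, n_inference=50, eta=0.2, alpha=0.05):
    """E-step: relax hidden activities with the output clamped to x;
    M-step: one local weight update per layer."""
    a = [v.copy() for v in a]
    W = [m.copy() for m in W]
    a[-1] = x.copy()                                   # clamp observation
    for _ in range(n_inference):                       # E-step
        e = [a[l + 1] - W[l] @ a[l] for l in range(len(W))]
        for l in range(1, len(a) - 1):
            a[l] -= eta * (e[l - 1] - W[l].T @ e[l])
    for l in range(len(W)):                            # M-step (local)
        e_l = a[l + 1] - W[l] @ a[l]
        W[l] += alpha * np.outer(e_l, a[l])
    return a, W

rng = np.random.default_rng(2)
W = [rng.normal(size=(4, 2)) * 0.3, rng.normal(size=(3, 4)) * 0.3]
a = [np.ones(2), np.zeros(4), np.zeros(3)]
x = np.array([1.0, -1.0, 0.5])
a2, W2 = train_step(a, W, x)
before = free_energy([a[0], a[1], x], W)
print(free_energy(a2, W2) < before)   # True: inference plus learning lower F
```

Every operation in both phases is differentiable, which is what allows the same loop to be expressed inside an autodiff framework for arbitrary (nonlinear) layers.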

Both fully connected and convolutional architectures are supported. In convolutional neural generative coding (Conv-NGC), convolutional/deconvolutional operators replace fully connected mappings, error maps are computed per feature map, and iterative state refinement is used to achieve competitive or superior reconstruction and out-of-distribution generalization relative to standard autoencoders (Ororbia et al., 2022).

For temporal and sequential tasks, extensions to RNNs, LSTMs, and dynamical architectures are directly available, with their PC equivalents converging to (or exactly matching, in the case of Z-IL (Salvatori et al., 2021)) the gradients of backpropagation through time (Millidge et al., 2020, Salvatori et al., 2021).

6. Empirical Properties and Performance

Empirically, DGPC matches or closely tracks the performance of standard backpropagation across diverse architectures and tasks:

  • In supervised classification (MNIST, CIFAR-10), DGPC matches BP test accuracy curves within 1–2% (Pinchetti et al., 2022).
  • For variational autoencoding, DGPC achieves comparable ELBOs, successful unimpaired reconstructions, and appropriate representation of latent variances (Pinchetti et al., 2022).
  • In transformer-based conditional language modeling, DGPC achieves perplexities ($\approx 200$) within 15% of BP, while original Gaussian PC is substantially inferior (Pinchetti et al., 2022).
  • Convolutional predictive coding architectures (Conv-NGC) outperform similarly sized convolutional autoencoders on out-of-distribution reconstruction, achieving higher SSIM and lower MSE with fewer parameters (Ororbia et al., 2022).
  • Langevinized PC with a Jeffreys objective and preconditioning achieves FID $\approx 39.6$ on SVHN (VAE baseline $\sim 53.9$) and converges in 3–10 epochs versus 50 for the VAE (Zahid et al., 2023).

DGPC exhibits high robustness under online, few-shot, and continual learning scenarios and provides local, biologically plausible credit assignment in both layered and arbitrarily branched architectures.

7. Neurobiological Plausibility, Extensibility, and Open Directions

A core feature of DGPC is local, parallel message-passing and plasticity, without hard requirements for weight transport or global synchrony. The inference and learning dynamics operate with local (synaptic) information: prediction errors, pre- and post-synaptic activity, and locally accessible derivatives. This structure is congruent with canonical cortical microcircuit models positing distinct prediction and prediction-error populations (Ofner et al., 2021). Extensions via Markov blanket partitions allow the generalization of functional modules to arbitrary graph motifs (Ofner et al., 2021).

DGPC is directly extensible to:

  • Hierarchical latent variable models and continuous-discrete variable interfaces (Zahid et al., 2023).
  • Temporal models with adaptable sampling/observation intervals (Ofner et al., 2021).
  • Modular architectures, including multi-scale and spatio-temporal hierarchies (Weber et al., 2023, Ofner et al., 2021).
  • Downstream tasks such as image inpainting, conditional generation, perception-planning loops, and policy selection in active inference frameworks.

Current limitations include the computational overhead of iterative inference (versus single-pass BP), memory requirements for maintaining separate states and predictions, and the need for tractable gradients or samples for generalized distributional divergences (Pinchetti et al., 2022). Nevertheless, DGPC algorithms are competitive or superior in quality, convergence speed, and generalizability compared to strong BP-trained baselines, and are directly enabled for neuromorphic or distributed parallel execution.


References:

(Zahid et al., 2023) Sample as You Infer: Predictive Coding With Langevin Dynamics
(Pinchetti et al., 2022) Predictive Coding beyond Gaussian Distributions
(Millidge et al., 2022) A Theoretical Framework for Inference and Learning in Predictive Coding Networks
(Salvatori et al., 2021) Reverse Differentiation via Predictive Coding
(Millidge et al., 2020) Predictive Coding Approximates Backprop along Arbitrary Computation Graphs
(Ofner et al., 2021) Differentiable Generalised Predictive Coding
(Weber et al., 2023) The generalized Hierarchical Gaussian Filter
(Ororbia et al., 2022) Convolutional Neural Generative Coding: Scaling Predictive Coding to Natural Images
