$μ$PC: Scaling Predictive Coding to 100+ Layer Networks (2505.13124v1)

Published 19 May 2025 in cs.LG, cs.AI, and cs.NE

Abstract: The biological implausibility of backpropagation (BP) has motivated many alternative, brain-inspired algorithms that attempt to rely only on local information, such as predictive coding (PC) and equilibrium propagation. However, these algorithms have notoriously struggled to train very deep networks, preventing them from competing with BP in large-scale settings. Indeed, scaling PC networks (PCNs) has recently been posed as a challenge for the community (Pinchetti et al., 2024). Here, we show that 100+ layer PCNs can be trained reliably using a Depth-$\mu$P parameterisation (Yang et al., 2023; Bordelon et al., 2023) which we call "$\mu$PC". Through an extensive analysis of the scaling behaviour of PCNs, we reveal several pathologies that make standard PCNs difficult to train at large depths. We then show that, despite addressing only some of these instabilities, $\mu$PC allows stable training of very deep (up to 128-layer) residual networks on simple classification tasks with competitive performance and little tuning compared to current benchmarks. Moreover, $\mu$PC enables zero-shot transfer of both weight and activity learning rates across widths and depths. Our results have implications for other local algorithms and could be extended to convolutional and transformer architectures. Code for $\mu$PC is made available as part of a JAX library for PCNs at https://github.com/thebuckleylab/jpc (Innocenti et al., 2024).

Summary

  • The paper presents μPC, a novel parameterization based on Depth-μP that enables stable training of predictive coding networks at depths over 100 layers.
  • Key findings include stable training of deep networks on standard tasks, competitive performance, and the ability to transfer optimal learning rates across widths and depths without retuning.
  • Theoretical analysis shows that μPC's equilibrated energy converges to the backpropagation MSE loss in wide networks, suggesting a link between the two approaches.

Scaling Predictive Coding to Deep Network Architectures

The paper "μPC: Scaling Predictive Coding to 100+ Layer Networks" presents a novel parameterization that enables training of predictive coding networks (PCNs) at large depths, overcoming a significant limitation of biologically inspired learning algorithms. The method, termed μPC, applies the depthwise maximal update parameterization (Depth-μP) to improve the stability and scalability of PCNs, allowing them to train effectively at depths exceeding 100 layers.
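As background for what follows: a PCN's objective is an energy that sums squared local prediction errors across layers, and inference relaxes the hidden activities by gradient descent on this energy before the weights are updated. The sketch below is a minimal NumPy illustration of that generic scheme, not the paper's jpc API; the linear layers, sizes, and step size are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-layer linear PCN: activities z[0] (clamped input) ... z[3] (clamped target).
sizes = [4, 8, 8, 2]
W = [rng.normal(0, 1 / np.sqrt(m), (n, m)) for m, n in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=sizes[0])   # clamped input
y = rng.normal(size=sizes[-1])  # clamped target

def energy(z, W):
    """E = 1/2 * sum_l ||z_{l+1} - W_l z_l||^2 (sum of local prediction errors)."""
    return 0.5 * sum(np.sum((z[l + 1] - W[l] @ z[l]) ** 2) for l in range(len(W)))

# Initialise hidden activities with a forward pass, then clamp the output to the target.
z = [x]
for Wl in W:
    z.append(Wl @ z[-1])
z[-1] = y

# Inference: gradient descent on E w.r.t. the hidden activities only.
# Each update uses only the errors at a layer and its immediate neighbour (local).
lr = 0.05
for _ in range(200):
    for l in range(1, len(z) - 1):
        eps_l = z[l] - W[l - 1] @ z[l - 1]        # error at layer l
        eps_next = z[l + 1] - W[l] @ z[l]         # error fed back from layer l+1
        z[l] -= lr * (eps_l - W[l].T @ eps_next)  # dE/dz_l

print(energy(z, W))  # energy after relaxation (lower than at the feedforward point)
```

After relaxation, the weights would be updated from the remaining local errors; the paper's analysis concerns how badly conditioned this inference problem becomes as depth grows.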

Core Contributions and Findings

The authors address the challenge of scaling PCNs by introducing a new parameterization strategy that enables reliable training of deep networks while mitigating issues associated with the ill-conditioning of the inference landscape and the instability of forward passes in deep models. The primary contributions are as follows:

  1. Depth-μP Reparameterization: The paper proposes μPC, a parameterization that applies Depth-μP scaling to PCNs, enabling the training of networks with over 100 layers. This parameterization addresses the instability issues in standard PCNs by ensuring stable forward passes at initialization, independent of network depth.
  2. Empirical Performance: Through rigorous experimentation, the authors demonstrate that μPC facilitates the stable training of very deep residual networks on standard classification tasks like MNIST and Fashion-MNIST, achieving competitive performance with minimal hyperparameter tuning compared to existing benchmarks.
  3. Zero-shot Hyperparameter Transfer: An intriguing property of μPC is the ability to transfer optimal weight and activity learning rates across different network widths and depths without additional tuning. This capability significantly reduces the computational cost of hyperparameter optimization in deep networks.
  4. Theoretical Convergence to Backpropagation (BP): The paper presents a theoretical analysis showing that, in the limit where network width vastly exceeds depth, the equilibrated energy of μPC converges to the mean squared error (MSE) loss used in BP. This result identifies a regime where μPC effectively approximates BP, providing a theoretical underpinning for its robust performance.
  5. Ill-conditioning and Performance: Despite not fully solving the ill-conditioning of the inference landscape, μPC's success suggests that a stable forward pass is crucial for the trainability of deep PCNs. The research indicates that while PCNs inherently face ill-conditioning, stable initialization and iterative inference corrections enable training to succeed.
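The stable-forward-pass point in item 1 can be illustrated numerically. With a residual stream updated as h ← h + s·tanh(W h), an unscaled branch (s = 1) lets the activation norm grow with depth, while the Depth-μP-style branch multiplier s = 1/√L keeps it roughly O(1) regardless of L. This NumPy sketch is only a caricature of the parameterization (the width, nonlinearity, and initialization here are arbitrary choices, and μPC itself involves further scalings not shown):

```python
import numpy as np

def forward_norm(L, width=256, branch_scale=1.0, seed=0):
    """Norm of a residual stream after L blocks: h <- h + s * tanh(W h)."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=width) / np.sqrt(width)  # roughly unit-norm input
    for _ in range(L):
        W = rng.normal(0, 1 / np.sqrt(width), (width, width))
        h = h + branch_scale * np.tanh(W @ h)
    return np.linalg.norm(h)

for L in (8, 32, 128):
    plain = forward_norm(L, branch_scale=1.0)
    mup = forward_norm(L, branch_scale=1.0 / np.sqrt(L))
    print(f"L={L:4d}  unscaled norm={plain:8.2f}  1/sqrt(L) norm={mup:6.2f}")
```

Because each scaled branch contributes ~1/L of the squared norm, the total contribution across L blocks stays bounded as depth increases, which is the intuition behind depth-independent stability at initialization.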

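The BP link in item 4 has a simple degenerate case worth checking: if the activities sit exactly at their feedforward values, every hidden prediction error vanishes and the energy reduces to the output MSE; the paper's result is that the *equilibrated* energy approaches this in the width ≫ depth limit. The toy check below verifies only that first step, on an illustrative linear network (sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [6, 32, 32, 3]
W = [rng.normal(0, 1 / np.sqrt(m), (n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
x, y = rng.normal(size=sizes[0]), rng.normal(size=sizes[-1])

# Feedforward activities, with the output layer clamped to the target.
z = [x]
for Wl in W:
    z.append(Wl @ z[-1])
f_x = z[-1]  # network prediction
z[-1] = y    # clamp output to target

# All hidden errors z_{l+1} - W_l z_l are zero at the feedforward point,
# so the energy collapses to the output term alone.
energy = 0.5 * sum(np.sum((z[l + 1] - W[l] @ z[l]) ** 2) for l in range(len(W)))
mse = 0.5 * np.sum((y - f_x) ** 2)
print(np.isclose(energy, mse))  # True
```

The nontrivial content of the theorem is that the equilibrium of inference stays close to this feedforward point when width dominates depth, so training on the equilibrated energy approximates training on the BP loss.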
Implications and Future Directions

The results presented in this paper have substantial implications for the development of biologically plausible learning algorithms that can scale to the depths required by modern computational tasks. By solving key scalability issues, μPC opens avenues for further exploration of predictive coding methods in large-scale settings, potentially bridging the gap between the biological plausibility of local learning algorithms and the performance of gradient-based methods like BP. Future work could extend these findings to more complex architectures, such as convolutional networks or transformers, and explore the computational efficiencies offered by μPC in different hardware settings.

Additionally, further study of the dynamics and theoretical properties of μPC, especially in nonlinear settings and throughout training, could yield deeper insights into optimizing local learning rules in neural networks. This research not only provides a practical pathway for scaling predictive coding networks but also enriches the understanding of how neural-inspired learning approaches can compete with traditional deep learning methodologies in terms of scalability and efficiency.
