
Mono-Forward (MF) Training for MLPs

Updated 3 July 2026
  • Mono-Forward (MF) is a strictly forward-only training method that replaces global backpropagation with independent per-layer objectives.
  • It improves efficiency by reducing training time by up to 34% and lowering energy consumption while matching or exceeding backpropagation accuracy.
  • MF unifies representation learning and label supervision using local projection matrices, enhancing both biological plausibility and hardware efficiency.

Mono-Forward (MF) is a backpropagation-free, strictly forward-only training strategy for multi-layer perceptrons (MLPs). It is designed to avoid the architectural, computational, and biological limitations of global error propagation. By assigning each hidden layer an independent classification objective, MF relies solely on local, forward-available signals, updating parameters using only per-layer information without requiring a global backward pass. MF unifies representation learning and label supervision through local projection matrices and achieves classification accuracy that matches or surpasses classical backpropagation (BP), while offering substantial energy, time, and memory savings on standard hardware (Spyra et al., 2 Nov 2025, Spyra, 23 Sep 2025, Gong et al., 16 Jan 2025).

1. Core Algorithm and Theoretical Foundations

MF operates on $L$-layer MLPs, with each hidden layer $l$ parameterized by a primary weight matrix $W^{(l)} \in \mathbb{R}^{h_l \times h_{l-1}}$ and a local projection matrix $P^{(l)} \in \mathbb{R}^{C \times h_l}$, where $C$ is the number of classes. The forward computations for a data sample $x \in \mathbb{R}^{h_0}$ and label $y \in \{0,1\}^C$ proceed as follows:

  • For $l = 1, \ldots, L$, starting from $a^{(0)} = x$, compute
    • $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$
    • $a^{(l)} = \phi(z^{(l)})$, where $\phi$ is typically ReLU or similar
    • $\hat{y}^{(l)} = \operatorname{softmax}(P^{(l)} a^{(l)})$

Each layer computes its own supervised cross-entropy loss: $\mathcal{L}^{(l)} = -\sum_{c=1}^{C} y_c \log \hat{y}^{(l)}_c$. Local gradients for a single layer are computed and applied, shown here in plain gradient-descent form with learning rate $\eta$: $W^{(l)} \leftarrow W^{(l)} - \eta \, \nabla_{W^{(l)}} \mathcal{L}^{(l)}$

$P^{(l)} \leftarrow P^{(l)} - \eta \, \nabla_{P^{(l)}} \mathcal{L}^{(l)}$

Critically, no backward pass through other layers is required; each layer minimizes its own $\mathcal{L}^{(l)}$ independently and updates $W^{(l)}$, $P^{(l)}$ using SGD, AdamW, or similar optimizers. Inference is performed using the final layer's class scores or via aggregation across layers (Spyra et al., 2 Nov 2025, Gong et al., 16 Jan 2025).
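To make the locality concrete, the per-layer gradients can be written out explicitly. The derivation below is a sketch, assuming that layer $l$ obtains class probabilities from a softmax over $P^{(l)} a^{(l)}$, that $y$ is one-hot, and that the incoming activation $a^{(l-1)}$ is treated as a constant input. The symbols $s^{(l)}$, $\delta^{(l)}$, and $g^{(l)}$ are introduced here for illustration only.

```latex
\begin{aligned}
s^{(l)} &= P^{(l)} a^{(l)}, \qquad
\hat{y}^{(l)} = \operatorname{softmax}\!\big(s^{(l)}\big), \qquad
\mathcal{L}^{(l)} = -\sum_{c=1}^{C} y_c \log \hat{y}^{(l)}_c \\
\delta^{(l)} &:= \frac{\partial \mathcal{L}^{(l)}}{\partial s^{(l)}} = \hat{y}^{(l)} - y \\
\nabla_{P^{(l)}} \mathcal{L}^{(l)} &= \delta^{(l)} \big(a^{(l)}\big)^{\top} \\
g^{(l)} &:= \frac{\partial \mathcal{L}^{(l)}}{\partial z^{(l)}}
  = \big(P^{(l)\top} \delta^{(l)}\big) \odot \phi'\!\big(z^{(l)}\big) \\
\nabla_{W^{(l)}} \mathcal{L}^{(l)} &= g^{(l)} \big(a^{(l-1)}\big)^{\top}, \qquad
\nabla_{b^{(l)}} \mathcal{L}^{(l)} = g^{(l)}
\end{aligned}
```

Every quantity on the right-hand side belongs to layer $l$ or is its direct input, so under these assumptions no term depends on the parameters or errors of any other layer.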

2. Algorithmic Workflow and Practical Steps

A canonical training loop in MF for an MLP of depth $L$ proceeds as follows: for each mini-batch, the layers are visited in order $l = 1, \ldots, L$; each layer computes its activation $a^{(l)}$ from $a^{(l-1)}$, projects it through $P^{(l)}$ to class scores, evaluates its local cross-entropy loss, updates $W^{(l)}$ and $P^{(l)}$, and passes $a^{(l)}$ on to the next layer. All layers proceed in a strictly forward manner: there is no global backward pass and no need to store all intermediate activations, which reduces memory requirements (Spyra et al., 2 Nov 2025, Spyra, 23 Sep 2025).
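The loop can be sketched in a few dozen lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses plain per-sample SGD, ReLU activations, a softmax over the projected activations, and passes the pre-update activation to the next layer as a constant. The names `MFLayer`, `local_step`, `train`, and `predict` are hypothetical.

```python
import math
import random

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(s):
    m = max(s)
    e = [math.exp(x - m) for x in s]
    t = sum(e)
    return [x / t for x in e]

class MFLayer:
    """One hidden layer: weights W, bias b, and a local projection P (hypothetical class)."""

    def __init__(self, n_in, n_out, n_classes, rng):
        w = 1.0 / math.sqrt(n_in)
        p = 1.0 / math.sqrt(n_out)
        self.W = [[rng.uniform(-w, w) for _ in range(n_in)] for _ in range(n_out)]
        self.b = [0.0] * n_out
        self.P = [[rng.uniform(-p, p) for _ in range(n_out)] for _ in range(n_classes)]

    def forward(self, a_prev):
        z = [zi + bi for zi, bi in zip(matvec(self.W, a_prev), self.b)]
        a = [max(0.0, zi) for zi in z]  # ReLU
        return z, a

    def scores(self, a):
        return matvec(self.P, a)

    def local_step(self, a_prev, label, lr):
        """Forward through this layer, update W, b, P from the local loss only."""
        z, a = self.forward(a_prev)
        probs = softmax(self.scores(a))
        loss = -math.log(probs[label])
        # Gradient of the local cross-entropy w.r.t. the class scores.
        delta = [p - (1.0 if c == label else 0.0) for c, p in enumerate(probs)]
        # Gradient w.r.t. the pre-activation, using P before it is updated.
        g = [sum(self.P[c][j] * delta[c] for c in range(len(delta)))
             * (1.0 if z[j] > 0 else 0.0) for j in range(len(z))]
        for c, d in enumerate(delta):
            for j, aj in enumerate(a):
                self.P[c][j] -= lr * d * aj
        for j, gj in enumerate(g):
            self.b[j] -= lr * gj
            for i, xi in enumerate(a_prev):
                self.W[j][i] -= lr * gj * xi
        # The activation is handed on as a constant; nothing flows back through it.
        return loss, a

def train(layers, data, epochs, lr):
    """Return, per epoch, the mean local loss of each layer."""
    history = []
    for _ in range(epochs):
        totals = [0.0] * len(layers)
        for x, label in data:
            a = x
            for k, layer in enumerate(layers):
                loss, a = layer.local_step(a, label, lr)
                totals[k] += loss
        history.append([t / len(data) for t in totals])
    return history

def predict(layers, x):
    """Classify with the final layer's projected scores."""
    a = x
    for layer in layers:
        _, a = layer.forward(a)
    s = layers[-1].scores(a)
    return max(range(len(s)), key=s.__getitem__)
```

Each call to `local_step` needs only the layer's own input, parameters, and the label, so a layer's update can be applied before the next layer has run. Batching, AdamW, and aggregation of scores across layers at inference are left out for brevity.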

3. Relation to FF and CaFo Algorithms

MF emerged as the synthesis of earlier forward-only strategies:

  • Forward-Forward (FF): Relies on positive/negative parallel passes and a “goodness” metric (sum of squared activations), requires label embedding in inputs and normalization for efficacy, and converges slowly due to the lack of a direct label-targeted signal.
  • Cascaded Forward (CaFo): Employs local block-level classifiers in CNNs, with random or feedback-alignment-trained features; local cross-entropy is only applied at block boundaries. Randomized variants are memory-efficient but suffer an accuracy loss, while feedback-aligned variants reduce the loss at high compute cost.
  • Mono-Forward (MF): Assigns per-layer local projections/classifiers ($P^{(l)}$). Each layer is jointly supervised, enabling rapid convergence and improved generalization, with no reliance on negative samples, feedback alignment, or intricate block pretraining.

In the cited comparisons, MF emerges as the most effective and hardware-friendly of the three approaches for MLPs, outperforming both predecessors in accuracy, energy, and time (Spyra et al., 2 Nov 2025, Spyra, 23 Sep 2025, Gong et al., 16 Jan 2025).
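The difference between the FF and MF layer signals can be illustrated with two small functions. This is a sketch with hypothetical function names: the first computes the sum-of-squared-activations "goodness" described for FF, the second computes per-class scores from a $C \times h$ projection matrix as MF does.

```python
def ff_goodness(activations):
    """Forward-Forward style 'goodness': sum of squared activations of one layer."""
    return sum(a * a for a in activations)

def mf_class_scores(projection, activations):
    """Mono-Forward style local scores: one score per class (rows of a C x h matrix)."""
    return [sum(p * a for p, a in zip(row, activations)) for row in projection]

activations = [0.5, 0.0, 2.0]
projection = [[1.0, 0.0, 0.0],   # class 0 reads the first unit
              [0.0, 0.0, 1.0]]   # class 1 reads the third unit

print(ff_goodness(activations))                  # 4.25
print(mf_class_scores(projection, activations))  # [0.5, 2.0]
```

The goodness value is a single scalar that carries no class information by itself, which is why FF needs labels embedded in the input and contrasting negative passes. The projected scores can be fed directly into a cross-entropy loss against the label.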

4. Experimental Protocols and Benchmarks

All MF evaluations are conducted under rigorous experimental controls:

  • Architectures: MNIST and Fashion-MNIST (2 × 1000 MLPs), CIFAR-10/CIFAR-100 (3 × 2000 MLPs).
  • Hyperparameter tuning: Universal search with Optuna across learning rate, weight decay, optimizer (AdamW recommended), batch size.
  • Early stopping: Validation accuracy monitored using the Prechelt criterion.
  • Hardware: NVIDIA A100 (40 GB) GPUs, PyTorch v2.4.0, CUDA 12.4, NVML for memory/energy, CodeCarbon for CO₂e.
  • Metrics: Test accuracy, training time (per epoch/total), energy (Wh), peak memory (MiB), FLOPs (forward), and estimated carbon emissions (g CO₂e).

This setup ensures fair, reproducible comparisons with backpropagation and other BP-free baselines, always on identical architectures (Spyra et al., 2 Nov 2025, Spyra, 23 Sep 2025).
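The source does not say which of Prechelt's stopping criteria was used. As one possibility, the sketch below implements the generalization-loss criterion $GL_\alpha$, which stops when the current validation error exceeds the best error seen so far by more than $\alpha$ percent. Applying it to the classification error (1 minus validation accuracy), the threshold `alpha=5.0`, and the function names are all assumptions made for illustration.

```python
def generalization_loss(val_errors):
    """GL(t) in percent: relative increase of the latest validation error over the best so far."""
    best = min(val_errors)
    current = val_errors[-1]
    if best == 0.0:
        # Avoid division by zero when a perfect validation score has been seen.
        return 0.0 if current == 0.0 else float("inf")
    return 100.0 * (current / best - 1.0)

def stop_epoch(val_accuracies, alpha=5.0):
    """Return the first 0-based epoch at which GL exceeds alpha, or None if it never does."""
    errors = []
    for epoch, acc in enumerate(val_accuracies):
        errors.append(1.0 - acc)
        if generalization_loss(errors) > alpha:
            return epoch
    return None

# Accuracy peaks at epoch 2 (0.88) and then degrades.
print(stop_epoch([0.80, 0.85, 0.88, 0.87, 0.86]))  # 3
```

In the example, the error rises from 0.12 to 0.13 at epoch 3, a relative increase of about 8.3%, which exceeds the 5% threshold.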

5. Empirical Performance and Efficiency Gains

MF consistently matches or exceeds BP's test accuracy, with training and measurement results summarized below (mean over multiple runs):

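| Dataset | Architecture | Algorithm | Accuracy (%) | Time (s) | Energy (Wh) | CO₂e (g) | Peak Mem (MiB) |
|---|---|---|---|---|---|---|---|
| MNIST | 2×1000 MLP | MF | 98.14 | 35.0 | 0.60 | 1.15 | 934 |
| MNIST | 2×1000 MLP | BP | 98.05 | 39.8 | 0.69 | 1.30 | 926 |
| F-MNIST | 2×1000 MLP | MF | 89.72 | 52.1 | 0.86 | 1.64 | 934 |
| F-MNIST | 2×1000 MLP | BP | 89.21 | 44.1 | 0.79 | 1.62 | 926 |
| CIFAR-10 | 3×2000 MLP | MF | 62.34 | 177.7 | 3.17 | 6.70 | 1120 |
| CIFAR-10 | 3×2000 MLP | BP | 61.13 | 268.5 | 5.35 | 11.03 | 1184 |
| CIFAR-100 | 3×2000 MLP | MF | 30.31 | 110.4 | 2.02 | 3.46 | 1142 |
| CIFAR-100 | 3×2000 MLP | BP | 29.94 | 111.8 | 2.30 | 3.29 | 1192 |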

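MF achieves up to 1.2 pp higher test accuracy, a 34% reduction in training time, and a 41% reduction in energy consumption relative to BP, with all three maxima occurring on CIFAR-10. Peak memory demand is reduced by up to 5.4% on wide MLPs, despite the per-layer projection-matrix overhead; small models can incur a ~1% memory increase. The efficiency gains are not uniform across datasets: on Fashion-MNIST, MF trained more slowly (52.1 s vs. 44.1 s) and used more energy (0.86 Wh vs. 0.79 Wh) than BP (Spyra et al., 2 Nov 2025, Spyra, 23 Sep 2025, Gong et al., 16 Jan 2025).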
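The headline percentages can be re-derived from the table with a short calculation. The snippet below is only an arithmetic check on the reported numbers; the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(bp_value, mf_value):
    """Percentage by which MF is lower than BP (negative means MF is higher)."""
    return 100.0 * (bp_value - mf_value) / bp_value

# (MF, BP) pairs copied from the results table.
cifar10 = {"time_s": (177.7, 268.5), "energy_wh": (3.17, 5.35), "peak_mem_mib": (1120, 1184)}
fmnist = {"time_s": (52.1, 44.1), "energy_wh": (0.86, 0.79)}

for name, rows in (("CIFAR-10", cifar10), ("F-MNIST", fmnist)):
    for metric, (mf, bp) in rows.items():
        print(f"{name:9s} {metric:13s} {relative_reduction(bp, mf):+.1f}%")
# CIFAR-10: +33.8% time, +40.7% energy, +5.4% peak memory
# F-MNIST:  -18.1% time, -8.9% energy (MF is higher on both)
```

The CIFAR-10 figures reproduce the quoted 34%, 41%, and 5.4% savings, while the Fashion-MNIST rows show MF costing more time and energy than BP on that dataset.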

6. Efficiency, Convergence Behavior, and Biological Plausibility

MF’s strictly local, forward-only updates provide several advantages:

  • Convergence: MF's per-layer cross-entropy acts as a strong regularizer. Each layer optimizes its own classification subproblem, which results in lower final validation losses than BP and better minima in the loss landscape. The effect is especially pronounced for deep or wide MLPs (Spyra et al., 2 Nov 2025, Spyra, 23 Sep 2025).
  • Parallelizability and Hardware Utilization: Updates can be applied as soon as local activations are computed, supporting asynchrony and improved parallelism. GPU profiling indicates lower but more consistent streaming multiprocessor utilization, shorter training times, and reduced latency versus BP (Gong et al., 16 Jan 2025).
  • Memory Profile: The ‘forward-only’ property obviates the requirement to cache activations for global gradient backpropagation, making MF suitable for memory-constrained hardware and potential custom accelerator designs (Spyra et al., 2 Nov 2025).
  • Biological Plausibility: MF requires no symmetric weight transport or global error signals, with each layer able to execute updates using only locally available information—aligning more closely with synaptic plasticity as observed in biological learning (Gong et al., 16 Jan 2025).

7. Limitations, Extensions, and Research Directions

The current validation of MF is restricted to MLP architectures, with all performance, energy, and memory claims established on this architecture class. Extension to convolutional and transformer-based networks, along with hierarchical or block-wise MF variants for scalability, remains open. Research into parameter-efficient forms of the projection matrices, alternative per-layer objectives, and modification for recurrent or graph neural networks is ongoing (Spyra et al., 2 Nov 2025, Gong et al., 16 Jan 2025).

A plausible implication is that while MF yields pronounced advantages for fully connected MLPs, effective generalization to domains dominated by convolutional or attention-based networks (e.g., large-scale image and sequence modeling) may require significant architectural adaptation or hybridization.


References:

(Spyra et al., 2 Nov 2025, Spyra, 23 Sep 2025, Gong et al., 16 Jan 2025)
