Mono-Forward (MF) Training for MLPs

Updated 3 July 2026

Mono-Forward (MF) is a strictly forward-only training method that replaces global backpropagation with independent per-layer objectives.
It improves efficiency by reducing training time by up to 34% and lowering energy consumption while matching or exceeding backpropagation accuracy.
MF unifies representation learning and label supervision using local projection matrices, enhancing both biological plausibility and hardware efficiency.

Mono-Forward (MF) is a backpropagation-free, strictly forward-only training strategy for multi-layer perceptrons (MLPs). It is designed to avoid the architectural, computational, and biological limitations of global error propagation. By assigning each hidden layer an independent classification objective, MF relies solely on local, forward-available signals: parameters are updated using only per-layer information, with no global backward pass. MF unifies representation learning and label supervision through local projection matrices and reports classification accuracy that matches or surpasses classical backpropagation (BP) on identical MLP architectures, while offering substantial energy, time, and memory savings on standard hardware (Spyra et al., 2 Nov 2025, Spyra, 23 Sep 2025, Gong et al., 16 Jan 2025).

1. Core Algorithm and Theoretical Foundations

MF operates on $L$-layer MLPs, with each hidden layer $l$ parameterized by a primary weight $W^{(l)} \in \mathbb{R}^{h_l \times h_{l-1}}$ and a local projection matrix $P^{(l)} \in \mathbb{R}^{C \times h_l}$, where $C$ is the number of classes. The forward computations for a data sample $x \in \mathbb{R}^{h_0}$ and one-hot label $y \in \{0,1\}^C$ proceed as follows:

For $l = 1, \dots, L$, with $a^{(0)} = x$, compute
- $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$
- $a^{(l)} = \phi(z^{(l)})$, where $\phi$ is typically ReLU or similar
- $s^{(l)} = P^{(l)} a^{(l)}$, the layer's class scores

Each layer computes its own supervised cross-entropy loss:

$$\mathcal{L}^{(l)} = -\sum_{c=1}^{C} y_c \log p^{(l)}_c, \qquad p^{(l)} = \mathrm{softmax}\left(s^{(l)}\right)$$

Local gradients for a single layer are computed and applied:

$$\frac{\partial \mathcal{L}^{(l)}}{\partial P^{(l)}} = \left(p^{(l)} - y\right) a^{(l)\top}, \qquad \frac{\partial \mathcal{L}^{(l)}}{\partial W^{(l)}} = \left[\left(P^{(l)\top}\left(p^{(l)} - y\right)\right) \odot \phi'\left(z^{(l)}\right)\right] a^{(l-1)\top}$$

$$W^{(l)} \leftarrow W^{(l)} - \eta \frac{\partial \mathcal{L}^{(l)}}{\partial W^{(l)}}, \qquad P^{(l)} \leftarrow P^{(l)} - \eta \frac{\partial \mathcal{L}^{(l)}}{\partial P^{(l)}}$$

Here $\eta$ is the learning rate and the update is written for plain SGD; the bias $b^{(l)}$ is updated analogously.

Critically, no backward pass through other layers is required; each layer minimizes its own $\mathcal{L}^{(l)}$ independently and updates $W^{(l)}$, $P^{(l)}$ using SGD, AdamW, or similar optimizers. Inference is performed using the final layer's class scores or via aggregation across layers (Spyra et al., 2 Nov 2025, Gong et al., 16 Jan 2025).
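The per-layer update described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes ReLU activations, plain SGD, and a loss averaged over the mini-batch, and the names `mf_layer_step`, `scores`-style variables, and the toy shapes are choices made for this sketch only.

```python
import numpy as np

def softmax(s):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mf_layer_step(W, b, P, a_prev, y, lr=0.1):
    """One local update of a single layer (hypothetical helper).

    a_prev: (B, h_prev) layer inputs, y: (B, C) one-hot labels.
    W, b, P are updated in place. Returns the activations and the
    local loss, both computed before the update.
    """
    B = a_prev.shape[0]
    z = a_prev @ W.T + b                 # (B, h) pre-activations
    a = np.maximum(z, 0.0)               # ReLU (assumed choice of phi)
    p = softmax(a @ P.T)                 # (B, C) class probabilities
    loss = -np.mean(np.sum(y * np.log(p + 1e-12), axis=1))

    delta = (p - y) / B                  # gradient w.r.t. class scores
    grad_P = delta.T @ a                 # (C, h)
    grad_z = (delta @ P) * (z > 0)       # (B, h), ReLU derivative applied
    grad_W = grad_z.T @ a_prev           # (h, h_prev)
    grad_b = grad_z.sum(axis=0)

    W -= lr * grad_W                     # plain SGD, nothing leaves this layer
    b -= lr * grad_b
    P -= lr * grad_P
    return a, loss

# Toy usage with random data (illustrative shapes only).
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))
y = np.eye(3)[rng.integers(0, 3, size=32)]
W = rng.normal(scale=0.1, size=(16, 8))
b = np.zeros(16)
P = rng.normal(scale=0.1, size=(3, 16))
losses = [mf_layer_step(W, b, P, x, y)[1] for _ in range(50)]
```

Every gradient in the function depends only on the layer's own input, pre-activations, and projection matrix, which is what makes the update local.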

2. Algorithmic Workflow and Practical Steps

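A canonical training loop in MF for an MLP of depth $L$ is as follows. For each mini-batch $(x, y)$, set $a^{(0)} = x$; then, for each layer $l = 1, \dots, L$ in order, compute the layer's activations from $a^{(l-1)}$, project them to class scores with $P^{(l)}$, evaluate the layer's local cross-entropy loss against $y$, update $W^{(l)}$, $b^{(l)}$, and $P^{(l)}$ from that loss alone, and pass $a^{(l)}$ forward as a fixed input to layer $l+1$. All layers proceed in a strictly forward manner: there is no global backward pass and no need to store all intermediate activations, which reduces memory requirements (Spyra et al., 2 Nov 2025, Spyra, 23 Sep 2025).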
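As a rough end-to-end illustration of such a loop, the following NumPy sketch trains a two-layer MLP on synthetic data and then predicts either from the last layer's scores or from the sum of all layers' scores. Everything here is an assumption made for the example: the `MFLayer` class, the toy Gaussian data, the initialization, plain SGD, and summation as the aggregation rule are not taken from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(s):
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

class MFLayer:
    """Hidden layer with its own projection matrix P (hypothetical helper)."""
    def __init__(self, h_in, h_out, n_classes):
        self.W = rng.normal(scale=np.sqrt(2.0 / h_in), size=(h_out, h_in))
        self.b = np.zeros(h_out)
        self.P = rng.normal(scale=np.sqrt(1.0 / h_out), size=(n_classes, h_out))

    def forward(self, a_prev):
        z = a_prev @ self.W.T + self.b
        return z, np.maximum(z, 0.0)          # ReLU assumed

    def scores(self, a):
        return a @ self.P.T

    def train_step(self, a_prev, y, lr):
        z, a = self.forward(a_prev)
        p = softmax(self.scores(a))
        loss = -np.mean(np.sum(y * np.log(p + 1e-12), axis=1))
        delta = (p - y) / len(y)              # gradient w.r.t. class scores
        grad_z = (delta @ self.P) * (z > 0)   # uses P before it is updated
        self.P -= lr * delta.T @ a
        self.W -= lr * grad_z.T @ a_prev
        self.b -= lr * grad_z.sum(axis=0)
        return a, loss                        # a is a plain array: no gradient flows on

# Toy data: three Gaussian blobs, standardized per feature (illustrative only).
C, n, batch = 3, 300, 50
labels = rng.integers(0, C, size=n)
centers = 3.0 * rng.normal(size=(C, 10))
X = centers[labels] + rng.normal(size=(n, 10))
X = (X - X.mean(axis=0)) / X.std(axis=0)
Y = np.eye(C)[labels]

layers = [MFLayer(10, 32, C), MFLayer(32, 32, C)]
history = []                                  # per-epoch mean local loss of each layer
for epoch in range(30):
    perm = rng.permutation(n)
    epoch_loss = np.zeros(len(layers))
    for start in range(0, n, batch):
        idx = perm[start:start + batch]
        a = X[idx]                            # a^(0) = x
        for i, layer in enumerate(layers):    # strictly forward, one layer at a time
            a, loss = layer.train_step(a, Y[idx], lr=0.02)
            epoch_loss[i] += loss
    history.append(epoch_loss / (n // batch))

def predict(X, aggregate=False):
    """Predict from the last layer's scores, or from the sum over layers."""
    a, total = X, 0.0
    for layer in layers:
        _, a = layer.forward(a)
        s = layer.scores(a)
        total = total + s
    return (total if aggregate else s).argmax(axis=1)
```

Because each `train_step` receives only an array of activations, a layer can be updated as soon as its input is available, and the input can be discarded once the next layer has consumed it.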

3. Relation to FF and CaFo Algorithms

MF emerged as the synthesis of earlier forward-only strategies:

Forward-Forward (FF): Relies on positive/negative parallel passes and a “goodness” metric (sum of squared activations), requires label embedding in inputs and normalization for efficacy, and converges slowly due to the lack of a direct label-targeted signal.
Cascaded Forward (CaFo): Employs local block-level classifiers in CNNs, with random or feedback-alignment-trained features; local cross-entropy is only applied at block boundaries. Randomized variants are memory-efficient but suffer an accuracy loss, while feedback-aligned variants reduce the loss at high compute cost.
Mono-Forward (MF): Assigns per-layer local projections/classifiers ($P^{(l)}$). Every layer receives direct label supervision through its own projection, enabling rapid convergence and improved generalization, with no reliance on negative samples, feedback alignment, or intricate block pretraining.
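The difference between the FF and MF training signals can be made concrete with a small NumPy sketch. It only contrasts the two quantities named above, the FF goodness (sum of squared activations) and the MF class scores obtained through a projection matrix; the shapes and the random values are placeholders chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.maximum(rng.normal(size=(4, 16)), 0.0)  # batch of 4 ReLU activations, h = 16
P = rng.normal(size=(10, 16))                  # hypothetical projection matrix, C = 10

# FF: one scalar per sample; carries no class information by itself,
# which is why FF needs labels embedded in the input and negative samples.
ff_goodness = np.sum(a ** 2, axis=1)           # shape (4,)

# MF: one score per class; can be fed directly into a cross-entropy loss.
mf_scores = a @ P.T                            # shape (4, 10)
```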

In the cited comparisons, this progression makes MF the most effective and hardware-friendly of the three approaches for MLPs, outperforming both predecessors in accuracy, energy, and time (Spyra et al., 2 Nov 2025, Spyra, 23 Sep 2025, Gong et al., 16 Jan 2025).

4. Experimental Protocols and Benchmarks

All MF evaluations are conducted under rigorous experimental controls:

Architectures: MNIST and Fashion-MNIST (2 × 1000 MLPs), CIFAR-10/CIFAR-100 (3 × 2000 MLPs).
Hyperparameter tuning: Universal search with Optuna across learning rate, weight decay, optimizer (AdamW recommended), batch size.
Early stopping: Validation accuracy monitored using the Prechelt criterion.
Hardware: NVIDIA A100 (40 GB) GPUs, PyTorch v2.4.0, CUDA 12.4, NVML for memory/energy, CodeCarbon for CO₂e.
Metrics: Test accuracy, training time (per epoch/total), energy (Wh), peak memory (MiB), FLOPs (forward), and estimated carbon emissions (g CO₂e).
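For readers unfamiliar with the Prechelt criterion, the sketch below implements its best-known form, the generalization-loss rule GL(t) = 100 · (E_va(t) / E_opt(t) − 1), which stops training once GL exceeds a threshold α. The source does not say which Prechelt variant or threshold was used, so the choice of the GL rule, the value `alpha=5.0`, and the conversion of validation accuracy to error as 1 − accuracy are all assumptions of this example.

```python
import math

def generalization_loss(val_errors):
    """Prechelt's GL(t) in percent, evaluated at the latest epoch.

    val_errors: validation errors per epoch, most recent last.
    """
    best = min(val_errors)            # E_opt(t): lowest error seen so far
    latest = val_errors[-1]           # E_va(t)
    if best == 0.0:
        return 0.0 if latest == 0.0 else math.inf
    return 100.0 * (latest / best - 1.0)

def should_stop(val_errors, alpha=5.0):
    """Stop once the generalization loss exceeds alpha percent (assumed threshold)."""
    return generalization_loss(val_errors) > alpha

# Hypothetical validation errors (1 - accuracy) over five epochs.
errors = [0.10, 0.05, 0.04, 0.045, 0.05]
stop_at_epoch_3 = should_stop(errors[:3])   # latest value is the best so far
stop_at_epoch_5 = should_stop(errors)       # error has risen 25% above the best
```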

This setup ensures fair, reproducible comparisons with backpropagation and other BP-free baselines, always on identical architectures (Spyra et al., 2 Nov 2025, Spyra, 23 Sep 2025).

5. Empirical Performance and Efficiency Gains

MF consistently matches or exceeds BP's test accuracy, with training and measurement results summarized below (mean over multiple runs):

Dataset	Architecture	Algorithm	Accuracy (%)	Time (s)	Energy (Wh)	CO₂e (g)	Peak Mem (MiB)
MNIST	2×1000 MLP	MF	98.14	35.0	0.60	1.15	934
		BP	98.05	39.8	0.69	1.30	926
F-MNIST	2×1000 MLP	MF	89.72	52.1	0.86	1.64	934
		BP	89.21	44.1	0.79	1.62	926
CIFAR-10	3×2000 MLP	MF	62.34	177.7	3.17	6.70	1120
		BP	61.13	268.5	5.35	11.03	1184
CIFAR-100	3×2000 MLP	MF	30.31	110.4	2.02	3.46	1142
		BP	29.94	111.8	2.30	3.29	1192

MF achieves up to 1.2 pp higher test accuracy, a 34% reduction in training time, and a 41% reduction in energy consumption relative to BP, all on CIFAR-10. The gains are not uniform: on Fashion-MNIST, MF was slower (52.1 s vs. 44.1 s) and used more energy (0.86 Wh vs. 0.79 Wh) than BP, although it was still more accurate. Peak memory demand is reduced by up to 5.4% on wide MLPs, despite the per-layer projection-matrix overhead; small models can incur a ~1% memory increase (Spyra et al., 2 Nov 2025, Spyra, 23 Sep 2025, Gong et al., 16 Jan 2025).
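The headline percentages can be checked directly against the CIFAR-10 rows of the table. The short script below does only that arithmetic; the dictionary layout and the function name are choices made for this example.

```python
# (MF, BP) pairs copied from the CIFAR-10 rows of the results table.
cifar10 = {
    "time_s": (177.7, 268.5),
    "energy_wh": (3.17, 5.35),
    "peak_mem_mib": (1120, 1184),
}

def relative_reduction(mf, bp):
    """Percentage by which the MF value is lower than the BP value."""
    return 100.0 * (bp - mf) / bp

savings = {k: round(relative_reduction(mf, bp), 1) for k, (mf, bp) in cifar10.items()}
accuracy_gain_pp = round(62.34 - 61.13, 2)

print(savings)           # {'time_s': 33.8, 'energy_wh': 40.7, 'peak_mem_mib': 5.4}
print(accuracy_gain_pp)  # 1.21
```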

6. Efficiency, Convergence Behavior, and Biological Plausibility

MF’s strictly local, forward-only updates provide several advantages:

Convergence: MF’s per-layer cross-entropy acts as a strong regularizer. Each layer optimizes its own classification subproblem, which results in lower final validation losses than BP and convergence to better minima of the loss landscape. The effect is especially pronounced for deep or wide MLPs (Spyra et al., 2 Nov 2025, Spyra, 23 Sep 2025).
Parallelizability and Hardware Utilization: Updates can be applied as soon as local activations are computed, supporting asynchrony and improved parallelism. GPU profiling indicates lower but more consistent streaming multiprocessor utilization, shorter training times, and reduced latency versus BP (Gong et al., 16 Jan 2025).
Memory Profile: The ‘forward-only’ property obviates the requirement to cache activations for global gradient backpropagation, making MF suitable for memory-constrained hardware and potential custom accelerator designs (Spyra et al., 2 Nov 2025).
Biological Plausibility: MF requires no symmetric weight transport or global error signals, with each layer able to execute updates using only locally available information—aligning more closely with synaptic plasticity as observed in biological learning (Gong et al., 16 Jan 2025).
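To give a feel for the memory argument, the sketch below counts activation values under an idealized model: BP keeps every layer's activations until the backward pass, while a local forward-only update needs only the current layer's input and output. It also counts the extra parameters introduced by the projection matrices. The counting model, the function names, and the batch size of 128 are assumptions of this example; it ignores weights, optimizer state, and framework overhead, so it should be read as an upper bound on the activation-related saving, not as a prediction of the measured peak memory, where the reported difference is only a few percent.

```python
def activation_floats(widths, batch_size):
    """Idealized number of activation values kept alive during training.

    widths = [h_0, h_1, ..., h_L] (input width followed by hidden widths).
    Returns (bp_count, mf_count).
    """
    # BP: all layer inputs and outputs are cached until the backward pass.
    bp = batch_size * sum(widths)
    # Local updates: only the current layer's input and output are needed.
    mf = batch_size * max(h_in + h_out for h_in, h_out in zip(widths, widths[1:]))
    return bp, mf

def projection_params(widths, n_classes):
    """Extra parameters from one C x h_l projection matrix per hidden layer."""
    return n_classes * sum(widths[1:])

# 3 x 2000 MLP on CIFAR-10 inputs (32 * 32 * 3 = 3072 features, 10 classes).
widths = [3072, 2000, 2000, 2000]
bp_count, mf_count = activation_floats(widths, batch_size=128)
extra_params = projection_params(widths, n_classes=10)
```

Under these assumptions the counts are 1,161,216 activation values for BP versus 649,216 for the local scheme, against 60,000 additional projection parameters.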

7. Limitations, Extensions, and Research Directions

The current validation of MF is restricted to MLP architectures, and all performance, energy, and memory claims are established on this architecture class. Extension to convolutional and transformer-based networks, along with hierarchical or block-wise MF variants for scalability, remains open. Research into parameter-efficient forms of the projection matrices, alternative per-layer objectives, and adaptations for recurrent or graph neural networks is ongoing (Spyra et al., 2 Nov 2025, Gong et al., 16 Jan 2025).

A plausible implication is that while MF yields pronounced advantages for fully connected MLPs, effective generalization to domains dominated by convolutional or attention-based networks (e.g., large-scale image and sequence modeling) may require significant architectural adaptation or hybridization.

References:

(Spyra et al., 2 Nov 2025, Spyra, 23 Sep 2025, Gong et al., 16 Jan 2025)