The Cascaded Forward Algorithm for Neural Network Training (2303.09728v3)
Abstract: The backpropagation (BP) algorithm has been widely used as the mainstream learning procedure for neural networks over the past decade and has played a significant role in the development of deep learning. However, it has some limitations, such as getting stuck in local minima and suffering from vanishing/exploding gradients, which have also led to questions about its biological plausibility. To address these limitations, alternatives to backpropagation have begun to be explored, with the Forward-Forward (FF) algorithm being one of the best known. In this paper we propose a new learning framework for neural networks, the Cascaded Forward (CaFo) algorithm, which, like FF, does not rely on BP optimization. Unlike FF, our framework directly outputs a label distribution at each cascaded block, so it requires no generation of additional negative samples and is therefore more efficient at both training and test time. Moreover, each block in our framework can be trained independently, so the method can be easily deployed on parallel acceleration systems. The proposed method is evaluated on four public image classification benchmarks, and the experimental results demonstrate a significant improvement in prediction accuracy over the baseline.
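To make the training scheme concrete, below is a minimal PyTorch sketch of the cascaded-block idea described in the abstract. It is not the authors' implementation: the block architecture (`CaFoBlock`), the cross-entropy objective for each local predictor, and the averaging used to fuse the per-block label distributions at test time are illustrative assumptions.

```python
# Minimal sketch of a cascaded-forward-style pipeline (illustrative, not the
# authors' code): each block feeds features forward and is trained only on its
# own local prediction loss, with no gradient crossing block boundaries.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CaFoBlock(nn.Module):
    """One cascaded block: a small feature extractor plus a local label predictor."""
    def __init__(self, in_ch, out_ch, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.predictor = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(out_ch, num_classes)
        )

    def forward(self, x):
        h = self.features(x)
        return h, self.predictor(h)  # features for the next block, local logits

blocks = [CaFoBlock(3, 32, 10), CaFoBlock(32, 64, 10)]
optims = [torch.optim.Adam(b.parameters(), lr=1e-3) for b in blocks]

def train_step(x, y):
    # Each block minimizes its own cross-entropy loss; the input passed to the
    # next block is detached, so no gradient ever flows between blocks.
    for block, opt in zip(blocks, optims):
        h, logits = block(x)
        loss = F.cross_entropy(logits, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        x = h.detach()

@torch.no_grad()
def predict(x):
    # Fuse the per-block label distributions (here by simple averaging).
    probs = []
    for block in blocks:
        x, logits = block(x)
        probs.append(F.softmax(logits, dim=1))
    return torch.stack(probs).mean(dim=0)

# Toy usage on random CIFAR-10-shaped data.
x = torch.randn(8, 3, 32, 32)
y = torch.randint(0, 10, (8,))
train_step(x, y)
print(predict(x).argmax(dim=1))
```

The key design point in this sketch is the `x = h.detach()` step: it severs the computation graph between blocks, so gradients never cross block boundaries and each block can be optimized independently (or on separate devices).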
- Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
- Decision fusion networks for image classification. IEEE Transactions on Neural Networks and Learning Systems, 2022.
- Object detection with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems, 30(11):3212–3232, 2019.
- Neural machine translation with GRU-gated attention model. IEEE Transactions on Neural Networks and Learning Systems, 31(11):4688–4698, 2020.
- Hierarchical human-like deep neural networks for abstractive text summarization. IEEE Transactions on Neural Networks and Learning Systems, 32(6):2744–2757, 2020.
- Neighborhood pattern is crucial for graph convolutional networks performing node classification. IEEE Transactions on Neural Networks and Learning Systems, 2022.
- Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.
- Boris Hanin. Which neural net architectures give rise to exploding and vanishing gradients? Advances in Neural Information Processing Systems, 31, 2018.
- Overfitting and neural networks: conjugate gradient and backpropagation. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium, volume 1, pages 114–119. IEEE, 2000.
- Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Advances in Neural Information Processing Systems, 27, 2014.
- Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7(1):13276, 2016.
- Arild Nøkland. Direct feedback alignment provides learning in deep neural networks. Advances in Neural Information Processing Systems, 29, 2016.
- Learning without feedback: Fixed random learning signals allow for feedforward training of deep neural networks. Frontiers in Neuroscience, 15:629892, 2021.
- Error-driven input modulation: solving the credit assignment problem without a backward pass. In International Conference on Machine Learning, pages 4937–4955. PMLR, 2022.
- Geoffrey Hinton. The forward-forward algorithm: Some preliminary investigations. arXiv preprint arXiv:2212.13345, 2022.
- Constructing a schema: The case of the chain rule? The Journal of Mathematical Behavior, 16(4):345–364, 1997.
- Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 13001–13008, 2020.
- Lutz Prechelt. Early stopping—but when? In Neural Networks: Tricks of the Trade, second edition, pages 53–67, 2012.
- Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010.
- Why gradient clipping accelerates training: A theoretical justification for adaptivity. arXiv preprint arXiv:1905.11881, 2019.
- Reluplex made more practical: Leaky ReLU. In 2020 IEEE Symposium on Computers and Communications (ISCC), pages 1–7. IEEE, 2020.
- Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR, 2015.
- The comparison of L1 and L2-norm minimization methods. International Journal of the Physical Sciences, 5(11):1721–1727, 2010.
- Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
- The predictive forward-forward algorithm. arXiv preprint arXiv:2301.01452, 2023.
- Stephen Grossberg. Competitive learning: From interactive activation to adaptive resonance. Cognitive Science, 11(1):23–63, 1987.
- Decoupled neural interfaces using synthetic gradients. In International Conference on Machine Learning, pages 1627–1635. PMLR, 2017.
- Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
- From softmax to sparsemax: A sparse model of attention and multi-label classification. In International Conference on Machine Learning, pages 1614–1623. PMLR, 2016.
- Leo Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996.
- Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100, 1998.
- Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
- Matching networks for one shot learning. Advances in Neural Information Processing Systems, 29, 2016.