Scaling Supervised Local Learning with Augmented Auxiliary Networks (2402.17318v1)
Abstract: Deep neural networks are typically trained with global error signals that are backpropagated (BP) end-to-end, a scheme that is not only biologically implausible but also suffers from the update-locking problem and incurs substantial memory consumption. Local learning, which updates each layer independently through a gradient-isolated auxiliary network, offers a promising alternative to address the above problems. However, existing local learning methods exhibit a large accuracy gap relative to their BP-trained counterparts, particularly for large-scale networks. This stems from the weak coupling between local layers and their subsequent network layers, as there is no gradient communication across layers. To tackle this issue, we put forward an augmented local learning method, dubbed AugLocal. AugLocal constructs each hidden layer's auxiliary network by uniformly selecting a small subset of its subsequent network layers, thereby enhancing their synergy. We also propose to linearly reduce the depth of auxiliary networks as the hidden layer goes deeper, ensuring sufficient network capacity while reducing the computational cost of auxiliary networks. Our extensive experiments on four image classification datasets (i.e., CIFAR-10, SVHN, STL-10, and ImageNet) demonstrate that AugLocal can effectively scale up to tens of local layers with accuracy comparable to BP-trained networks while reducing GPU memory usage by around 40%. The proposed AugLocal method therefore opens up a myriad of opportunities for training high-performance deep neural networks on resource-constrained platforms. Code is available at https://github.com/ChenxiangMA/AugLocal.
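The abstract describes two mechanisms: uniformly selecting a small subset of a hidden layer's subsequent layers to build its auxiliary network, and linearly shrinking the auxiliary depth for deeper layers. The sketch below illustrates one plausible reading of that index-selection logic; the function names, the exact depth schedule, and the evenly-spaced selection rule are assumptions for illustration, not the paper's exact implementation.

```python
# Hypothetical sketch of AugLocal-style auxiliary-network index selection.
# Assumptions (not from the paper's code): depth decays linearly from
# max_depth at the first hidden layer to 1 at the last, and "uniform
# selection" means evenly spaced indices over the subsequent layers.

def auxiliary_depth(layer_idx: int, num_layers: int, max_depth: int) -> int:
    """Linearly reduce auxiliary-network depth as the hidden layer goes deeper."""
    frac = layer_idx / max(num_layers - 1, 1)  # 0.0 at first layer, 1.0 at last
    return max(1, round(max_depth - frac * (max_depth - 1)))

def select_auxiliary_layers(layer_idx: int, num_layers: int, max_depth: int) -> list[int]:
    """Uniformly pick indices of subsequent layers to form the auxiliary network."""
    subsequent = list(range(layer_idx + 1, num_layers))
    if not subsequent:
        return []  # the last layer is trained against the output directly
    depth = min(auxiliary_depth(layer_idx, num_layers, max_depth), len(subsequent))
    step = len(subsequent) / depth  # evenly spaced sampling over subsequent layers
    return [subsequent[min(int(i * step), len(subsequent) - 1)] for i in range(depth)]
```

For example, with 10 hidden layers and a maximum auxiliary depth of 4, the first hidden layer would draw 4 evenly spaced layers from layers 1–9, while the penultimate layer would use only its single successor, matching the linear depth reduction described above.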
Authors: Chenxiang Ma, Jibin Wu, Chenyang Si, Kay Chen Tan