Scaling Supervised Local Learning with Augmented Auxiliary Networks (2402.17318v1)

Published 27 Feb 2024 in cs.NE, cs.CV, and cs.LG

Abstract: Deep neural networks are typically trained using global error signals that backpropagate (BP) end-to-end, which is not only biologically implausible but also suffers from the update locking problem and requires huge memory consumption. Local learning, which updates each layer independently with a gradient-isolated auxiliary network, offers a promising alternative to address the above problems. However, existing local learning methods are confronted with a large accuracy gap with the BP counterpart, particularly for large-scale networks. This is due to the weak coupling between local layers and their subsequent network layers, as there is no gradient communication across layers. To tackle this issue, we put forward an augmented local learning method, dubbed AugLocal. AugLocal constructs each hidden layer's auxiliary network by uniformly selecting a small subset of layers from its subsequent network layers to enhance their synergy. We also propose to linearly reduce the depth of auxiliary networks as the hidden layer goes deeper, ensuring sufficient network capacity while reducing the computational cost of auxiliary networks. Our extensive experiments on four image classification datasets (i.e., CIFAR-10, SVHN, STL-10, and ImageNet) demonstrate that AugLocal can effectively scale up to tens of local layers with a comparable accuracy to BP-trained networks while reducing GPU memory usage by around 40%. The proposed AugLocal method, therefore, opens up a myriad of opportunities for training high-performance deep neural networks on resource-constrained platforms. Code is available at https://github.com/ChenxiangMA/AugLocal.

Authors (4)
  1. Chenxiang Ma (12 papers)
  2. Jibin Wu (42 papers)
  3. Chenyang Si (36 papers)
  4. Kay Chen Tan (83 papers)

Summary

  • The paper introduces AugLocal, a method that uses augmented auxiliary networks to narrow the performance gap between local learning and backpropagation.
  • It introduces a pyramidal structure that couples hidden and downstream layers for efficient, parallelized training.
  • Empirical evaluations on CIFAR-10, SVHN, STL-10, and ImageNet confirm comparable accuracy with reduced GPU memory usage.

Enhancing Supervised Local Learning with Augmented Auxiliary Networks for Deep Neural Architectures

Introduction

The success of deep neural networks, particularly in pattern recognition tasks, is largely underpinned by the backpropagation (BP) algorithm. Despite its widespread use, BP is biologically implausible and suffers from practical inefficiencies, notably its substantial memory consumption and the update locking problem, which have motivated the search for alternative training methods. Local learning, in which each layer of the network is updated independently, emerges as a viable alternative that sidesteps these pitfalls.

Supervised Local Learning: A Primer

Conventional supervised local learning methods attach a gradient-isolated auxiliary network to each hidden layer so that the layer can be optimized independently. This design inherently circumvents the update locking problem of BP and allows training to be parallelized more efficiently. However, splitting training into independent layer-wise optimization procedures has historically produced a sizable accuracy gap relative to end-to-end BP, chiefly because no gradient information is communicated across layers.
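
The following sketch illustrates this general training scheme: each hidden layer is paired with its own auxiliary classifier, and activations are detached before being handed to the next layer so that gradients remain local. It is a minimal PyTorch illustration; the block structure, auxiliary head, and function names are assumptions for exposition, not the design of any particular published method.

```python
import torch
import torch.nn as nn

class LocalBlock(nn.Module):
    """One hidden layer together with its gradient-isolated auxiliary head."""
    def __init__(self, in_ch, out_ch, num_classes):
        super().__init__()
        self.layer = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        # Auxiliary network: here simply global pooling plus a linear classifier.
        self.aux_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(out_ch, num_classes),
        )

    def forward(self, x):
        h = self.layer(x)
        return h, self.aux_head(h)

def local_train_step(blocks, optimizers, x, y):
    """Update each block from its own local loss; activations are detached
    before being passed on, so no gradient flows between blocks."""
    criterion = nn.CrossEntropyLoss()
    for block, opt in zip(blocks, optimizers):
        h, logits = block(x)
        loss = criterion(logits, y)
        opt.zero_grad()
        loss.backward()      # gradients stay within this block and its aux head
        opt.step()
        x = h.detach()       # gradient isolation: the next block sees a constant
    return loss.item()

# Usage sketch: three local blocks trained on random CIFAR-10-sized data.
blocks = [LocalBlock(c_in, c_out, 10) for c_in, c_out in [(3, 32), (32, 64), (64, 128)]]
optimizers = [torch.optim.SGD(b.parameters(), lr=0.1, momentum=0.9) for b in blocks]
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
print(local_train_step(blocks, optimizers, x, y))
```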

AugLocal: A Novel Approach

To address these limitations, the paper presents AugLocal, a method that strengthens the coupling between each hidden layer and its subsequent layers through the construction of its auxiliary network. Concretely, each hidden layer's auxiliary network is built by uniformly selecting a small subset of the layers that follow it in the primary network, which encourages the hidden layer to learn feature representations that are useful for downstream layers. The method further adopts a pyramidal structure, linearly decreasing the depth of auxiliary networks for deeper hidden layers, to retain sufficient capacity while limiting the computational cost of the auxiliary networks.
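
The sketch below shows how such a construction rule could be expressed in code: the auxiliary depth decays linearly with the hidden layer's position, and the auxiliary layers are drawn uniformly from the layers that follow it in the primary network. The function names, rounding convention, and minimum depth are illustrative assumptions, not the authors' implementation.

```python
def auxiliary_depth(layer_idx, num_hidden_layers, d_max, d_min=1):
    """Linearly decay auxiliary depth from d_max (first layer) to d_min (last)."""
    frac = layer_idx / max(num_hidden_layers - 1, 1)
    return round(d_max - frac * (d_max - d_min))

def select_auxiliary_layers(layer_idx, num_hidden_layers, d_max):
    """Uniformly pick indices of subsequent primary-network layers whose
    structure is copied into this hidden layer's auxiliary network."""
    depth = auxiliary_depth(layer_idx, num_hidden_layers, d_max)
    remaining = list(range(layer_idx + 1, num_hidden_layers))
    if not remaining or depth <= 0:
        return []
    step = (len(remaining) - 1) / max(depth - 1, 1)
    return [remaining[round(i * step)] for i in range(min(depth, len(remaining)))]

# Example: a 10-layer network with a maximum auxiliary depth of 4.
for l in range(9):
    print(l, select_auxiliary_layers(l, num_hidden_layers=10, d_max=4))
```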

Empirical Validation

Extensive evaluations on CIFAR-10, SVHN, STL-10, and ImageNet show that AugLocal substantially narrows the performance gap to BP-trained networks, reaching comparable accuracy while reducing GPU memory usage by around 40%. These results hold across a range of network architectures, including ResNet, VGG, MobileNet, EfficientNet, and RegNet, which underscores AugLocal's potential as a scalable local learning rule applicable to a wide array of deep learning tasks and architectures.

Theoretical Implications and Future Directions

AugLocal's construction of auxiliary networks not only offers a scalable answer to the challenges of supervised local learning in large-scale networks but also provides a foundation for future work in this area. The paper's comparison of hidden representations learned by AugLocal and by BP sheds light on how the two approaches shape intermediate features, opening avenues for further empirical and theoretical study of local learning mechanisms and their optimization.

Practical Considerations and Advancements

In practice, AugLocal paves the way for deploying high-performance deep neural networks on resource-constrained platforms, marking a methodological shift away from traditional BP. Its reduced memory footprint offers tangible benefits for applications where resource allocation is a critical concern, such as edge computing and mobile deployments.

Concluding Remarks

In summary, AugLocal represents a significant methodological advance in supervised local learning. By narrowing the performance gap to BP, reducing memory overhead, and scaling across diverse datasets and network architectures, it marks a notable step in the exploration of alternative neural network training paradigms. Its implications extend beyond the immediate results of this work to the broader pursuit of efficient, scalable, and biologically plausible learning algorithms.
