
Input Normalized Stochastic Gradient Descent Training of Deep Neural Networks

Published 20 Dec 2022 in cs.LG and eess.SP | arXiv:2212.09921v2

Abstract: In this paper, we propose Input Normalized Stochastic Gradient Descent (INSGD), a novel optimization algorithm for training machine learning models, inspired by the Normalized Least Mean Squares (NLMS) algorithm used in adaptive filtering. When training complex models on large datasets, the choice of optimizer parameters, particularly the learning rate, is crucial to avoid divergence. Our algorithm updates the network weights using stochastic gradient descent with $\ell_1$- and $\ell_2$-based normalizations applied to the learning rate, similar to NLMS. However, unlike existing normalization methods, we exclude the error term from the normalization process and instead normalize the update term using the input vector to the neuron. Our experiments demonstrate that our optimization algorithm achieves higher accuracy across different initialization settings. We evaluate the efficiency of our training algorithm on benchmark datasets using ResNet-18, WResNet-20, ResNet-50, and a toy neural network. INSGD improves the accuracy of ResNet-18 on CIFAR-10 from 92.42% to 92.71%, of WResNet-20 on CIFAR-100 from 76.20% to 77.39%, and of ResNet-50 on ImageNet-1K from 75.52% to 75.67%.
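
The abstract describes the update rule only in words; the snippet below is a minimal sketch of what such an input-normalized SGD step could look like, written in PyTorch. The function name insgd_step, its signature, the eps stabilizer, and the usage lines are assumptions added for illustration; they are not taken from the paper.

```python
import torch

def insgd_step(weight, grad, layer_input, lr=0.1, eps=1e-8, norm="l2"):
    """One INSGD-style parameter update (illustrative sketch, not the authors' code).

    Following the abstract: the SGD learning rate is normalized by the norm of the
    input vector to the neuron, and the error term is excluded from the normalization
    (unlike classical NLMS, which normalizes the full error-times-input update).
    """
    if norm == "l2":
        scale = layer_input.pow(2).sum()   # ||x||_2^2, as in the NLMS step size
    else:
        scale = layer_input.abs().sum()    # ||x||_1 variant mentioned in the abstract
    effective_lr = lr / (eps + scale)      # input-normalized learning rate
    return weight - effective_lr * grad    # otherwise a plain SGD step

# Hypothetical usage on a single linear neuron y = w^T x:
w = torch.randn(8)
x = torch.randn(8)   # input vector to the neuron
g = torch.randn(8)   # stochastic gradient with respect to w
w_new = insgd_step(w, g, x, lr=0.1)
```

Compared to NLMS, which rescales the error-times-input update by the squared input norm, this sketch keeps the raw stochastic gradient and only rescales the learning rate by the input norm, which is the distinction the abstract emphasizes.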

