On the numerical reliability of nonsmooth autodiff: a MaxPool case study (2401.02736v2)
Abstract: This paper considers the reliability of automatic differentiation (AD) for neural networks involving the nonsmooth MaxPool operation. We investigate the behavior of AD across precision levels (16, 32, and 64 bits) and convolutional architectures (LeNet, VGG, and ResNet) on several datasets (MNIST, CIFAR10, SVHN, and ImageNet). Although AD can be incorrect, recent research has shown that it coincides with the derivative almost everywhere, even in the presence of nonsmooth operations such as MaxPool and ReLU. In practice, however, AD operates on floating-point numbers rather than reals, so it is necessary to identify the subsets of parameters on which AD can be numerically incorrect. These subsets include a bifurcation zone (where AD is incorrect over the reals) and a compensation zone (where AD is incorrect over floating-point numbers but correct over the reals). Training with SGD, we study how different choices of the nonsmooth MaxPool Jacobian affect learning at 16- and 32-bit precision. Our findings suggest that nonsmooth MaxPool Jacobians with lower norms help maintain stable and efficient test accuracy, whereas those with higher norms can result in instability and decreased performance. We also observe that the influence of MaxPool's nonsmooth Jacobians on learning can be reduced by using batch normalization, Adam-like optimizers, or higher precision.
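The choice of Jacobian at a nonsmooth point is concrete in code: when two entries of a pooling window tie for the maximum (a point of the bifurcation zone), backpropagation must commit to one element of the generalized Jacobian. The sketch below is a minimal illustration of this setting, not the paper's experimental code; it assumes PyTorch and simply checks which subgradient `max_pool2d`'s backward pass returns for a tied window at each precision level studied in the abstract.

```python
# Minimal sketch (ours, not the paper's code): probe the MaxPool Jacobian that
# autodiff returns at a nonsmooth point, across floating-point precisions.
import torch
import torch.nn.functional as F

def maxpool_grad_at_tie(dtype, device="cpu"):
    # 1x1x2x2 input whose single pooling window contains a tie: both 1.0
    # entries attain the max, so the generalized Jacobian is set-valued here
    # (a point of the "bifurcation zone" described in the abstract).
    x = torch.tensor([[[[1.0, 1.0],
                        [0.0, 0.0]]]], dtype=dtype, device=device,
                     requires_grad=True)
    y = F.max_pool2d(x, kernel_size=2)   # one scalar output
    y.backward(torch.ones_like(y))       # backpropagate a unit cotangent
    return x.grad.flatten().tolist()

for dtype in (torch.float16, torch.float32, torch.float64):
    try:
        print(dtype, maxpool_grad_at_tie(dtype))
    except RuntimeError as err:
        # float16 pooling may be unavailable on some CPU builds of PyTorch
        print(dtype, "skipped:", err)
```

Standard implementations typically route the entire incoming gradient to a single argmax entry, a low-norm element of the generalized Jacobian; a higher-norm alternative would, for instance, send the full gradient to every tied entry. Variations of this kind, combined with low precision, are the experimental knobs whose effect on training the paper measures.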