Adapting Newton's Method to Neural Networks through a Summary of Higher-Order Derivatives (2312.03885v2)
Abstract: We consider a gradient-based optimization method applied to a function $\mathcal{L}$ of a vector of variables $\boldsymbol{\theta}$, in the case where $\boldsymbol{\theta}$ is represented as a tuple of tensors $(\mathbf{T}_1, \cdots, \mathbf{T}_S)$. This framework encompasses many common use cases, such as training neural networks by gradient descent. First, we propose a computationally inexpensive technique, based on automatic differentiation and computational tricks, that provides higher-order information on $\mathcal{L}$, in particular about the interactions between the tensors $\mathbf{T}_s$. Second, we use this technique at order 2 to build a second-order optimization method suitable, among other things, for training deep neural networks of various architectures. This second-order method leverages the partition of $\boldsymbol{\theta}$ into the tensors $(\mathbf{T}_1, \cdots, \mathbf{T}_S)$ in such a way that it requires neither the computation of the Hessian of $\mathcal{L}$ with respect to $\boldsymbol{\theta}$ nor any approximation of it. The key step is the computation of a smaller matrix, interpretable as a "Hessian with respect to the partition", which can be obtained exactly and efficiently. In contrast to many existing practical second-order methods for neural networks, which rely on a diagonal or block-diagonal approximation of the Hessian or of its inverse, the proposed method does not neglect interactions between layers. Finally, the coarseness of the partition can be tuned to recover well-known optimization methods: the coarsest case corresponds to Cauchy's steepest descent method, while the finest case corresponds to the usual Newton's method.
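As a rough illustration of the order-2 construction described in the abstract, here is a minimal PyTorch sketch under one concrete reading of it: the partition-level directions are taken to be the per-tensor gradients (each embedded into the full parameter vector with zeros elsewhere), and the $S \times S$ "Hessian with respect to the partition" is filled with exact Hessian-vector products obtained by automatic differentiation, without ever forming the full Hessian. The function name `partition_step`, the choice of directions, and the undamped linear solve are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (illustrative, not the paper's code): one step size per tensor,
# derived from an S x S reduced Hessian built with exact Hessian-vector products.
import torch

def partition_step(loss, params):
    """loss: scalar with an autograd graph; params: tensors (T_1, ..., T_S)."""
    S = len(params)
    # Per-tensor gradients g_s, keeping the graph for second-order terms.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Reduced gradient: entry s is u_s^T grad(L) = ||g_s||^2, where u_s equals
    # g_s on tensor T_s and is zero on the other tensors.
    g_bar = torch.stack([(g.detach() ** 2).sum() for g in grads])
    # Reduced Hessian: H_bar[s, t] = u_s^T H u_t, via S Hessian-vector products
    # (Pearlmutter's trick); the full Hessian H is never formed.
    H_bar = torch.zeros(S, S, dtype=g_bar.dtype, device=g_bar.device)
    for s in range(S):
        hvp = torch.autograd.grad(
            grads[s], params, grad_outputs=grads[s].detach(),
            retain_graph=True, allow_unused=True,
        )  # hvp[t] is block t of H u_s (None if that block is identically zero)
        for t in range(S):
            if hvp[t] is not None:
                H_bar[s, t] = (hvp[t] * grads[t].detach()).sum()
    # Newton's method "on the partition": minimize the quadratic model over the
    # S step sizes. In practice H_bar would likely need damping before solving.
    etas = torch.linalg.solve(H_bar, g_bar)
    with torch.no_grad():
        for p, g, eta in zip(params, grads, etas):
            p.sub_(eta * g)  # T_s <- T_s - eta_s * g_s
    return etas
```

With $S = 1$ (a single step size for the whole parameter vector) this reduces to Cauchy-style steepest descent with an optimal quadratic step, while taking $S$ equal to the number of scalar parameters would recover the usual Newton step, matching the two limiting cases mentioned in the abstract.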