Adapting Newton's Method to Neural Networks through a Summary of Higher-Order Derivatives (2312.03885v2)

Published 6 Dec 2023 in cs.LG and math.OC

Abstract: We consider a gradient-based optimization method applied to a function $\mathcal{L}$ of a vector of variables $\boldsymbol{\theta}$, in the case where $\boldsymbol{\theta}$ is represented as a tuple of tensors $(\mathbf{T}_1, \cdots, \mathbf{T}_S)$. This framework encompasses many common use cases, such as training neural networks by gradient descent. First, we propose a computationally inexpensive technique that provides higher-order information about $\mathcal{L}$, especially about the interactions between the tensors $\mathbf{T}_s$, based on automatic differentiation and computational tricks. Second, we use this technique at order 2 to build a second-order optimization method which is suitable, among other things, for training deep neural networks of various architectures. This second-order method leverages the partition structure of $\boldsymbol{\theta}$ into tensors $(\mathbf{T}_1, \cdots, \mathbf{T}_S)$, so that it requires neither the computation of the Hessian of $\mathcal{L}$ with respect to $\boldsymbol{\theta}$ nor any approximation of it. The key step consists in computing a smaller matrix, interpretable as a "Hessian according to the partition", which can be obtained exactly and efficiently. In contrast to many existing practical second-order methods used in neural networks, which perform a diagonal or block-diagonal approximation of the Hessian or its inverse, the method we propose does not neglect interactions between layers. Finally, we can tune the coarseness of the partition to recover well-known optimization methods: the coarsest case corresponds to Cauchy's steepest descent method, while the finest case corresponds to the usual Newton's method.
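
To make the idea concrete, here is a minimal PyTorch sketch of one way such a "Hessian according to the partition" can be formed without materializing the full Hessian: probe the Hessian with one direction per tensor using Hessian-vector products (the standard autodiff trick of Pearlmutter, 1994), assemble the resulting S x S matrix, and solve a small linear system for per-tensor step sizes. This is not the authors' implementation; the function name partition_hessian_step, the choice of each tensor's gradient as its probing direction, and the damping term are illustrative assumptions.

import torch

def partition_hessian_step(loss, params, damping=1e-5):
    """Per-tensor update directions from a small S x S 'partition Hessian'.

    loss   : scalar tensor computed from `params`, with its graph still alive.
    params : list of S tensors (the partition of theta into T_1, ..., T_S).
    Returns a list of S update tensors, one per element of the partition.
    """
    # Per-tensor gradients, kept differentiable so Hessian-vector products are possible.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    S = len(params)

    # Probing direction for each tensor: its own gradient (an illustrative choice).
    dirs = [g.detach() for g in grads]

    # Reduced Hessian H[s, t] = d_s^T (d^2 L / dT_s dT_t) d_t, one Hessian-vector
    # product per tensor; the full Hessian over theta is never formed.
    H = loss.new_zeros(S, S)
    for t in range(S):
        probe = [dirs[t] if s == t else torch.zeros_like(p) for s, p in enumerate(params)]
        hvp = torch.autograd.grad(grads, params, grad_outputs=probe, retain_graph=True)
        for s in range(S):
            H[s, t] = (hvp[s] * dirs[s]).sum()

    # Reduced gradient: b_s = g_s^T d_s (here, the squared gradient norm of tensor s).
    b = torch.stack([(g.detach() * d).sum() for g, d in zip(grads, dirs)])

    # Newton step on the reduced S x S problem; damping keeps the system well-posed.
    eye = torch.eye(S, dtype=H.dtype, device=H.device)
    alphas = torch.linalg.solve(H + damping * eye, b)
    return [-a * d for a, d in zip(alphas, dirs)]

In a training loop, the returned per-tensor directions would replace the usual uniformly scaled gradient step. With S = 1 (the coarsest partition) the step reduces to the optimal step length along the gradient under the local quadratic model, matching the Cauchy steepest-descent limit mentioned in the abstract; refining the partition down to individual scalars approaches the full Newton step.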

