Why Transformers Need Adam: A Hessian Perspective
Abstract: SGD performs worse than Adam by a significant margin on Transformers, but the reason remains unclear. In this work, we provide an explanation through the lens of the Hessian: (i) Transformers are "heterogeneous": the Hessian spectrum varies dramatically across parameter blocks, a phenomenon we call "block heterogeneity"; (ii) Heterogeneity hampers SGD: SGD performs worse than Adam on problems with block heterogeneity. To validate (i) and (ii), we examine various Transformers, CNNs, MLPs, and quadratic problems, and find that SGD can perform on par with Adam on problems without block heterogeneity, but performs worse than Adam when block heterogeneity exists. Our initial theoretical analysis indicates that SGD performs worse because it applies a single learning rate to all blocks, which cannot handle the heterogeneity among blocks. This limitation could be ameliorated by using coordinate-wise learning rates, as designed in Adam.
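To make point (ii) concrete, below is a minimal sketch, not the paper's actual experimental setup: a toy quadratic problem with two parameter blocks whose Hessians have very different eigenvalue scales (block heterogeneity). The block size, curvature scales, learning rate, and step count are illustrative assumptions, and the comparison uses the standard `torch.optim.SGD` and `torch.optim.Adam` optimizers.

```python
# Sketch (assumed setup): a quadratic loss split into two blocks with very
# different Hessian scales. SGD must use one learning rate for both blocks,
# while Adam rescales each coordinate.
import torch

torch.manual_seed(0)
d = 50
# Block 1: sharp curvature (eigenvalues around 100); Block 2: flat (around 0.1).
H1 = 100.0 * torch.diag(torch.rand(d) + 0.5)
H2 = 0.1 * torch.diag(torch.rand(d) + 0.5)

def loss_fn(w1, w2):
    # Sum of two independent quadratics, one per parameter block.
    return 0.5 * (w1 @ H1 @ w1 + w2 @ H2 @ w2)

def run(optimizer_cls, lr, steps=2000):
    w1 = torch.ones(d, requires_grad=True)
    w2 = torch.ones(d, requires_grad=True)
    opt = optimizer_cls([w1, w2], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(w1, w2).backward()
        opt.step()
    return loss_fn(w1, w2).item()

# SGD's single learning rate is capped by the sharp block (roughly 2 / lambda_max),
# so the flat block makes little progress; Adam's coordinate-wise scaling
# adapts to both blocks under the same nominal learning rate.
print("SGD :", run(torch.optim.SGD, lr=1e-2))
print("Adam:", run(torch.optim.Adam, lr=1e-2))
```

In this sketch, raising SGD's learning rate enough to move the flat block would exceed the stability threshold of the sharp block, which illustrates why a single learning rate struggles under block heterogeneity.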