Sketchy: Memory-efficient Adaptive Regularization with Frequent Directions (2302.03764v2)
Abstract: Adaptive regularization methods that exploit more than the diagonal entries exhibit state-of-the-art performance for many tasks, but can be prohibitive in terms of memory and running time. We find that the spectra of the Kronecker-factored gradient covariance matrix in deep learning (DL) training tasks are concentrated on a small leading eigenspace that changes throughout training, motivating a low-rank sketching approach. We describe a generic method for reducing the memory and compute requirements of maintaining a matrix preconditioner using the Frequent Directions (FD) sketch. While previous approaches have explored applying FD to second-order optimization, we present a novel analysis which allows efficient interpolation between resource requirements and the degradation in regret guarantees with rank $k$: in the online convex optimization (OCO) setting over dimension $d$, we match full-matrix $d^2$ memory regret using only $dk$ memory, up to additive error in the bottom $d-k$ eigenvalues of the gradient covariance. Further, we show extensions of our work to Shampoo, resulting in a method competitive in quality with Shampoo and Adam, yet requiring only sub-linear memory for tracking second moments.
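To make the sketching idea concrete, below is a minimal NumPy illustration of the two ingredients the abstract describes: a Frequent Directions update that maintains a rank-limited sketch of the gradient covariance in $O(dk)$-style memory, and a helper that applies an inverse-root preconditioner through that sketch without ever forming the $d \times d$ covariance. This is a simplified sketch of the generic technique, not the paper's algorithm: the names `fd_update` and `precondition` and the parameters `ell`, `eps`, and `power` are illustrative choices, and the paper's Shampoo extension instead maintains one such low-rank state per Kronecker factor.

```python
import numpy as np

def fd_update(sketch, g):
    """Fold one gradient into a Frequent Directions (FD) sketch.

    sketch: (ell, d) array with ell < d; sketch.T @ sketch tracks the
        running covariance sum_t g_t g_t^T, with additive error bounded
        in terms of the bottom eigenvalues of the covariance.
    g: (d,) gradient vector.
    """
    ell, _ = sketch.shape
    stacked = np.vstack([sketch, g])                 # (ell + 1, d)
    _, s, vt = np.linalg.svd(stacked, full_matrices=False)
    delta = s[-1] ** 2                               # smallest retained energy
    s = np.sqrt(np.maximum(s ** 2 - delta, 0.0))     # FD deflation step
    return (s[:, None] * vt)[:ell]                   # last row is now zero

def precondition(sketch, g, eps=1e-4, power=-0.5):
    """Apply (sketch.T @ sketch + eps * I) ** power to g in O(d * ell)
    memory, never materializing the d x d covariance. A Shampoo-style
    variant would use power = -1/4 on each Kronecker factor.
    """
    _, s, vt = np.linalg.svd(sketch, full_matrices=False)
    # Low-rank eigenvalue correction on top of the isotropic eps ** power part.
    lam = (s ** 2 + eps) ** power - eps ** power
    return (eps ** power) * g + vt.T @ (lam * (vt @ g))

# Toy usage: stream gradients through the sketch, then precondition one.
d, ell = 512, 16
sketch = np.zeros((ell, d))
rng = np.random.default_rng(0)
for _ in range(200):
    g = rng.standard_normal(d)
    sketch = fd_update(sketch, g)
step = precondition(sketch, g)  # ~ (G^T G + eps * I)^{-1/2} g
```

The deflation step is what yields the additive-error guarantee quoted in the abstract: each update subtracts only the smallest retained squared singular value, and the total mass subtracted over a run is controlled by the tail eigenvalue mass of the gradient covariance.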
- Adam: A method for stochastic optimization. In ICLR (Poster), 2015.
- Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.
- Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pages 2408–2417. PMLR, 2015.
- Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning, pages 1842–1850. PMLR, 2018.
- Efficient full-matrix adaptive regularization. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 102–110. PMLR, 09–15 Jun 2019.
- Extreme tensoring for low-memory preconditioning. In International Conference on Learning Representations, 2019.
- Memory efficient adaptive optimization. Advances in Neural Information Processing Systems, 32, 2019.
- Scalable second order optimization for deep learning. arXiv preprint arXiv:2002.09018, 2020.
- On the factory floor: ML engineering for industrial-scale ads recommendation models, 2022.
- Ten lessons from three generations shaped Google’s TPUv4i: Industrial product. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pages 1–14. IEEE, 2021.
- Evolution of the graphics processing unit (GPU). IEEE Micro, 41(6):42–51, 2021.
- Frequent directions: Simple and deterministic matrix sketching. SIAM Journal on Computing, 45(5):1762–1792, 2016.
- Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pages 4596–4604. PMLR, 2018.
- Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
- Edo Liberty. Even simpler deterministic matrix sketching. arXiv preprint arXiv:2202.01780, 2022.
- Eigenvalues of the Hessian in deep learning: Singularity and beyond. arXiv preprint arXiv:1611.07476, 2016.
- Empirical analysis of the Hessian of over-parametrized neural networks. arXiv preprint arXiv:1706.04454, 2017.
- An investigation into neural net optimization via Hessian eigenvalue density. In International Conference on Machine Learning, pages 2232–2241. PMLR, 2019.
- A deeper look at the Hessian eigenspectrum of deep neural networks and its applications to regularization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 9481–9488, 2021.
- Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1195–1199, 2017.
- Gradient descent happens in a tiny subspace. arXiv preprint arXiv:1812.04754, 2018.
- Understanding and exploiting the low-rank structure of deep networks. 2018.
- Rethinking the structure of stochastic gradients: Empirical and statistical evidence. arXiv preprint arXiv:2212.02083, 2022.
- Scalable adaptive stochastic optimization using random projections. Advances in Neural Information Processing Systems, 29, 2016.
- Efficient adaptive online learning via frequent directions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
- Efficient second order online learning by sketching. Advances in Neural Information Processing Systems, 29, 2016.
- Efficient and robust high-dimensional linear contextual bandits. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pages 4259–4265, 2021.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
- Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
- Ashok Cutkosky. Better full-matrix regret via parameter-free online learning. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 8836–8846. Curran Associates, Inc., 2020.
- ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
- Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100, 2020.
- LibriSpeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE, 2015.
- Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
- Open graph benchmark: Datasets for machine learning on graphs. In Hugo Larochelle, Marc Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
- Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, pages 5389–5400, 2019.
- Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018.
- Andrew V Knyazev. Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned conjugate gradient method. SIAM Journal on Scientific Computing, 23(2):517–541, 2001.
- On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
- When does preconditioning help or hurt generalization? In International Conference on Learning Representations, 2020.
- Robust frequent directions with application in online learning. The Journal of Machine Learning Research, 20(1):1697–1737, 2019.
- LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1–27, 2011.
- Logistic regression: Tight bounds for stochastic and online optimization. In Conference on Learning Theory, pages 197–209. PMLR, 2014.
- Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
- Koenraad MR Audenaert. A generalisation of Mirsky’s singular value inequalities. arXiv preprint arXiv:1410.4941, 2014.
- init2winit: A JAX codebase for initialization, optimization, and tuning research, 2021. URL http://github.com/google/init2winit.
- JAX: composable transformations of Python+NumPy programs, 2018.
- Flax: A neural network library and ecosystem for JAX, 2020.
- TensorFlow Datasets, a collection of ready-to-use datasets. https://www.tensorflow.org/datasets, 2023.
- Michael L. Waskom. seaborn: statistical data visualization. Journal of Open Source Software, 6(60):3021, 2021. doi: 10.21105/joss.03021.
- J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007. doi: 10.1109/MCSE.2007.55.
- Array programming with NumPy. Nature, 585(7825):357–362, September 2020. doi: 10.1038/s41586-020-2649-2.
- SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020. doi: 10.1038/s41592-019-0686-2.
- MLCommons® open engineering consortium. MLCommons Algorithmic Efficiency. https://github.com/mlcommons/algorithmic-efficiency, 2023.
- MLPerf inference benchmark, 2019.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- Evaluation of Distributed Shampoo: Comparison of optimizers: Distributed Shampoo, Adam & Adafactor. Weights & Biases Report, 2022.
- Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent, 2012.
- Disentangling adaptive gradient methods from learning rates. arXiv preprint arXiv:2002.11803, 2020.
- Tsuyoshi Ando. Concavity of certain maps on positive definite matrices and applications to Hadamard products. Linear Algebra and its Applications, 26:203–241, 1979.
- Rajendra Bhatia. Matrix analysis. Springer, 1997.
- The matrix cookbook. Technical University of Denmark, 7(15):510, 2008.
- Roger W Brockett. Finite dimensional linear systems. SIAM, 2015.