
Sketchy: Memory-efficient Adaptive Regularization with Frequent Directions (2302.03764v2)

Published 7 Feb 2023 in stat.ML, cs.AI, and cs.LG

Abstract: Adaptive regularization methods that exploit more than the diagonal entries exhibit state-of-the-art performance for many tasks, but can be prohibitive in terms of memory and running time. We find the spectra of the Kronecker-factored gradient covariance matrix in deep learning (DL) training tasks are concentrated on a small leading eigenspace that changes throughout training, motivating a low-rank sketching approach. We describe a generic method for reducing memory and compute requirements of maintaining a matrix preconditioner using the Frequent Directions (FD) sketch. While previous approaches have explored applying FD for second-order optimization, we present a novel analysis which allows efficient interpolation between resource requirements and the degradation in regret guarantees with rank $k$: in the online convex optimization (OCO) setting over dimension $d$, we match full-matrix $d^2$ memory regret using only $dk$ memory up to additive error in the bottom $d-k$ eigenvalues of the gradient covariance. Further, we show extensions of our work to Shampoo, resulting in a method competitive in quality with Shampoo and Adam, yet requiring only sub-linear memory for tracking second moments.
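
For intuition, the core FD primitive referenced in the abstract can be written compactly. Below is a minimal per-update NumPy sketch in the spirit of the "even simpler" FD formulation; the function name and streaming interface are illustrative rather than the paper's implementation, and a practical version would shrink only when the buffer fills.

```python
import numpy as np

def fd_sketch(gradients, d, ell):
    """Minimal Frequent Directions sketch (assumes ell << d): maintains
    B (ell x d) so that B.T @ B approximates the gradient covariance
    sum_t g_t g_t^T using O(d * ell) memory instead of O(d^2)."""
    B = np.zeros((ell, d))
    for g in gradients:
        B[-1] = g                       # last row is free (zeroed below)
        _, s, Vt = np.linalg.svd(B, full_matrices=False)
        delta = s[-1] ** 2              # smallest squared singular value
        s = np.sqrt(np.maximum(s ** 2 - delta, 0.0))  # shrink the spectrum
        B = s[:, None] * Vt             # zeroes out at least the last row
    return B
```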

References (65)
  1. Adam: A method for stochastic optimization. In ICLR (Poster), 2015.
  2. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.
  3. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  4. Optimizing neural networks with kronecker-factored approximate curvature. In International conference on machine learning, pages 2408–2417. PMLR, 2015.
  5. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning, pages 1842–1850. PMLR, 2018.
  6. Efficient full-matrix adaptive regularization. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 102–110. PMLR, 09–15 Jun 2019.
  7. Extreme tensoring for low-memory preconditioning. In International Conference on Learning Representations, 2019.
  8. Memory efficient adaptive optimization. Advances in Neural Information Processing Systems, 32, 2019.
  9. Scalable second order optimization for deep learning. arXiv preprint arXiv:2002.09018, 2020.
  10. On the factory floor: Ml engineering for industrial-scale ads recommendation models, 2022.
  11. Ten lessons from three generations shaped google’s tpuv4i: Industrial product. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pages 1–14. IEEE, 2021.
  12. Evolution of the graphics processing unit (gpu). IEEE Micro, 41(6):42–51, 2021.
  13. Frequent directions: Simple and deterministic matrix sketching. SIAM Journal on Computing, 45(5):1762–1792, 2016.
  14. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pages 4596–4604. PMLR, 2018.
  15. Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
  16. Edo Liberty. Even simpler deterministic matrix sketching. arXiv preprint arXiv:2202.01780, 2022.
  17. Eigenvalues of the hessian in deep learning: Singularity and beyond. arXiv preprint arXiv:1611.07476, 2016.
  18. Empirical analysis of the hessian of over-parametrized neural networks. arXiv preprint arXiv:1706.04454, 2017.
  19. An investigation into neural net optimization via hessian eigenvalue density. In International Conference on Machine Learning, pages 2232–2241. PMLR, 2019.
  20. A deeper look at the hessian eigenspectrum of deep neural networks and its applications to regularization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 9481–9488, 2021.
  21. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1195–1199, 2017.
  22. Gradient descent happens in a tiny subspace. arXiv preprint arXiv:1812.04754, 2018.
  23. Understanding and exploiting the low-rank structure of deep networks. 2018.
  24. Rethinking the structure of stochastic gradients: Empirical and statistical evidence. arXiv preprint arXiv:2212.02083, 2022.
  25. Scalable adaptive stochastic optimization using random projections. Advances in Neural Information Processing Systems, 29, 2016.
  26. Efficient adaptive online learning via frequent directions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  27. Efficient second order online learning by sketching. Advances in Neural Information Processing Systems, 29, 2016.
  28. Efficient and robust high-dimensional linear contextual bandits. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pages 4259–4265, 2021.
  29. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
  30. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  31. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
  32. Ashok Cutkosky. Better full-matrix regret via parameter-free online learning. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 8836–8846. Curran Associates, Inc., 2020.
  33. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
  34. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100, 2020.
  35. Librispeech: an asr corpus based on public domain audio books. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 5206–5210. IEEE, 2015.
  36. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
  37. Open graph benchmark: Datasets for machine learning on graphs. In Hugo Larochelle, Marc Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  38. Do imagenet classifiers generalize to imagenet? In International Conference on Machine Learning, pages 5389–5400, 2019.
  39. Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
  40. Andrew V Knyazev. Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned conjugate gradient method. SIAM journal on scientific computing, 23(2):517–541, 2001.
  41. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
  42. When does preconditioning help or hurt generalization? In International Conference on Learning Representations, 2020.
  43. Robust frequent directions with application in online learning. The Journal of Machine Learning Research, 20(1):1697–1737, 2019.
  44. Libsvm: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST), 2(3):1–27, 2011.
  45. Logistic regression: Tight bounds for stochastic and online optimization. In Conference on Learning Theory, pages 197–209. PMLR, 2014.
  46. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
  47. Koenraad MR Audenaert. A generalisation of mirsky’s singular value inequalities. arXiv preprint arXiv:1410.4941, 2014.
  48. init2winit: a jax codebase for initialization, optimization, and tuning research, 2021. URL http://github.com/google/init2winit.
  49. JAX: composable transformations of Python+NumPy programs, 2018.
  50. Flax: A neural network library and ecosystem for JAX, 2020.
  51. Tensorflow datasets, a collection of ready-to-use datasets. https://www.tensorflow.org/datasets, 2023.
  52. Michael L. Waskom. seaborn: statistical data visualization. Journal of Open Source Software, 6(60):3021, 2021. doi: 10.21105/joss.03021.
  53. J. D. Hunter. Matplotlib: A 2d graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007. doi: 10.1109/MCSE.2007.55.
  54. Array programming with NumPy. Nature, 585(7825):357–362, September 2020. doi: 10.1038/s41586-020-2649-2.
  55. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020. doi: 10.1038/s41592-019-0686-2.
  56. MLCommons® open engineering consortium. MLCommons Algorithmic Efficiency. https://github.com/mlcommons/algorithmic-efficiency, 2023.
  57. Mlperf inference benchmark, 2019.
  58. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  59. Evaluation of Distributed Shampoo: Comparison of optimizers: Distributed Shampoo, Adam & Adafactor. Weights & Biases Report, 2022.
  60. Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. Cited on, 14(8):2, 2012.
  61. Disentangling adaptive gradient methods from learning rates. arXiv preprint arXiv:2002.11803, 2020.
  62. Tsuyoshi Ando. Concavity of certain maps on positive definite matrices and applications to hadamard products. Linear algebra and its applications, 26:203–241, 1979.
  63. Rajendra Bhatia. Matrix analysis. Springer, 1997.
  64. The matrix cookbook. Technical University of Denmark, 7(15):510, 2008.
  65. Roger W Brockett. Finite dimensional linear systems. SIAM, 2015.

Summary

  • The paper demonstrates that a low-rank sketch of gradient covariance via Frequent Directions can yield memory efficiency while maintaining adaptive regularization performance.
  • It pairs a spectral analysis of the gradient covariance with a regret analysis in online convex optimization, justifying a dynamic low-rank approach that nearly matches full-matrix AdaGrad's regret bounds.
  • Empirical evaluations on models like ResNet and Conformer confirm reduced memory usage and competitive performance, enabling scalable deep learning training.

Analysis of "Sketchy: Memory-efficient Adaptive Regularization with Frequent Directions"

This paper introduces a novel approach to memory-efficient adaptive regularization in deep learning optimization through the use of the Frequent Directions (FD) sketch, focusing on efficiently managing the Kronecker-factored gradient covariance matrix. The authors propose a dynamic low-rank sketching method adapted to second-order optimization, incorporating a novel regret analysis specifically tailored for the online convex optimization (OCO) setting. This approach seeks to improve upon standard adaptive gradient methods, like Adam or classical AdaGrad, by reducing the memory footprint tied to maintaining a dense matrix preconditioner, while also capitalizing on the spectral properties observed in gradient covariance matrices during training.
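
To make the preconditioning step concrete, the eigenspace captured by the sketch can be combined with a scalar term tracking the mass FD discards, so that the inverse-square-root preconditioner is applied in O(d * ell) time rather than O(d^2). The snippet below is a simplified illustration under the assumption of a single scalar escaped-mass term `rho`; the names and interface are hypothetical, not the paper's API.

```python
import numpy as np

def precondition(grad, s, Vt, rho, eps=1e-8):
    """Apply M^{-1/2} @ grad, where M = Vt.T @ diag(s**2) @ Vt + (rho + eps) * I.

    s, Vt come from an FD sketch B = diag(s) @ Vt; rho accumulates the
    shrinkage (mass lost by FD) and acts as a dynamic diagonal regularizer."""
    coeff = Vt @ grad                                      # project onto sketch span
    in_span = Vt.T @ (coeff / np.sqrt(s ** 2 + rho + eps)) # scaled inside the span
    out_span = (grad - Vt.T @ coeff) / np.sqrt(rho + eps)  # isotropic outside it
    return in_span + out_span
```

Only the ell x d sketch and a scalar are stored, which is the memory saving the paper is after.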

Main Contributions

  1. Spectral Analysis of Gradient Covariance: The authors provide evidence that the gradient covariance matrix's spectrum in deep learning models is concentrated in a small leading eigenspace. This observation underpins the choice of using a low-rank matrix sketching approach, as it suggests only a minor portion of the eigenspace needs to be tracked for effective adaptive regularization.
  2. Frequent Directions in Online Convex Optimization: Employing the FD sketch, the paper presents a memory-efficient approach that achieves regret bounds similar to the full-matrix AdaGrad, with substantially less memory usage. The authors demonstrate this through a novel analysis method in OCO, merging FD with dynamic diagonal regularization.
  3. Algorithm Development and Evaluation: The paper extends the use of FD to various adaptive optimization algorithms, including Shampoo and variants that utilize exponential moving averages. These algorithms are evaluated across several modern deep learning settings, demonstrating quality competitive with traditional methods that require at least linear memory in the parameter count (a bare-bones sketch of the underlying Shampoo update follows this list).
  4. Practical Implementations: The proposed algorithms were tested in neural network training settings, notably with architectures such as ResNet and the Conformer model, showcasing the applicability of the Sketchy approach in realistic scenarios. The experiments emphasized reductions in memory consumption while retaining competitive performance compared to existing optimizers such as Adam.
  5. Empirical Results: The empirical investigations show that the Sketchy approach improves the overall memory-quality tradeoff: the experiments highlight a Pareto improvement from using a higher-rank approximation rather than resorting to rank-1 preconditioning.
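
For reference, the Shampoo update that these extensions build on maintains two Kronecker factors per matrix-shaped parameter. The bare-bones, full-memory version below (no momentum, grafting, or amortized root inverses) makes the O(m^2 + n^2) second-moment memory cost visible; FD-based variants in the spirit of the paper would replace L and R with low-rank sketches. The helper is illustrative, not the paper's implementation.

```python
import numpy as np

def shampoo_step(W, G, L, R, lr=0.1, eps=1e-6):
    """One full-matrix Shampoo step for a parameter W (m x n) with gradient G:
    L and R accumulate G @ G.T and G.T @ G, and the update is
    L^{-1/4} @ G @ R^{-1/4}. Storing (L, R) costs O(m^2 + n^2) memory."""
    L += G @ G.T
    R += G.T @ G

    def inv_fourth_root(M):
        # Symmetric eigendecomposition-based matrix power M^{-1/4}.
        w, Q = np.linalg.eigh(M + eps * np.eye(M.shape[0]))
        return (Q * w ** -0.25) @ Q.T

    W -= lr * inv_fourth_root(L) @ G @ inv_fourth_root(R)
    return W, L, R
```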

Implications and Future Directions

The use of the FD sketch for managing memory efficiency in deep learning optimization presents significant advantages, particularly as model sizes continue to grow. From a theoretical standpoint, the paper advances a novel application of spectral analysis of gradient covariance matrices, providing a foundation for future work on reducing resource usage in model training without sacrificing convergence.

Practically, the proposed methods help mitigate memory-bandwidth bottlenecks that have become increasingly prominent as accelerator compute throughput grows faster than memory bandwidth. This widening gap is an important consideration for researchers and practitioners devising strategies for training and deploying large models efficiently.

For future developments, potential research could explore optimizations beyond the current rank-tuning restrictions, adaptive FD-based rank schedules, or leveraging these spectral properties across diverse architectures and domains. Additionally, further investigation into the trade-off between memory and computation under varying network and architectural constraints remains an area of vital interest.

In conclusion, the proposed Sketchy algorithms are a sound contribution toward more memory-efficient deep learning optimization, balancing memory constraints against algorithmic performance in demanding AI applications.