Adapprox: Adaptive Approximation in Adam Optimization via Randomized Low-Rank Matrices (2403.14958v1)

Published 22 Mar 2024 in cs.LG, cs.CL, and math.OC

Abstract: As deep learning models exponentially increase in size, optimizers such as Adam encounter significant memory consumption challenges due to the storage of first and second moment data. Current memory-efficient methods like Adafactor and CAME often compromise accuracy with their matrix factorization techniques. Addressing this, we introduce Adapprox, a novel approach that employs randomized low-rank matrix approximation for a more effective and accurate approximation of Adam's second moment. Adapprox features an adaptive rank selection mechanism, finely balancing accuracy and memory efficiency, and includes an optional cosine similarity guidance strategy to enhance stability and expedite convergence. In GPT-2 training and downstream tasks, Adapprox surpasses AdamW by achieving 34.5% to 49.9% and 33.8% to 49.9% memory savings for the 117M and 345M models, respectively, with the first moment enabled, and further increases these savings without the first moment. Moreover, it enhances convergence speed and improves downstream task performance relative to its counterparts.
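The abstract only names the core building block, randomized low-rank matrix approximation of the second-moment statistics. As a rough illustration of that building block (in the spirit of the randomized SVD of Halko et al., reference 11), the sketch below factors a second-moment matrix into rank-r components. This is an assumption-laden sketch, not the paper's Adapprox algorithm: the function name, the oversampling parameter, the fixed rank (Adapprox selects rank adaptively), and the use of PyTorch are all illustrative choices.

```python
import torch

def randomized_low_rank(V, rank, oversample=4):
    """Rank-`rank` approximation of a matrix V via randomized SVD
    (Halko et al., 2011). Returns U, S, Vh with V ~= U @ diag(S) @ Vh."""
    n, m = V.shape
    # Random test matrix; oversampling improves how well the subspace is captured.
    Omega = torch.randn(m, rank + oversample, dtype=V.dtype, device=V.device)
    # Orthonormal basis for the range of V @ Omega.
    Q, _ = torch.linalg.qr(V @ Omega)
    # Project V onto that basis and factor the small projected matrix.
    B = Q.T @ V
    U_b, S, Vh = torch.linalg.svd(B, full_matrices=False)
    U = Q @ U_b
    return U[:, :rank], S[:rank], Vh[:rank, :]

# Hypothetical usage: approximate an (n, m) second-moment matrix and keep only
# the factors, costing O((n + m) * rank) memory instead of O(n * m).
V = torch.rand(1024, 1024) ** 2              # stand-in for a second-moment matrix
U, S, Vh = randomized_low_rank(V, rank=8)
V_approx = U @ torch.diag(S) @ Vh
print(torch.linalg.norm(V - V_approx) / torch.linalg.norm(V))
```

The trade-off this sketch is meant to show is the one the abstract describes: a low-rank factorization replaces the full second-moment matrix, so memory falls roughly in proportion to the chosen rank, while approximation accuracy depends on how much of the matrix's spectrum that rank captures.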

References (35)
  1. Memory efficient adaptive optimization. Advances in Neural Information Processing Systems, 32, 2019.
  2. Computing low-rank approximations of large-scale matrices with the tensor network randomized SVD. SIAM Journal on Matrix Analysis and Applications, 39(3):1221–1244, 2018.
  3. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
  4. Denoising of hyperspectral images using nonconvex low rank matrix approximation. IEEE Transactions on Geoscience and Remote Sensing, 55(9):5366–5380, 2017.
  5. Automatically constructing a corpus of sentential paraphrases. In Third International Workshop on Paraphrasing (IWP2005), 2005.
  6. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.
  7. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
  8. Matrix Computations. JHU Press, 2013.
  9. Patch-based image inpainting via two-stage low rank approximation. IEEE Transactions on Visualization and Computer Graphics, 24(6):2023–2036, 2017.
  10. Structured low-rank matrix factorization: Optimality, algorithm, and applications to image processing. In International Conference on Machine Learning, pp. 2007–2015. PMLR, 2014.
  11. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
  12. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  13. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.
  14. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018.
  15. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
  16. Memory efficient optimizers with 4-bit states. arXiv preprint arXiv:2309.01507, 2023.
  17. Low-rank matrix approximation with stability. In International Conference on Machine Learning, pp. 295–303. PMLR, 2016.
  18. Large-scale Nyström kernel matrix approximation using randomized SVD. IEEE Transactions on Neural Networks and Learning Systems, 26(1):152–164, 2014.
  19. Randomized algorithms for the low-rank approximation of matrices. Proceedings of the National Academy of Sciences, 104(51):20167–20172, 2007.
  20. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018.
  21. CAME: Confidence-guided adaptive memory efficient optimization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 4442–4453, 2023.
  22. Nakatsukasa, Y. Fast and stable randomized low-rank matrix approximation. arXiv preprint arXiv:2009.11392, 2020.
  23. Sparse PCA through low-rank approximations. In International Conference on Machine Learning, pp. 747–755. PMLR, 2013.
  24. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, volume 32, pp. 8024–8035, 2019.
  25. Paterek, A. Improving regularized singular value decomposition for collaborative filtering. In Proceedings of KDD Cup and Workshop, volume 2007, pp. 5–8, 2007.
  26. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  27. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
  28. A randomized algorithm for principal component analysis. SIAM Journal on Matrix Analysis and Applications, 31(3):1100–1124, 2010.
  29. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pp. 4596–4604. PMLR, 2018.
  30. Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis, 99(6):1015–1034, 2008.
  31. Megatron-LM: Training multi-billion parameter language models using model parallelism, 2019.
  32. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642, 2013.
  33. MODeL: Memory optimizations for deep learning. In International Conference on Machine Learning, pp. 32618–32632. PMLR, 2023.
  34. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471, 2018.
  35. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426, 2017.