Enhancing Stochastic Gradient Descent: A Unified Framework and Novel Acceleration Methods for Faster Convergence (2402.01515v1)

Published 2 Feb 2024 in cs.LG, cs.AI, and math.OC

Abstract: Based on SGD, previous works have proposed many algorithms that improve convergence speed and generalization in stochastic optimization, such as SGDm, AdaGrad, and Adam. However, their convergence analysis under non-convex conditions is challenging. In this work, we propose a unified framework to address this issue. For any first-order method, we interpret the update direction $g_t$ as the sum of the stochastic subgradient $\nabla f_t(x_t)$ and an additional acceleration term $\frac{2|\langle v_t, \nabla f_t(x_t) \rangle|}{\|v_t\|_2^2} v_t$, so that convergence can be discussed by analyzing $\langle v_t, \nabla f_t(x_t) \rangle$. Through this framework, we identify two plug-and-play acceleration methods, \textbf{Reject Accelerating} and \textbf{Random Vector Accelerating}, and theoretically demonstrate that both methods directly improve the convergence rate.
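The decomposition above can be made concrete with a short NumPy sketch. The choice $v_t = g_t - \nabla f_t(x_t)$, the function names, and the sign-based reject rule below are illustrative assumptions for a momentum-style update, not the paper's exact Reject Accelerating or Random Vector Accelerating definitions, which are given in the full text.

```python
import numpy as np

def decompose_update(g_t, grad_t):
    """Split an update direction g_t into the stochastic (sub)gradient grad_t
    plus a residual acceleration direction v_t (here taken as g_t - grad_t,
    an assumption for illustration) and report the alignment <v_t, grad_t>,
    the inner product the framework's analysis tracks."""
    v_t = g_t - grad_t
    alignment = np.dot(v_t, grad_t)
    return v_t, alignment

def reject_style_step(x_t, g_t, grad_t, lr=0.05):
    """Hypothetical 'reject'-style step: keep the acceleration term only when
    it is aligned with the stochastic gradient; otherwise fall back to a plain
    SGD step. A sketch of the idea, not the paper's exact rule."""
    _, alignment = decompose_update(g_t, grad_t)
    direction = g_t if alignment >= 0 else grad_t
    return x_t - lr * direction

if __name__ == "__main__":
    # g_t can come from any first-order method, e.g. heavy-ball momentum.
    rng = np.random.default_rng(0)
    x, m, beta = rng.normal(size=5), np.zeros(5), 0.9
    for _ in range(100):
        grad = 2 * x + 0.01 * rng.normal(size=5)  # noisy gradient of ||x||^2
        m = beta * m + grad                       # momentum buffer as g_t
        x = reject_style_step(x, m, grad)
    print("final ||x||:", np.linalg.norm(x))
```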
