Parameter-free Clipped Gradient Descent Meets Polyak (2405.15010v2)

Published 23 May 2024 in cs.LG and math.OC

Abstract: Gradient descent and its variants are the de facto standard algorithms for training machine learning models. Because gradient descent is sensitive to its hyperparameters, they must be tuned carefully, typically via grid search. However, grid search is time-consuming, particularly when multiple hyperparameters are involved. Recent studies have therefore analyzed parameter-free methods that adjust the hyperparameters on the fly. However, existing work is limited to parameter-free methods for the stepsize; parameter-free methods for other hyperparameters have not been explored. For instance, although the gradient clipping threshold is, alongside the stepsize, a crucial hyperparameter for preventing gradient explosion, no existing study has investigated parameter-free methods for clipped gradient descent. In this study, we therefore investigate parameter-free methods for clipped gradient descent. Specifically, we propose Inexact Polyak Stepsize, which converges to the optimal solution without any hyperparameter tuning, and whose convergence rate is asymptotically independent of $L$ under the $L$-smooth and $(L_0, L_1)$-smooth assumptions on the loss function, matching that of clipped gradient descent with well-tuned hyperparameters. We numerically validated our convergence results using a synthetic function and demonstrated the effectiveness of our proposed method using LSTM, Nano-GPT, and T5.
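To make the two ingredients in the abstract concrete, below is a minimal sketch of clipped gradient descent driven by the classical Polyak stepsize $\eta_t = (f(x_t) - f^\*)/\lVert\nabla f(x_t)\rVert^2$. This is not the paper's Inexact Polyak Stepsize: the names `loss_fn`, `grad_fn`, `f_star`, and `clip_threshold` are illustrative assumptions, and the sketch hand-sets both the optimal value and the clipping threshold, which is exactly the tuning burden the proposed method aims to remove.

```python
import numpy as np

def clipped_gd_polyak(loss_fn, grad_fn, x0, f_star=0.0, clip_threshold=1.0, n_iters=200):
    """Minimal sketch: gradient descent with a Polyak stepsize and norm clipping.

    Illustrative only -- f_star (the optimal loss value) and clip_threshold
    are assumed to be known/hand-set here, unlike in the paper's
    parameter-free Inexact Polyak Stepsize.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        g = grad_fn(x)
        g_norm = np.linalg.norm(g)
        if g_norm == 0.0:  # already at a stationary point
            break
        # Classical Polyak stepsize: eta = (f(x) - f*) / ||grad f(x)||^2
        eta = (loss_fn(x) - f_star) / (g_norm ** 2)
        step = eta * g
        # Gradient clipping: cap the norm of the update at clip_threshold
        step_norm = np.linalg.norm(step)
        if step_norm > clip_threshold:
            step *= clip_threshold / step_norm
        x = x - step
    return x

# Toy usage on the quadratic f(x) = 0.5 * ||x||^2, whose minimum value is 0.
if __name__ == "__main__":
    loss = lambda x: 0.5 * float(np.dot(x, x))
    grad = lambda x: x
    x_opt = clipped_gd_polyak(loss, grad, x0=np.array([5.0, -3.0]))
    print(x_opt)  # close to the origin
```

On the toy quadratic the Polyak stepsize evaluates to 0.5 at every iterate, so the clipping only engages far from the optimum; the paper's contribution is obtaining comparable behavior without supplying `f_star` or `clip_threshold` by hand.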
