Regularized Adaptive Momentum Dual Averaging with an Efficient Inexact Subproblem Solver for Training Structured Neural Network (2403.14398v2)

Published 21 Mar 2024 in cs.LG and math.OC

Abstract: We propose a Regularized Adaptive Momentum Dual Averaging (RAMDA) algorithm for training structured neural networks. Similar to existing regularized adaptive methods, the subproblem for computing the update direction of RAMDA involves a nonsmooth regularizer and a diagonal preconditioner, and therefore does not possess a closed-form solution in general. We thus also carefully devise an implementable inexactness condition that retains convergence guarantees similar to the exact versions, and propose a companion efficient solver for the subproblems of both RAMDA and existing methods to make them practically feasible. We leverage the theory of manifold identification in variational analysis to show that, even in the presence of such inexactness, the iterates of RAMDA attain the ideal structure induced by the regularizer at the stationary point of asymptotic convergence. This structure is locally optimal near the point of convergence, so RAMDA is guaranteed to obtain the best structure possible among all methods converging to the same point, making it the first regularized adaptive method that outputs models with outstanding predictive performance while being (locally) optimally structured. Extensive numerical experiments on large-scale modern computer vision, language modeling, and speech tasks show that the proposed RAMDA is efficient and consistently outperforms the state of the art for training structured neural networks. An implementation of our algorithm is available at https://www.github.com/ismoptgroup/RAMDA/.
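The abstract notes that the subproblem for the update direction combines a nonsmooth regularizer with a diagonal preconditioner and therefore has no closed-form solution in general. As a rough illustration only, the sketch below solves such a subproblem inexactly with proximal-gradient iterations, assuming a group-LASSO regularizer, a simple diagonal quadratic term, and a naive stopping rule; the function names, objective form, and termination test are assumptions for exposition, not the paper's actual subproblem, inexactness condition, or solver.

```python
import numpy as np

def group_prox(z, step, lam, groups):
    """Proximal operator of step * lam * sum_g ||w_g||_2 (group-LASSO)."""
    w = z.copy()
    for g in groups:
        norm = np.linalg.norm(z[g])
        # Block soft-thresholding: shrink each group toward zero, possibly to exactly zero.
        shrink = max(0.0, 1.0 - step * lam / norm) if norm > 0 else 0.0
        w[g] = shrink * z[g]
    return w

def inexact_subproblem_solver(v, d, lam, groups, max_iter=100, tol=1e-6):
    """Approximately minimize
        <v, w> + 0.5 * sum_i d_i * w_i**2 + lam * sum_g ||w_g||_2
    by proximal-gradient iterations, returning early once successive
    iterates stop changing (an illustrative inexactness test only)."""
    w = np.zeros_like(v, dtype=float)
    step = 1.0 / np.max(d)              # the quadratic part is max(d)-smooth
    for _ in range(max_iter):
        grad = v + d * w                # gradient of the smooth quadratic part
        w_new = group_prox(w - step * grad, step, lam, groups)
        if np.linalg.norm(w_new - w) <= tol * max(1.0, np.linalg.norm(w)):
            return w_new
        w = w_new
    return w

# Toy usage: six parameters split into three groups of two.
v = np.array([0.3, -0.4, 0.05, 0.02, 1.0, -1.2])
d = np.ones(6)
groups = [np.arange(0, 2), np.arange(2, 4), np.arange(4, 6)]
w_star = inexact_subproblem_solver(v, d, lam=0.1, groups=groups)
print(w_star)   # the middle group (small entries of v) is driven exactly to zero
```

The paper's actual solver and inexactness criterion differ from this toy version; the point of the sketch is only that a few cheap proximal-gradient steps can produce an approximate subproblem solution whose group-sparsity pattern reflects the structure induced by the regularizer.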

