Asymmetric Momentum: A Rethinking of Gradient Descent (2309.02130v2)

Published 5 Sep 2023 in cs.LG

Abstract: Unlike existing adaptive methods such as Adam, which penalize frequently-changing parameters and are therefore suited only to sparse gradients, we propose, with theoretical and experimental validation, the simplest SGD-based enhancement: Loss-Controlled Asymmetric Momentum (LCAM). By averaging the loss, we divide the training process into different loss phases and apply a different momentum in each. The method can not only accelerate slow-changing parameters for sparse gradients, as adaptive optimizers do, but can also choose to accelerate frequently-changing parameters for non-sparse gradients, making it adaptable to all types of datasets. We reinterpret the machine learning training process through the concepts of weight coupling and weight traction, and experimentally validate that weights have directional specificity correlated with the specificity of the dataset. Interestingly, we observe that with non-sparse gradients, frequently-changing parameters should in fact be accelerated, which is the opposite of the traditional adaptive perspective. Compared to traditional SGD with momentum, the algorithm separates the weights without additional computational cost. Notably, the method relies on the network's ability to extract complex features. We primarily use Wide Residual Networks for our research, employing the classic CIFAR-10 and CIFAR-100 datasets to test the ability for feature separation, and we report phenomena that are more important than accuracy rates alone. Finally, compared to classic SGD tuning methods, using WRN on these two datasets and with nearly half the training epochs, we achieve equal or better test accuracy.
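
The abstract only sketches the mechanism, so the snippet below is a rough, hypothetical illustration of the loss-phase idea rather than the paper's actual LCAM update rule: a running average of the training loss splits training into two phases, and a different momentum coefficient is used in each phase of an otherwise plain SGD-with-momentum update. The class name, the two momentum values, the averaging factor, and the above/below-average phase criterion are all illustrative assumptions, not values from the paper.

```python
import torch

class LossPhaseSGD(torch.optim.Optimizer):
    """Sketch of SGD with momentum whose coefficient is switched by a running loss average."""

    def __init__(self, params, lr=0.1, momentum_low=0.8, momentum_high=0.95):
        defaults = dict(lr=lr, momentum_low=momentum_low, momentum_high=momentum_high)
        super().__init__(params, defaults)
        self.loss_avg = None  # running average of the training loss

    @torch.no_grad()
    def step(self, loss):
        # Simplification: step() takes the scalar loss directly (instead of the
        # usual optional closure) so the phase can be decided for this update.
        loss_val = float(loss)
        self.loss_avg = loss_val if self.loss_avg is None else 0.9 * self.loss_avg + 0.1 * loss_val
        high_loss_phase = loss_val >= self.loss_avg  # crude two-phase split (assumption)

        for group in self.param_groups:
            mu = group["momentum_high"] if high_loss_phase else group["momentum_low"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "momentum_buffer" not in state:
                    state["momentum_buffer"] = torch.zeros_like(p)
                buf = state["momentum_buffer"]
                buf.mul_(mu).add_(p.grad)        # v <- mu * v + g
                p.add_(buf, alpha=-group["lr"])  # w <- w - lr * v
```

In a training loop one would call opt.zero_grad(), loss.backward(), and then opt.step(loss) each iteration; how LCAM actually separates weight groups and decides which group to accelerate is defined in the paper, not in this sketch.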

References (12)
  1. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.
  2. RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
  3. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  4. On the Convergence of Adam and Beyond. International Conference on Learning Representations, 2018.
  5. Decoupled Weight Decay Regularization. International Conference on Learning Representations, 2018.
  6. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Advances in Neural Information Processing Systems, 27, 2014.
  7. A stochastic approximation method. The Annals of Mathematical Statistics.
  8. SGDR: Stochastic gradient descent with warm restarts. International Conference on Learning Representations, 2016.
  9. Wide residual networks. Proceedings of the British Machine Vision Conference, 2016.
  10. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778, 2016.
  11. Learning multiple layers of features from tiny images. Citeseer, 2009.
  12. Improving generalization performance by switching from Adam to SGD. arXiv preprint arXiv:1712.07628, 2017.
Authors (6)
  1. Gongyue Zhang (2 papers)
  2. Dinghuang Zhang (1 paper)
  3. Shuwen Zhao (1 paper)
  4. Donghan Liu (8 papers)
  5. Carrie M. Toptan (1 paper)
  6. Honghai Liu (24 papers)
Citations (1)