diffGrad: An Optimization Method for Convolutional Neural Networks
The paper presents diffGrad, a novel optimization method aimed at improving the training of Convolutional Neural Networks (CNNs) by addressing limitations of existing gradient-descent-based optimizers, particularly adaptive methods such as Adam. diffGrad adapts each parameter's learning rate according to recent changes in its gradient, providing a dynamic adjustment mechanism that improves convergence behavior and stability on difficult optimization landscapes.
Overview of diffGrad
The core innovation of diffGrad is its use of the difference between the current and the immediately preceding gradient, which allows learning rates to be adapted in a way that reflects the current stage of optimization more accurately. Where optimizers such as Adam and AMSGrad fail to exploit short-term changes in gradient behavior, diffGrad tailors its step-size adjustments to these changes. This mechanism is crucial for mitigating overshooting near an optimum and for dealing with the saddle points that frequently occur in the high-dimensional, non-convex optimization landscapes typical of neural network training.
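To make the mechanism concrete, the following is a minimal NumPy sketch of a diffGrad-style update for a single parameter vector. The function name diffgrad_step, the hyperparameter defaults, and the convention of passing the previous gradient in explicitly are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def diffgrad_step(theta, grad, prev_grad, m, v, t,
                  lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One diffGrad-style update for a parameter vector theta at step t."""
    # Adam-style exponential moving averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2

    # Bias correction, as in Adam.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)

    # diffGrad friction coefficient (DFC): a sigmoid of the absolute change
    # between the previous and current gradients. It stays near 1 when the
    # gradient changes quickly and shrinks toward 0.5 when it changes slowly
    # (e.g., close to an optimum or in a flat/saddle region).
    dfc = 1.0 / (1.0 + np.exp(-np.abs(prev_grad - grad)))

    # The DFC modulates the Adam step element-wise.
    theta = theta - lr * dfc * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

The caller is responsible for carrying m, v, and the previous gradient between steps; setting dfc to 1 everywhere recovers the standard Adam update.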
Technical Contributions
- diffGrad Friction Coefficient (DFC): The introduction of a friction coefficient computed from the absolute difference between consecutive gradient values, ξ_t = 1 / (1 + exp(−|g_{t−1} − g_t|)), i.e., a sigmoid of the absolute gradient change. This coefficient modulates the effective learning rate, damping it when gradient changes are small (typically near convergence) and leaving it largely intact when the optimization trajectory is unstable or far from convergence.
- Adaptive Step Sizing: The adaptive mechanism of diffGrad ensures that parameters undergoing rapid gradient changes are updated with larger steps, while parameters with slower-changing gradients are adjusted more conservatively, enhancing the optimizer's efficiency and stability.
- Convergence Analysis: The authors provide a theoretical analysis of diffGrad's convergence using the regret-bound approach from the online learning framework, demonstrating an O(√T) regret bound, which is on par with contemporary adaptive gradient methods such as Adam and AMSGrad.
- Empirical Validation: Empirical analysis on synthetic non-convex functions, together with image-classification experiments on CIFAR benchmarks using standard CNN architectures, shows diffGrad to be better at avoiding local minima and achieving stable convergence than Adam.
- Implementation and Availability: The implementation of diffGrad has been made publicly accessible, allowing for its integration into further research and practical applications in deep learning.
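To illustrate how such an optimizer slots into an existing training pipeline, the sketch below shows a diffGrad-style optimizer written as a standard PyTorch torch.optim.Optimizer subclass. The class name DiffGrad and its state layout are assumptions made here for illustration; they are not taken from the authors' released code.

```python
import torch
from torch.optim import Optimizer

class DiffGrad(Optimizer):
    """Adam-style optimizer whose step is scaled by a gradient-difference friction term."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        super().__init__(params, dict(lr=lr, betas=betas, eps=eps))

    @torch.no_grad()
    def step(self, closure=None):
        loss = closure() if closure is not None else None
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad
                state = self.state[p]
                if len(state) == 0:
                    state["step"] = 0
                    state["exp_avg"] = torch.zeros_like(p)     # first moment
                    state["exp_avg_sq"] = torch.zeros_like(p)  # second moment
                    state["prev_grad"] = torch.zeros_like(p)   # g_{t-1}
                state["step"] += 1
                t = state["step"]

                exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
                exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

                # Friction coefficient from the change between consecutive gradients.
                dfc = torch.sigmoid((state["prev_grad"] - grad).abs())
                state["prev_grad"] = grad.clone()

                denom = (exp_avg_sq / (1 - beta2 ** t)).sqrt().add_(group["eps"])
                step_size = group["lr"] / (1 - beta1 ** t)
                p.addcdiv_(dfc * exp_avg, denom, value=-step_size)
        return loss

# Usage mirrors any built-in optimizer:
#   optimizer = DiffGrad(model.parameters(), lr=1e-3)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```

Because the only addition over Adam in this sketch is the per-parameter prev_grad buffer and the sigmoid scaling, the extra memory and compute overhead is modest.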
Implications and Future Directions
The introduction of diffGrad has significant implications for the training of deep neural networks, especially in settings where saddle points and local minima pose substantial challenges. The method's ability to dynamically adjust learning rates based on immediate past gradient information gives practitioners a robust tool for improving model convergence and generalization.
Practically, diffGrad could see application across various domains within artificial intelligence, such as image processing, natural language processing, and even robotics, where complex non-convex optimization tasks are prevalent. The proposed methodology is also versatile enough to be explored further within other neural architectures, including those not directly assessed in this paper, such as Recurrent Neural Networks (RNNs) or Transformer models.
Conclusion
By addressing critical limitations of existing optimization techniques, diffGrad emerges as a promising method for enhancing the efficacy of CNN training. Combining gradient-difference information with adaptive moment estimation offers a pathway to more stable and efficient learning behavior, advancing optimization strategies in deep learning. Future research could explore broader applications of this optimization method across diverse neural network architectures and further investigate its theoretical properties and potential enhancements.