Gradient Centralization: A New Optimization Technique for Deep Neural Networks (2004.01461v2)

Published 3 Apr 2020 in cs.CV

Abstract: Optimization techniques are of great importance to effectively and efficiently train a deep neural network (DNN). It has been shown that using the first and second order statistics (e.g., mean and variance) to perform Z-score standardization on network activations or weight vectors, such as batch normalization (BN) and weight standardization (WS), can improve the training performance. Different from these existing methods that mostly operate on activations or weights, we present a new optimization technique, namely gradient centralization (GC), which operates directly on gradients by centralizing the gradient vectors to have zero mean. GC can be viewed as a projected gradient descent method with a constrained loss function. We show that GC can regularize both the weight space and output feature space so that it can boost the generalization performance of DNNs. Moreover, GC improves the Lipschitzness of the loss function and its gradient so that the training process becomes more efficient and stable. GC is very simple to implement and can be easily embedded into existing gradient based DNN optimizers with only one line of code. It can also be directly used to fine-tune the pre-trained DNNs. Our experiments on various applications, including general image classification, fine-grained image classification, detection and segmentation, demonstrate that GC can consistently improve the performance of DNN learning. The code of GC can be found at https://github.com/Yonghongwei/Gradient-Centralization.

Authors (4)
  1. Hongwei Yong (12 papers)
  2. Jianqiang Huang (62 papers)
  3. Xiansheng Hua (26 papers)
  4. Lei Zhang (1689 papers)
Citations (173)

Summary

  • The paper introduces Gradient Centralization, a method that centralizes gradient vectors to zero mean, smoothing the optimization landscape.
  • It demonstrates that GC acts as a regularizer, enhancing the Lipschitz continuity of the loss function and improving DNN generalization.
  • Empirical results across benchmarks show that GC boosts training efficiency and model performance in tasks like image classification and detection.

Gradient Centralization: A New Optimization Technique for Deep Neural Networks

The paper introduces a novel optimization method called Gradient Centralization (GC) for deep neural networks (DNNs). Unlike traditional techniques that operate on activations or weights, GC refines the training process by directly centralizing gradient vectors to have zero mean. The approach is positioned as complementary to existing methods such as Batch Normalization (BN) and Weight Standardization (WS), which regularize other parts of the training pipeline.
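
Concretely, centralization amounts to subtracting from each weight vector's gradient the mean of that gradient's components. The display below is a minimal sketch in our own notation (the symbols are chosen for illustration and are not copied from the paper):

```latex
% Sketch of the GC operator for a weight vector w_i whose gradient has M components.
% Notation is ours; the paper's formal definition may differ in detail.
\Phi_{\mathrm{GC}}\!\left(\nabla_{w_i}\mathcal{L}\right)
  = \nabla_{w_i}\mathcal{L} - \mu_{\nabla_{w_i}\mathcal{L}},
\qquad
\mu_{\nabla_{w_i}\mathcal{L}} = \frac{1}{M}\sum_{j=1}^{M} \nabla_{w_{i,j}}\mathcal{L}
```

Equivalently, the operator is multiplication by P = I - (1/M) e e^T, with e the all-ones vector, which is the form behind the projected-gradient-descent view mentioned in the abstract.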

Core Contributions and Findings

  1. Optimization Technique:
    • GC operates directly on gradients, centralizing each gradient vector to have zero mean. This operation can be interpreted as projected gradient descent on a constrained loss function, which smooths the optimization landscape and stabilizes training.
    • GC is straightforward to implement, requiring only a one-line modification to existing gradient-based optimizers, making it both effective and efficient (a sketch of this change follows the list).
  2. Regularization Effects:
    • GC acts as a regularizer by constraining the weight space and output feature space. This constraint potentially improves the generalization capability of DNNs.
    • Theoretical analysis shows that GC improves the Lipschitzness of the loss function and its gradient, which underpins the claimed gains in training stability and efficiency (a short projection-norm check follows the list).
  3. Empirical Validation:
    • Experimental results across various benchmarks show GC's consistent improvement in training efficiency and model performance, including tasks like image classification, detection, and segmentation.
    • The fine-grained image classification experiments, for instance, show notable performance gains, underscoring GC's practical utility.
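
To make the "one line of code" claim from the abstract concrete, here is a minimal PyTorch-style sketch of the change (our own illustration, not the authors' released optimizer from the linked repository): the gradient mean is subtracted from every multi-dimensional weight gradient just before the optimizer update.

```python
import torch

def centralize_gradients(model):
    """Subtract the per-output-channel mean from each weight gradient.

    Applied only to tensors with more than one dimension (e.g., conv and
    linear weights); biases and other 1-D parameters are left unchanged.
    This is a sketch, not the paper's official implementation.
    """
    for p in model.parameters():
        if p.grad is not None and p.grad.dim() > 1:
            # Average over all axes except the output-channel axis (dim 0),
            # then subtract so each gradient slice has zero mean.
            dims = tuple(range(1, p.grad.dim()))
            p.grad -= p.grad.mean(dim=dims, keepdim=True)

# Usage inside an ordinary training step:
model = torch.nn.Linear(10, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

x, y = torch.randn(8, 10), torch.randn(8, 4)
loss = torch.nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()
centralize_gradients(model)  # the centralization step, applied before the update
optimizer.step()
```

The abstract's one-line claim refers to embedding exactly this subtraction inside an existing optimizer's update step; the sketch keeps it outside the optimizer only for readability.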

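The regularization claim in point 2 also admits a quick sanity check from the projection view (our own derivation, not a restatement of the paper's theorems): because P = I - (1/M) e e^T is a symmetric idempotent matrix, the centralized gradient is never longer than the original one.

```latex
% P is an orthogonal projection (P^T = P, P^2 = P), so for any gradient g:
\left\lVert \Phi_{\mathrm{GC}}(g) \right\rVert_2
  = \left\lVert P\,g \right\rVert_2
  \le \left\lVert g \right\rVert_2
```

This bound is consistent with, though weaker than, the Lipschitzness results stated in the paper.
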
Implications and Future Directions

The paper's findings suggest several key implications for the development of optimization techniques in DNNs:

  • Broader Applicability: Because GC integrates seamlessly with popular optimizers such as SGDM and Adam, it can be applied across different models and tasks without major architectural changes (see the hook-based sketch after this list).
  • Complementary Usage: GC’s compatibility with existing techniques such as BN and WS allows for potentially synergistic applications, leveraging the strengths of multiple methods.
  • Future Exploration: As an extensible method, future research could explore GC's integration with emerging DNN architectures or its effects on specialized tasks such as transfer learning or meta-learning.
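
To illustrate how little plumbing that integration requires, here is a tensor-hook sketch (our own construction using standard PyTorch hooks, not the authors' code): any stock optimizer, such as Adam or SGD with momentum, then consumes already-centralized gradients with no change to the training loop.

```python
import torch

def attach_gc_hooks(model):
    """Register gradient hooks that centralize gradients automatically.

    Hypothetical helper for illustration: every multi-dimensional weight gets
    a hook that subtracts the mean from its gradient as soon as backward()
    computes it, so the choice of optimizer stays completely untouched.
    """
    def hook(grad):
        if grad.dim() > 1:
            dims = tuple(range(1, grad.dim()))
            return grad - grad.mean(dim=dims, keepdim=True)
        return grad

    for p in model.parameters():
        if p.requires_grad and p.dim() > 1:
            p.register_hook(hook)

model = torch.nn.Sequential(
    torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 4)
)
attach_gc_hooks(model)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # unchanged optimizer

x, y = torch.randn(8, 10), torch.randn(8, 4)
optimizer.zero_grad()
torch.nn.functional.mse_loss(model(x), y).backward()
optimizer.step()  # Adam sees gradients that already have zero mean
```

Attaching the hooks once at model construction keeps the training loop identical to the baseline, which is also part of what makes GC convenient for fine-tuning pre-trained models, as the abstract notes.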

Conclusion

Gradient Centralization offers a fresh perspective on DNN optimization by operating directly on gradients rather than on weights or activations. Its low implementation complexity, combined with theoretical and empirical backing, makes a compelling case for adoption in diverse machine learning applications. The paper opens avenues for further experimental and theoretical exploration, establishing GC as a valuable tool in deep learning optimization.
