- The paper presents Weight Standardization (WS) and Batch-Channel Normalization (BCN), two techniques that smooth the loss landscape and stabilize training under micro-batch conditions.
- Weight Standardization standardizes convolutional weights to zero mean and unit variance, lowering the Lipschitz constants of the loss and its gradients and accelerating convergence in deep networks.
- Batch-Channel Normalization combines batch and channel statistics to keep parameters away from elimination singularities and improve performance in memory-constrained settings.
Micro-Batch Training with Batch-Channel Normalization and Weight Standardization
This paper addresses limitations of deep network training in micro-batch settings, where memory constraints allow each GPU to process only a few images at a time. Conventional Batch Normalization (BN) has been pivotal for efficient deep network training, but its effectiveness degrades when batch statistics are estimated from so few samples. The paper introduces Weight Standardization (WS) and Batch-Channel Normalization (BCN), two approaches designed to carry two core benefits of BN, smoothing of the loss landscape and avoidance of harmful elimination singularities, over to micro-batch training.
Weight Standardization (WS)
Weight Standardization standardizes the weights of convolutional layers to zero mean and unit variance, which smooths the loss landscape by lowering the Lipschitz constants of the loss function and its gradients. This stabilizes optimization and leads to faster, more reliable convergence. The paper supports WS theoretically by showing that it reduces these Lipschitz constants, and empirically by demonstrating faster training and improved performance across a range of vision tasks.
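A minimal PyTorch-style sketch of this operation is shown below. The wrapper class name WSConv2d and the epsilon value are illustrative assumptions, not the authors' reference implementation; the key step is that each output filter's weights are standardized before the convolution is applied.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Conv2d with Weight Standardization: the weights of each output
    filter are standardized to zero mean and unit variance on the fly,
    before the convolution is applied."""

    def forward(self, x):
        w = self.weight
        # Statistics are computed per output filter, i.e. over the
        # (in_channels, kernel_h, kernel_w) dimensions.
        mean = w.mean(dim=[1, 2, 3], keepdim=True)
        std = w.std(dim=[1, 2, 3], keepdim=True) + 1e-5  # eps for stability
        w_hat = (w - mean) / std
        return F.conv2d(x, w_hat, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```

Because the standardization is applied to the weights rather than the activations, it adds negligible cost at inference time and is independent of the batch size.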
Batch-Channel Normalization (BCN)
Batch-Channel Normalization extends traditional normalization by combining a batch-normalization step based on estimated activation statistics with a channel normalization in convolutional layers. Unlike batch-only or channel-only techniques, BCN combines the two, keeping network parameters at a safe distance from the elimination singularities that often impede deep network training. This property is vital in micro-batch scenarios, where relying on estimated (running) statistics rather than noisy per-batch statistics keeps training stable and improves results.
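The following PyTorch-style sketch illustrates one way such a layer could be assembled for micro-batch training: a batch step that normalizes with running estimates (updated from each micro-batch but treated as constants), followed by a channel (group) normalization. The class name MicroBatchBCN2d and the defaults for num_groups, momentum, and epsilon are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class MicroBatchBCN2d(nn.Module):
    """Sketch of Batch-Channel Normalization for micro-batches:
    normalize with running estimates of mean/variance, apply an affine
    transform, then apply channel (group) normalization on top."""

    def __init__(self, num_channels, num_groups=32, eps=1e-5, momentum=0.1):
        super().__init__()
        self.eps, self.momentum = eps, momentum
        self.gamma = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.register_buffer("running_mean", torch.zeros(1, num_channels, 1, 1))
        self.register_buffer("running_var", torch.ones(1, num_channels, 1, 1))
        # num_channels must be divisible by num_groups.
        self.gn = nn.GroupNorm(num_groups, num_channels, eps=eps)

    def forward(self, x):
        if self.training:
            # Update the running estimates from the current micro-batch,
            # but normalize with the estimates rather than the noisy
            # micro-batch statistics themselves.
            with torch.no_grad():
                mean = x.mean(dim=[0, 2, 3], keepdim=True)
                var = x.var(dim=[0, 2, 3], keepdim=True)
                self.running_mean.mul_(1 - self.momentum).add_(self.momentum * mean)
                self.running_var.mul_(1 - self.momentum).add_(self.momentum * var)
        x = (x - self.running_mean) / torch.sqrt(self.running_var + self.eps)
        x = self.gamma * x + self.beta
        return self.gn(x)

# Usage sketch: bcn = MicroBatchBCN2d(64, num_groups=32)
```

Decoupling the batch statistics from the current micro-batch is what allows the batch step to remain useful when only one or two images are available per GPU.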
Empirical Evaluation
Comprehensive experiments on key computer vision benchmarks, ImageNet for image classification and COCO for object detection and instance segmentation, confirm the practical advantages of the proposed methods. WS, particularly in combination with Group Normalization (GN), matches the training results of BN, which traditionally relies on larger batch sizes. Adding BCN improves performance further, especially for models trained with micro-batches. These results underline the potential of WS and BCN for training deep neural networks efficiently when computational resources are limited.
Implications and Future Developments
The paper's findings carry significant implications for developing AI systems in memory-constrained computational settings. By providing normalization that does not rely on large batch sizes, this research enables more efficient deployment of high-performing models across computer vision applications. WS and BCN also offer general tools for smoothing loss landscapes and avoiding singularities, which are promising directions for broader AI systems, including natural language processing and generative modeling.
In conclusion, this paper contributes normalization techniques that carry the benefits of large-batch training into micro-batch settings, providing a framework for future research on making deep learning more adaptable and efficient across diverse computational environments.