- The paper introduces signal propagation plots (SPPs) to visualize signal evolution and diagnose issues like explosion or attenuation in deep networks.
- It presents scaled weight standardization to maintain activation variance without relying on batch-level normalization.
- Empirical results on ResNet and RegNet architectures show that NF-ResNets achieve competitive performance, especially with small batch sizes.
Overview of ResNets without Normalization Layers
This paper investigates the development of deep residual networks (ResNets) that eschew activation normalization layers, such as Batch Normalization (BatchNorm). While BatchNorm has become a pervasive component in state-of-the-art image classifiers, it introduces several practical drawbacks. These include dependence on batch size, additional computational overhead, and an inconsistency between training and inference behavior. To address these concerns, the authors propose Normalizer-Free ResNets (NF-ResNets) and introduce novel methods to ensure efficient signal propagation throughout the network.
Key Contributions
The paper's contributions can be summarized as follows:
- Signal Propagation Plots (SPPs): The authors introduce SPPs, a visualization tool that tracks statistics of the hidden activations (such as the average channel variance and the average squared channel mean) at the output of each residual block. SPPs show how signals evolve with depth and help diagnose issues such as signal explosion or attenuation, particularly in networks without normalization; a minimal sketch of how these statistics can be gathered appears after this list.
- Scaled Weight Standardization: A modification of Weight Standardization is proposed to resolve issues observed in unnormalized networks. Scaled Weight Standardization keeps layer outputs variance-preserving at initialization and suppresses the mean shift that non-zero-mean activations such as ReLU or Swish would otherwise induce once batch-level normalization is removed; a sketch appears under Methodological Insights below.
- Implementation in ResNet and RegNet Architectures: The authors validate their methods on both ResNet and RegNet architectures, achieving performance on par with or better than their batch-normalized counterparts. They highlight the competitiveness of NF-ResNets across various network depths and FLOP budgets, and emphasize their advantage in scenarios with small batch sizes.
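To illustrate how SPP statistics might be gathered in practice, here is a minimal PyTorch sketch: the network is fed unit-Gaussian input (as in the paper) and forward hooks record the average channel variance and the average squared channel mean at the output of each residual block. The module-name filter, input shape, and hook placement are illustrative assumptions rather than the authors' reference implementation.

```python
# Minimal sketch of Signal Propagation Plot (SPP) statistics in PyTorch.
# Assumption: residual blocks can be located by the substring "block" in
# their module names; adapt the filter to the model at hand.
import torch

def collect_spp_stats(model, input_shape=(64, 3, 224, 224)):
    stats = []

    def hook(module, inputs, output):
        # Reduce over batch and spatial dims (NCHW assumed), then average over channels.
        avg_channel_var = output.var(dim=(0, 2, 3)).mean().item()
        avg_sq_channel_mean = output.mean(dim=(0, 2, 3)).pow(2).mean().item()
        stats.append((avg_channel_var, avg_sq_channel_mean))

    handles = [m.register_forward_hook(hook)
               for name, m in model.named_modules()
               if "block" in name]  # assumed naming convention for residual blocks
    model.eval()
    with torch.no_grad():
        model(torch.randn(*input_shape))  # unit-Gaussian input, as in the paper
    for h in handles:
        h.remove()
    return stats  # one (variance, squared mean) pair per residual block
```

Plotting these pairs against block index reproduces the kind of depth-wise signal picture the paper uses to compare normalized and unnormalized variants.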
Methodological Insights
The research extends the theoretical understanding of signal propagation in ResNets, proposing a scheme that compensates for the removal of normalization layers at initialization. Central to the approach is an analytical estimate of how the activation variance evolves during the forward pass: the input to each residual branch is divided by the predicted standard deviation of the incoming signal, and the branch output is scaled by a small constant, so the variance after every skip connection grows in a controlled, predictable way. Scaled Weight Standardization keeps the individual layers variance-preserving, offering an alternative to batch-level normalization while ensuring identical behavior during training and inference. The scheme builds on well-studied techniques such as He initialization and SkipInit, which keeps it practical to adopt.
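To make the variance bookkeeping concrete, the following sketch applies the residual update described above, h_{i+1} = h_i + α·f(h_i / β_i), where β_i is the analytically predicted standard deviation of the block input and α sets the rate of variance growth. Treating each branch f as variance-preserving (which Scaled Weight Standardization is designed to ensure) is an assumption of this simplified sketch, and the function name is illustrative.

```python
# Minimal sketch of normalizer-free residual scaling with analytic variance tracking.
import torch

def nf_residual_stage(x, branches, alpha=0.2, expected_var=1.0):
    for f in branches:
        beta = expected_var ** 0.5              # predicted std of the block input
        x = x + alpha * f(x / beta)             # downscale input, rescale branch output
        expected_var = expected_var + alpha**2  # analytic variance after the skip addition
    return x, expected_var                      # reset expected_var to 1 after a transition block
```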
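Similarly, here is a hedged sketch of a convolution with Scaled Weight Standardization, assuming a ReLU non-linearity (for which the paper derives the gain sqrt(2 / (1 - 1/π))): each output channel's weights are standardized over their fan-in and rescaled so the layer preserves variance at initialization. The class name and the small epsilon are illustrative choices rather than the authors' reference code.

```python
# Minimal sketch of a Conv2d layer with Scaled Weight Standardization.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledWSConv2d(nn.Conv2d):
    def forward(self, x):
        w = self.weight
        fan_in = w[0].numel()                                    # (in_channels / groups) * kH * kW
        mean = w.mean(dim=(1, 2, 3), keepdim=True)               # per-output-channel mean
        var = w.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        gain = math.sqrt(2.0 / (1.0 - 1.0 / math.pi))            # ReLU-specific gain from the paper
        w = gain * (w - mean) / torch.sqrt(var * fan_in + 1e-4)  # standardize, then rescale
        return F.conv2d(x, w, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```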
Experimental Validation and Results
Empirical experiments on the ImageNet benchmark reveal that NF-ResNets, with signal propagation carefully controlled via Scaled Weight Standardization and careful architecture design, achieve test accuracies comparable to those of batch-normalized networks. Notably, improvements are observed in scenarios with constrained computational resources or small batch sizes. The NF-RegNet architecture designed in this paper is competitive with EfficientNet models at similar computational budgets, underscoring the promise of normalization-free approaches.
Implications and Future Directions
The results of this paper have broad implications for the future of deep learning model design. By providing a viable alternative to BatchNorm and similar algorithms, NF-ResNets reduce reliance on large batch sizes and allow for more consistent training across varying hardware configurations. Furthermore, the successes achieved with NF-RegNet suggest that future work exploring optimization and architecture design specifically for normalization-free networks could yield significant advancements in model efficiency and performance.
In summary, this paper presents a thorough exploration of designing deep networks without traditional activation normalization. It opens avenues for revisiting long-held assumptions about normalization dependencies, with the potential to lead to simpler, more robust, and hardware-friendly model architectures in the future.