Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks (1603.01431v6)

Published 4 Mar 2016 in stat.ML and cs.LG

Abstract: While the authors of Batch Normalization (BN) identify and address an important problem involved in training deep networks-- Internal Covariate Shift-- the current solution has certain drawbacks. Specifically, BN depends on batch statistics for layerwise input normalization during training which makes the estimates of mean and standard deviation of input (distribution) to hidden layers inaccurate for validation due to shifting parameter values (especially during initial training epochs). Also, BN cannot be used with batch-size 1 during training. We address these drawbacks by proposing a non-adaptive normalization technique for removing internal covariate shift, that we call Normalization Propagation. Our approach does not depend on batch statistics, but rather uses a data-independent parametric estimate of mean and standard-deviation in every layer thus being computationally faster compared with BN. We exploit the observation that the pre-activation before Rectified Linear Units follow Gaussian distribution in deep networks, and that once the first and second order statistics of any given dataset are normalized, we can forward propagate this normalization without the need for recalculating the approximate statistics for hidden layers.

Authors (4)
  1. Devansh Arpit (31 papers)
  2. Yingbo Zhou (81 papers)
  3. Bhargava U. Kota (1 paper)
  4. Venu Govindaraju (22 papers)
Citations (124)

Summary

Overview of Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks

The paper "Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks" by Devansh Arpit, Yingbo Zhou, Bhargava U. Kota, and Venu Govindaraju introduces Normalization Propagation (NormProp), a method that addresses the Internal Covariate Shift (ICS) problem in deep neural networks. This technique is proposed as an alternative to Batch Normalization (BN), aiming to resolve the limitations associated with BN, particularly for small batch sizes.

Internal Covariate Shift and Batch Normalization

The ICS problem refers to the shifting distribution of inputs to hidden layers in deep networks, which can slow training convergence. BN addresses ICS by normalizing each hidden layer's inputs with mini-batch statistics, but this approach has two drawbacks. First, the reliance on batch statistics can make the estimated input distribution inaccurate at validation time, especially during the early training epochs. Second, BN cannot be used with a batch size of one during training.
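
To make the batch-size dependence concrete, here is a minimal, hypothetical NumPy sketch of a training-mode BN forward pass (our illustration, not the paper's or any library's code): the normalization constants are recomputed from each mini-batch, so the transform shifts with the batch composition, and with a batch of one the variance is zero and the normalized activation degenerates.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training-mode BN on a mini-batch x of shape (batch_size, features).

    Mean and variance are estimated from the current mini-batch, so the
    transform changes as parameters and batch composition change. With
    batch_size == 1 the variance is zero and the output collapses to beta.
    """
    mu = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize with batch statistics
    return gamma * x_hat + beta              # learnable scale and shift
```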

Normalization Propagation Approach

NormProp introduces a non-adaptive, computationally efficient normalization method that does not rely on batch statistics. Instead, it leverages the assumption that pre-activation inputs to layers follow a Gaussian distribution. By utilizing closed-form data-independent estimates of means and variances, calculated under the assumption of Gaussian-distributed pre-activations, this normalization can be propagated throughout the network layers.
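
To make the "data-independent parametric estimate" concrete: for a standard Gaussian input z ~ N(0, 1), the ReLU output max(0, z) has mean 1/sqrt(2*pi) and variance (1/2)(1 - 1/pi) in closed form. The short NumPy check below is our illustration (not from the paper) and simply verifies these constants by sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)   # z ~ N(0, 1)
y = np.maximum(z, 0.0)               # ReLU

# Closed-form statistics of ReLU(z) for z ~ N(0, 1)
mean_closed = 1.0 / np.sqrt(2.0 * np.pi)   # ~0.3989
var_closed = 0.5 * (1.0 - 1.0 / np.pi)     # ~0.3408

print(y.mean(), mean_closed)   # both ~0.399
print(y.var(), var_closed)     # both ~0.341
```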

Key features of NormProp include:

  • Data Normalization: The method starts by normalizing the dataset's global mean and standard deviation, which allows flexible batch sizes, including training with a batch size of one.
  • Parametric Layer-wise Normalization: Unlike BN, NormProp applies a predetermined parametric normalization that does not require recalculating statistics at every layer, improving computational efficiency (a minimal layer sketch follows this list).
  • Jacobian Regularity: NormProp ensures that the Jacobians of the transformations between layers have singular values close to one, which helps in maintaining gradient magnitudes and facilitates efficient training.
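
Putting these pieces together, the sketch below shows what one NormProp-style fully-connected ReLU layer could look like in NumPy. It is an illustration under the paper's assumptions (inputs already normalized to zero mean and unit variance per dimension, pre-activations approximately Gaussian) and omits the learnable scale and shift the paper includes; the name `normprop_dense` is ours, not the authors'.

```python
import numpy as np

# Closed-form statistics of ReLU applied to z ~ N(0, 1).
RELU_MEAN = 1.0 / np.sqrt(2.0 * np.pi)
RELU_STD = np.sqrt(0.5 * (1.0 - 1.0 / np.pi))

def normprop_dense(x, W):
    """Illustrative NormProp-style dense + ReLU layer (no batch statistics).

    x: (batch, in_features), assumed ~zero-mean, unit-variance per dimension.
    W: (out_features, in_features) weight matrix.
    """
    # Divide each unit's pre-activation by the L2 norm of its weight row so
    # the pre-activation stays approximately N(0, 1) under the assumptions.
    row_norms = np.linalg.norm(W, axis=1)        # (out_features,)
    pre = x @ W.T / row_norms                    # (batch, out_features)
    post = np.maximum(pre, 0.0)                  # ReLU
    # Re-standardize with the fixed constants so the normalization
    # "propagates" to the next layer without recomputing any statistics.
    return (post - RELU_MEAN) / RELU_STD
```

Because the normalization uses only fixed constants and weight norms, the same forward pass runs unchanged with a batch size of one and needs no separate statistics at validation time.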

Empirical Evaluation

The authors conducted experiments on CIFAR-10, CIFAR-100, and SVHN using variants of the Network-in-Network architecture. NormProp matched or exceeded BN's performance, especially with smaller batch sizes, and training with NormProp was also computationally faster, demonstrating practical benefits in both accuracy and efficiency.

Implications and Future Prospects

The implications of this research are noteworthy:

  • Practical Benefits: NormProp offers an efficient solution for networks where batch size is limited or computational resources are constrained.
  • Theoretical Insights: It extends the understanding of normalization techniques in deep learning, suggesting efficiency gains through proper initialization and strategic normalization without reliance on dynamic batch statistics.

Future directions could explore extending NormProp to activation functions beyond ReLU and evaluating its utility in architectures beyond convolutional networks. Further optimizations may be possible by refining the data-independent estimates or integrating adaptive components while retaining the method's independence from batch statistics.

Overall, this work contributes a valuable technique to the arsenal of deep learning methods, providing an avenue for consistent and efficient neural network training in varied settings.
