Bidirectional Self-Normalization in Deep Networks
- The paper introduces BSNNs that integrate orthogonal weight matrices and Gaussian–Poincaré normalized activations to maintain constant norms in both forward and backward passes.
- It demonstrates that enforcing moment constraints on activations prevents vanishing/exploding gradients, enabling provable stability even in very deep networks.
- Empirical results on synthetic and real datasets confirm that BSNNs achieve high training accuracy without batch normalization, thanks to their robust norm-preservation properties.
Bidirectionally Self-Normalizing Neural Networks (BSNNs) constitute a class of multilayer feedforward networks that, under mild structural constraints on width and activation functions, achieve high-probability preservation of both forward activations and backward gradients across arbitrarily deep architectures. BSNNs integrate orthogonal weight matrices and a new class of scalar activation functions—termed Gaussian–Poincaré normalized (GPN) functions—to eliminate vanishing and exploding gradient pathologies in wide, deep nonlinear networks, enabling provable stability of signal propagation without batch normalization or specialized initializations (Lu et al., 2020).
1. Gaussian–Poincaré Normalized Activation Functions
Let $\phi: \mathbb{R} \to \mathbb{R}$ denote a differentiable scalar activation function. $\phi$ is said to be Gaussian–Poincaré normalized if, when its input is standard normal ($z \sim \mathcal{N}(0, 1)$), the following moment constraints are satisfied:
$$\mathbb{E}[\phi(z)^2] = 1, \qquad \mathbb{E}[\phi'(z)^2] = 1.$$
These constraints guarantee, by the Gaussian–Poincaré inequality (Bogachev 1998),
$$\operatorname{Var}[\phi(z)] \le \mathbb{E}[\phi'(z)^2],$$
that $\phi$ neither amplifies nor attenuates signal energy in expectation. Given any standard activation function $\phi$ with finite, nonzero second moments, scalars $a$ and $b$ can be efficiently computed to construct a GPN variant $\phi_{\mathrm{GPN}}(x) = a\,\phi(x) + b$ that meets both normalization conditions. Explicit coefficients for common activation functions are tabulated below:
| Activation | $a$ | $b$ |
|---|---|---|
| ReLU-GPN | 1.4142 | 0 |
| LeakyReLU-GPN | 1.4141 | 0 |
| ELU-GPN | 1.2234 | 0.0742 |
| SELU-GPN | 0.9660 | 0.2585 |
| GELU-GPN | 1.4915 | -0.9097 |
| Tanh-GPN | 1.4674 | 0.3885 |
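The coefficients can be checked numerically. The following is a minimal sketch, assuming the GPN variant has the affine form $a\,\phi(x) + b$; it uses trapezoidal integration against the Gaussian density rather than Monte-Carlo sampling, and the function names `gaussian_expectation` and `gpn_coefficients` are hypothetical. The normalization of $\phi$ is a quadratic in $b$ with two roots, and the sketch returns both because the selection rule is not stated here.

```python
import math

def gaussian_expectation(f, lo=-10.0, hi=10.0, steps=20001):
    """Approximate E[f(z)] for z ~ N(0, 1) with the trapezoidal rule."""
    h = (hi - lo) / (steps - 1)
    total = 0.0
    for i in range(steps):
        z = lo + i * h
        end_weight = 0.5 if i in (0, steps - 1) else 1.0
        total += end_weight * f(z) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)

def gpn_coefficients(phi, dphi):
    """Return a and both candidate b values for phi_gpn(x) = a * phi(x) + b."""
    m1 = gaussian_expectation(phi)                      # E[phi(z)]
    m2 = gaussian_expectation(lambda z: phi(z) ** 2)    # E[phi(z)^2]
    g2 = gaussian_expectation(lambda z: dphi(z) ** 2)   # E[phi'(z)^2]
    a = 1.0 / math.sqrt(g2)  # enforces E[phi_gpn'(z)^2] = 1
    # E[(a*phi + b)^2] = 1  <=>  b^2 + 2*a*m1*b + a^2*m2 - 1 = 0
    disc = max(1.0 - a * a * (m2 - m1 * m1), 0.0)
    root = math.sqrt(disc)
    return a, (-a * m1 + root, -a * m1 - root)

a, b_roots = gpn_coefficients(math.tanh, lambda z: 1.0 - math.tanh(z) ** 2)
print(round(a, 4), [round(b, 4) for b in b_roots])
```

For tanh the positive root agrees with the Tanh-GPN row of the table to about three decimal places. The discriminant is non-negative precisely because of the Gaussian–Poincaré inequality, so a real solution for $b$ always exists once $a$ is fixed.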
2. Network Construction and Bidirectional Self-Normalization
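BSNNs are fully-connected, bias-free networks of $L$ layers. Each layer $l = 1, \dots, L$ computes
$$h^{(l)} = W^{(l)} x^{(l-1)}, \qquad x^{(l)} = \phi(h^{(l)}),$$
where $W^{(l)} \in \mathbb{R}^{n \times n}$ is strictly or approximately orthogonal. For back-propagation, let $d^{(l)} = D^{(l)} (W^{(l+1)})^\top d^{(l+1)}$, with $D^{(l)} = \operatorname{diag}(\phi'(h^{(l)}_1), \dots, \phi'(h^{(l)}_n))$ and $d^{(L)} = D^{(L)} \, \partial E / \partial x^{(L)}$, where $E$ denotes the training loss. Bidirectional self-normalization is defined as the property that, for all intermediate layers $l = 1, \dots, L-1$,
$$\|x^{(l)}\|_2 = \|x^{(l+1)}\|_2, \qquad \|d^{(l)}\|_2 = \|d^{(l+1)}\|_2.$$
This ensures constant $\ell_2$ norm for both forward activations and backward error signals, eliminating depth-dependent signal decay or explosion. A key proposition follows: under bidirectional self-normalization, all weight-gradient Frobenius norms are equal, i.e.,
$$\left\|\frac{\partial E}{\partial W^{(1)}}\right\|_F = \left\|\frac{\partial E}{\partial W^{(2)}}\right\|_F = \cdots = \left\|\frac{\partial E}{\partial W^{(L)}}\right\|_F.$$
Orthogonal weight matrices preserve input norms exactly ($\|W x\|_2 = \|x\|_2$), and GPN activations ensure no expected scaling distortion, collectively enforcing bidirectional norm preservation.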
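The effect can be observed in a small untrained network. The sketch below is illustrative only: it assumes NumPy, a tanh activation with the Tanh-GPN coefficients from the table, Haar-orthogonal weights drawn by QR decomposition, and a random Gaussian vector as a stand-in for the loss gradient at the output. The helper names `haar_orthogonal` and `propagate` are hypothetical.

```python
import numpy as np

A_TANH, B_TANH = 1.4674, 0.3885  # Tanh-GPN coefficients from the table

def haar_orthogonal(n, rng):
    """Draw an n x n orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # column sign fix for a Haar sample

def propagate(n, depth, a, b, seed=0):
    """Return forward norms (input first) and backward norms (layer 1 first)."""
    rng = np.random.default_rng(seed)
    weights = [haar_orthogonal(n, rng) for _ in range(depth)]
    x = rng.standard_normal(n)
    x *= np.sqrt(n) / np.linalg.norm(x)  # place the input on the sqrt(n) sphere
    fwd, pre = [np.linalg.norm(x)], []
    for w in weights:
        h = w @ x
        pre.append(h)
        x = a * np.tanh(h) + b
        fwd.append(np.linalg.norm(x))

    def dphi(h):
        return a * (1.0 - np.tanh(h) ** 2)

    # Assumption: a random vector stands in for dE/dx at the output layer.
    d = dphi(pre[-1]) * rng.standard_normal(n)
    bwd = [np.linalg.norm(d)]
    for l in range(depth - 2, -1, -1):
        d = dphi(pre[l]) * (weights[l + 1].T @ d)
        bwd.append(np.linalg.norm(d))
    return np.array(fwd), np.array(bwd[::-1])

fwd, bwd = propagate(n=1000, depth=10, a=A_TANH, b=B_TANH)
fwd_plain, bwd_plain = propagate(n=1000, depth=10, a=1.0, b=0.0)
print("GPN   forward ratios :", np.round(fwd[1:] / fwd[:-1], 3))
print("GPN   backward ratios:", np.round(bwd[:-1] / bwd[1:], 3))
print("plain forward ratios :", np.round(fwd_plain[1:] / fwd_plain[:-1], 3))
```

With the GPN coefficients the layer-to-layer ratios should stay close to 1, while the plain tanh run shows the forward norm shrinking with depth. Passing `a=1.0, b=0.0` reuses the same code path for the unnormalized activation, which keeps the comparison like for like.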
3. High-Dimensional Probabilistic Guarantees
Under the following assumptions:
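- Inputs $x^{(0)}$ exhibit thin-shell concentration,
- Each weight matrix $W^{(l)}$ is Haar-distributed in the orthogonal group $O(n)$,
- The activation $\phi$ and its derivative $\phi'$ are Lipschitz, and $\phi$ is GPN,
- The back-propagated error vector has bounded norm,
the forward and backward norms are preserved with overwhelming probability as the width $n \to \infty$. Theorem 2 (forward norm preservation) and Theorem 3 (backward norm preservation) assert that the forward and backward norms, respectively, deviate from exact preservation across a layer by more than a small distortion $\epsilon$ only with probability that vanishes as $n$ grows. The proof leverages the sphere-concentration of high-dimensional Gaussians, sub-Gaussianity of GPN activations, and union bounds to control deviations across all $L$ layers.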
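The sphere-concentration ingredient is easy to see numerically. The sketch below, assuming NumPy and a hypothetical helper `norm_spread`, samples standard Gaussian vectors in $\mathbb{R}^n$ and reports how tightly the rescaled norm $\|x\|_2 / \sqrt{n}$ clusters around 1 as $n$ grows. It illustrates thin-shell concentration only and does not reproduce the proof.

```python
import numpy as np

def norm_spread(n, samples=2000, seed=0):
    """Mean and standard deviation of ||x|| / sqrt(n) for x ~ N(0, I_n)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((samples, n))
    r = np.linalg.norm(x, axis=1) / np.sqrt(n)
    return r.mean(), r.std()

for n in (10, 100, 1000):
    mean, std = norm_spread(n)
    print(f"n={n:5d}  mean={mean:.3f}  std={std:.4f}")
```

The spread shrinks roughly like $1/\sqrt{2n}$, which is why wider layers make a fixed relative distortion increasingly unlikely.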
4. Scaling Law: Network Width versus Depth
To ensure joint forward and backward norm preservation over $L$ layers, the desired total failure probability $\delta$ leads to a scaling relation between the width $n$, the depth $L$, and $\delta$. In practice, this implies that the required network width $n$ grows only logarithmically with depth $L$ and the inverse error probabilities, making extremely deep yet stable BSNNs feasible with widths merely in the hundreds.
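The logarithmic dependence on depth can be illustrated with a union-bound sketch. The per-layer tail bound below, including the constant $c > 0$, the prefactor 4, and the $\epsilon^2$ exponent, is an assumed sub-Gaussian form chosen for illustration and is not the exact statement of Theorems 2 and 3:

$$
\Pr[\text{layer } l \text{ fails}] \le 4 e^{-c n \epsilon^2}
\quad\Longrightarrow\quad
\Pr[\text{some layer fails}] \le 4 L\, e^{-c n \epsilon^2} \le \delta
\quad\text{whenever}\quad
n \ge \frac{1}{c\,\epsilon^2} \log \frac{4L}{\delta}.
$$

Under this assumption, doubling $L$ raises the required width only by the additive amount $\log 2 / (c\,\epsilon^2)$.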
5. Empirical Evaluations
Experiments on both synthetic and real-world datasets corroborate the theoretical results:
- Synthetic (untrained): In deep untrained networks with random orthogonal weights $W^{(l)}$, non-GPN activations yield vanishing/exploding gradient norms, whereas GPN counterparts maintain both forward and backward norms near unity. As width $n$ increases from 100 to 1500, the gradient-norm ratio approaches 1, consistent with the scaling law.
- MNIST and CIFAR-10: A fully-connected architecture with 200 GPN layers (width 500), trained using SGD with momentum, demonstrates that standard activations (ReLU, LeakyReLU, GELU) typically fail, with vanishing gradients and near-random accuracy. GPN-converted activations are trainable and achieve substantially higher accuracies (e.g., SELU-GPN reaches 99% train accuracy on MNIST and 98% train/46% test on CIFAR-10). Relaxing strict orthogonality—using row normalization—still maintains the self-normalization property.
- Batch normalization (BN): Incorporating BN typically exacerbates gradient explosion, confirming that BSNNs obviate the necessity of BN for stability.
6. Implementation Methodology and Practical Recommendations
Major implementation steps are as follows:
- Select the desired base activation $\phi$ and determine $a, b$ for $\phi_{\mathrm{GPN}}(x) = a\,\phi(x) + b$ such that $\mathbb{E}[\phi_{\mathrm{GPN}}(z)^2] = \mathbb{E}[\phi_{\mathrm{GPN}}'(z)^2] = 1$ for $z \sim \mathcal{N}(0, 1)$. Monte-Carlo integration followed by solving two quadratic constraints suffices.
- Initialize each $W^{(l)}$ as exactly orthogonal (e.g., QR decomposition of a Gaussian matrix) or approximately via row normalization for computational ease.
- Ensure (empirically typical) Lipschitz properties for $\phi$ and $\phi'$.
- Choose width $n$ for the target depth $L$ and distortion $\epsilon$. Generally, a width of several times the depth is sufficient.
- Batch normalization is not required; BSNNs are stable absent BN even for networks with 200+ hidden layers.
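The two initialization options in the steps above can be compared directly. This is a sketch under stated assumptions: it uses NumPy, and it reads "row normalization" as scaling each row of a Gaussian matrix to unit Euclidean norm, which is one plausible interpretation rather than a confirmed detail of the source. The function names `orthogonal_init` and `row_normalized_init` are hypothetical.

```python
import numpy as np

def orthogonal_init(n, rng):
    """Exactly orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

def row_normalized_init(n, rng):
    """Gaussian matrix with every row scaled to unit norm (approximately orthogonal)."""
    w = rng.standard_normal((n, n))
    return w / np.linalg.norm(w, axis=1, keepdims=True)

rng = np.random.default_rng(0)
n = 256
x = rng.standard_normal(n)
for name, init in (("qr", orthogonal_init), ("row-norm", row_normalized_init)):
    w = init(n, rng)
    gram_err = np.abs(w @ w.T - np.eye(n)).max()  # deviation from W W^T = I
    ratio = np.linalg.norm(w @ x) / np.linalg.norm(x)
    print(f"{name:9s} max|WW^T - I| = {gram_err:.2e}   ||Wx||/||x|| = {ratio:.3f}")
```

The QR variant preserves norms to machine precision, while the row-normalized variant preserves them only approximately, with an error that shrinks as the width grows. The row-normalized version avoids the cubic cost of QR, which is the computational ease the steps refer to.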
7. Broader Implications and Distinctions
BSNNs resolve the vanishing/exploding gradient problem in deep, wide multilayer architectures by combining orthogonal weight matrices with moment-normalized (GPN) activations, an approach backed by high-probability guarantees. They do not require batch normalization or pseudo-linear initializations and provide theoretical and practical scalability as depth increases. The empirical results support the view that GPN activations in combination with orthogonal weights are both necessary and sufficient for sustained gradient flow and effective end-to-end training in very deep nonlinear networks (Lu et al., 2020).