
Bidirectional Self-Normalization in Deep Networks

Updated 11 May 2026
  • The paper introduces BSNNs that integrate orthogonal weight matrices and Gaussian–Poincaré normalized activations to maintain constant norms in both forward and backward passes.
  • It demonstrates that enforcing moment constraints on activations prevents vanishing/exploding gradients, enabling provable stability even in very deep networks.
  • Empirical results on synthetic and real datasets confirm that BSNNs achieve high training accuracy without batch normalization, thanks to their robust norm-preservation properties.

Bidirectionally Self-Normalizing Neural Networks (BSNNs) constitute a class of multilayer feedforward networks that, under mild structural constraints on width and activation functions, achieve high-probability preservation of both forward activations and backward gradients across arbitrarily deep architectures. BSNNs integrate orthogonal weight matrices and a new class of scalar activation functions—termed Gaussian–Poincaré normalized (GPN) functions—to eliminate vanishing and exploding gradient pathologies in wide, deep nonlinear networks, enabling provable stability of signal propagation without batch normalization or specialized initializations (Lu et al., 2020).

1. Gaussian–Poincaré Normalized Activation Functions

Let $\phi: \mathbb{R} \rightarrow \mathbb{R}$ denote a differentiable scalar activation function. $\phi$ is said to be Gaussian–Poincaré normalized if, when its input $x$ is standard normal ($x \sim \mathcal{N}(0,1)$), the following moment constraints are satisfied:

$$\mathbb{E}_{x \sim \mathcal{N}(0,1)}[\phi(x)^2] = 1,\qquad \mathbb{E}_{x \sim \mathcal{N}(0,1)}[\phi'(x)^2] = 1.$$

These constraints guarantee, by the Gaussian–Poincaré inequality (Bogachev, 1998),

$$\mathrm{Var}_{x \sim \mathcal{N}(0,1)}[\phi(x)] \le \mathbb{E}_{x \sim \mathcal{N}(0,1)}[\phi'(x)^2],$$

that $\phi$ neither amplifies nor attenuates signal energy in expectation. Given any standard activation function $\phi$ with finite, nonzero second moments, scalars $a, b$ can be efficiently computed to construct a GPN variant $\psi(x) = a \phi(x) + b$ that meets both normalization conditions. Explicit coefficients for common activation functions are tabulated below:

Activation       a        b
ReLU-GPN         1.4142   0
LeakyReLU-GPN    1.4141   0
ELU-GPN          1.2234   0.0742
SELU-GPN         0.9660   0.2585
GELU-GPN         1.4915   -0.9097
Tanh-GPN         1.4674   0.3885
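The two moment constraints determine the coefficients in closed form once three Gaussian expectations are known: $a = 1/\sqrt{\mathbb{E}[\phi'(x)^2]}$, and $b$ solves the quadratic $b^2 + 2a\,\mathbb{E}[\phi(x)]\,b + a^2\,\mathbb{E}[\phi(x)^2] - 1 = 0$. The following is a minimal sketch of that calculation, not the authors' procedure: it assumes the form $\psi(x) = a\phi(x) + b$ stated above, replaces Monte-Carlo sampling with a simple grid quadrature, and uses hypothetical helper names (`gaussian_expectation`, `gpn_coefficients`). The quadratic has two roots, and this sketch returns both because the rule used to pick the tabulated $b$ is not stated here.

```python
import numpy as np

def gaussian_expectation(f, lim=10.0, num=200_001):
    """Approximate E[f(x)] for x ~ N(0, 1) with a Riemann sum on [-lim, lim]."""
    x = np.linspace(-lim, lim, num)
    pdf = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
    return float(np.sum(f(x) * pdf) * (x[1] - x[0]))

def gpn_coefficients(phi, dphi):
    """Return a and both candidate roots b for psi(x) = a * phi(x) + b."""
    # E[psi'(x)^2] = a^2 * E[phi'(x)^2] = 1 fixes a (taking a > 0).
    a = 1.0 / np.sqrt(gaussian_expectation(lambda x: dphi(x) ** 2))
    mu = gaussian_expectation(phi)                    # E[phi(x)]
    m2 = gaussian_expectation(lambda x: phi(x) ** 2)  # E[phi(x)^2]
    # E[psi(x)^2] = 1  <=>  b^2 + 2*a*mu*b + (a^2*m2 - 1) = 0.
    disc = (a * mu) ** 2 - (a * a * m2 - 1.0)
    if disc < 0:
        raise ValueError("no real b satisfies the second-moment constraint")
    root = np.sqrt(disc)
    return a, (-a * mu + root, -a * mu - root)

# Illustrative usage on two activations from the table.
relu = lambda x: np.maximum(x, 0.0)
drelu = lambda x: (x > 0).astype(float)
dtanh = lambda x: 1.0 - np.tanh(x) ** 2

print("ReLU:", gpn_coefficients(relu, drelu))
print("tanh:", gpn_coefficients(np.tanh, dtanh))
```

The printed values can be compared against the ReLU-GPN and Tanh-GPN rows of the table; small differences come from the quadrature grid and from rounding in the tabulated values.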

2. Network Construction and Bidirectional Self-Normalization

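BSNNs are fully-connected, bias-free networks of $L$ layers. With input $x^{(1)}$, each layer $l = 1, \dots, L$ computes

$$h^{(l)} = W^{(l)} x^{(l)}, \qquad x^{(l+1)} = \phi(h^{(l)}),$$

where $W^{(l)} \in \mathbb{R}^{n \times n}$ is strictly or approximately orthogonal. For back-propagation of a loss $E$, let $d^{(l)} = \partial E / \partial h^{(l)}$, with $d^{(l)} = D^{(l)} (W^{(l+1)})^\top d^{(l+1)}$ and $D^{(l)} = \mathrm{diag}(\phi'(h^{(l)}))$. Bidirectional self-normalization is defined as the property that, for all intermediate layers $l$,

$$\|x^{(l)}\|_2 = \|x^{(1)}\|_2 \quad \text{and} \quad \|d^{(l)}\|_2 = \|d^{(L)}\|_2.$$

This ensures constant $\ell_2$ norm for both forward activations and backward error signals, eliminating depth-dependent signal decay or explosion. A key proposition follows: under bidirectional self-normalization, all weight-gradient Frobenius norms are equal, i.e.,

$$\left\|\frac{\partial E}{\partial W^{(1)}}\right\|_F = \left\|\frac{\partial E}{\partial W^{(2)}}\right\|_F = \dots = \left\|\frac{\partial E}{\partial W^{(L)}}\right\|_F.$$

This holds because each gradient $\partial E / \partial W^{(l)} = d^{(l)} (x^{(l)})^\top$ is a rank-one matrix whose Frobenius norm equals $\|d^{(l)}\|_2 \|x^{(l)}\|_2$.

Orthogonal weight matrices preserve input norms exactly ($\|W x\|_2 = \|x\|_2$), and GPN activations ensure no expected scaling distortion, collectively enforcing bidirectional norm preservation.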
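The norm-preservation behaviour can be probed numerically. The sketch below is an illustration under stated assumptions rather than the authors' experiment: it draws a fresh orthogonal matrix per layer via QR of a Gaussian matrix, uses the Tanh-GPN coefficients from the table, and feeds an arbitrary random vector in as the output-side error signal. The function name `propagate`, the width of 400, and the depth of 30 are illustrative choices.

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Sample an n x n orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # column sign fix for a uniform (Haar) draw

def propagate(a, b, n=400, depth=30, seed=0):
    """Forward/backward pass with psi(h) = a * tanh(h) + b.

    Returns the per-layer values of ||x||^2 / n and ||d||^2 / n.
    """
    rng = np.random.default_rng(seed)
    weights = [haar_orthogonal(n, rng) for _ in range(depth)]

    x = rng.standard_normal(n)
    x *= np.sqrt(n) / np.linalg.norm(x)  # put the input on the sphere of radius sqrt(n)
    pre_acts, fwd = [], [float(x @ x) / n]
    for w in weights:
        h = w @ x
        x = a * np.tanh(h) + b
        pre_acts.append(h)
        fwd.append(float(x @ x) / n)

    # Assumption: a random vector stands in for the loss gradient at the output.
    g = rng.standard_normal(n)
    g *= np.sqrt(n) / np.linalg.norm(g)
    dpsi = lambda h: a * (1.0 - np.tanh(h) ** 2)
    d = dpsi(pre_acts[-1]) * g
    bwd = [float(d @ d) / n]
    for l in range(depth - 2, -1, -1):
        d = dpsi(pre_acts[l]) * (weights[l + 1].T @ d)
        bwd.append(float(d @ d) / n)
    return np.array(fwd), np.array(bwd[::-1])

if __name__ == "__main__":
    fwd_gpn, bwd_gpn = propagate(1.4674, 0.3885)  # Tanh-GPN
    fwd_raw, bwd_raw = propagate(1.0, 0.0)        # plain tanh
    print("Tanh-GPN  final ||x||^2/n:", fwd_gpn[-1])
    print("tanh      final ||x||^2/n:", fwd_raw[-1])
    print("Tanh-GPN  ||d_1||^2 / ||d_L||^2:", bwd_gpn[0] / bwd_gpn[-1])
```

At finite width the norms fluctuate from layer to layer, so the sketch shows approximate rather than exact preservation; the plain tanh run is included only as a contrast for the forward pass.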

3. High-Dimensional Probabilistic Guarantees

Under the following assumptions:

  • Inputs $x$ exhibit thin-shell concentration,
  • Each $W^{(l)}$ is Haar-distributed on the orthogonal group $\mathbb{O}(n)$,
  • The activation $\phi$ is GPN, and $\phi$ and $\phi'$ are Lipschitz,
  • The back-propagated error vector at the output has bounded norm,

the forward and backward norms are preserved with overwhelming probability as $n \to \infty$. Theorem 2 (forward norm preservation) and Theorem 3 (backward norm preservation) assert that the activation norms and the gradient norms, respectively, stay within a small relative distortion of their normalized value at every layer, with a failure probability that vanishes as the width $n$ grows; the explicit bounds are given in Lu et al. (2020). The proof leverages the sphere-concentration of high-dimensional Gaussians, sub-Gaussianity of GPN activations, and union bounds to control deviations across all $L$ layers.
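The sphere-concentration ingredient is easy to check empirically: for a standard Gaussian vector $z \in \mathbb{R}^n$, the ratio $\|z\|_2^2 / n$ has mean 1 and standard deviation $\sqrt{2/n}$, so it tightens around 1 as the width grows. The sketch below illustrates only this ingredient, not the theorems themselves; the function name `shell_deviation` and the sample sizes are illustrative.

```python
import numpy as np

def shell_deviation(n, samples=2000, seed=0):
    """Empirical standard deviation of ||z||^2 / n for z ~ N(0, I_n)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((samples, n))
    q = np.sum(z * z, axis=1) / n
    return float(np.std(q))

if __name__ == "__main__":
    for n in (100, 400, 1600):
        # The second printed column is the theoretical value sqrt(2 / n).
        print(n, shell_deviation(n), np.sqrt(2.0 / n))
```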

4. Scaling Law: Network Width versus Depth

To ensure joint forward and backward norm preservation over $L$ layers, the desired total failure probability $\delta$ leads to a scaling relation of the form $n = O(\log(L/\delta))$, up to constants that depend on the allowed distortion. In practice, this implies that the required network width $n$ grows only logarithmically with depth $L$ and the inverse failure probability, making extremely deep yet stable BSNNs feasible with widths merely in the hundreds, even at very large depths.
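The logarithmic dependence is what a union bound produces when each layer fails with exponentially small probability. The derivation below is a sketch under an assumed sub-Gaussian tail: the constants $c_1, c_2 > 0$, the distortion $\varepsilon$, and the exact form of the per-layer bound are hypothetical placeholders, not the constants of the original theorems.

```latex
% Assumption (hypothetical constants c_1, c_2 > 0): at width n, each single
% forward or backward norm deviates by more than \varepsilon with probability
% at most c_1 \exp(-c_2 n \varepsilon^2).
\begin{align*}
\Pr\bigl[\text{some forward or backward norm deviates}\bigr]
  &\le 2L \, c_1 \exp\!\left(-c_2 n \varepsilon^2\right) \le \delta \\
\Longleftrightarrow \quad
n &\ge \frac{1}{c_2 \varepsilon^2} \log\frac{2 L c_1}{\delta}.
\end{align*}
```

Under this assumed tail, doubling the depth adds only $\log 2 / (c_2 \varepsilon^2)$ to the required width.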

5. Empirical Evaluations

Experiments on both synthetic and real-world datasets corroborate the theoretical results:

  • Synthetic (untrained): In untrained networks with random orthogonal weights $W^{(l)}$, non-GPN activations yield vanishing/exploding gradient norms, whereas GPN counterparts maintain both forward and backward norms near unity. As width $n$ increases from 100 to 1500, the gradient-norm ratio approaches 1, consistent with the scaling law.
  • MNIST and CIFAR-10: In a fully-connected architecture with 200 layers of width 500, trained using SGD with momentum, standard activations (ReLU, LeakyReLU, GELU) typically fail to train, showing vanishing gradients and near-random accuracy. GPN-converted activations are trainable and achieve substantially higher accuracies (e.g., SELU-GPN reaches 99% train accuracy on MNIST and 98% train/46% test accuracy on CIFAR-10). Relaxing strict orthogonality by using row normalization still maintains the self-normalization property.
  • Batch normalization (BN): Incorporating BN typically exacerbates gradient explosion, supporting the claim that BSNNs do not need BN for stability.

6. Implementation Methodology and Practical Recommendations

Major implementation steps are as follows:

  • Select the desired base activation $\phi$ and determine $(a, b)$ for $\psi(x) = a \phi(x) + b$ such that $\mathbb{E}_{x \sim \mathcal{N}(0,1)}[\psi(x)^2] = \mathbb{E}_{x \sim \mathcal{N}(0,1)}[\psi'(x)^2] = 1$. Monte-Carlo integration followed by solving two quadratic constraints suffices.
  • Initialize each $W^{(l)}$ as exactly orthogonal (e.g., QR decomposition of a Gaussian matrix) or approximately via row normalization for computational ease.
  • Ensure (empirically typical) Lipschitz properties for $\phi$ and $\phi'$.
  • Choose width $n$ for the target depth $L$ and distortion $\varepsilon$. Generally, a width of several times the depth is sufficient.
  • Batch normalization is not required; BSNNs are stable absent BN even for networks with 200+ hidden layers.
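The two initialization options in the steps above can be sketched as follows. This is a minimal illustration, assuming square weight matrices and using hypothetical helper names (`orthogonal_init`, `row_normalized_init`, `norm_ratio`); it compares how closely each option preserves the norm of a random input.

```python
import numpy as np

def orthogonal_init(n, rng):
    """Exactly orthogonal weights: QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # sign fix so the draw is uniform (Haar)

def row_normalized_init(n, rng):
    """Approximately orthogonal weights: Gaussian rows scaled to unit length."""
    w = rng.standard_normal((n, n))
    return w / np.linalg.norm(w, axis=1, keepdims=True)

def norm_ratio(w, x):
    """||W x|| / ||x||, which equals 1 for an exactly orthogonal W."""
    return float(np.linalg.norm(w @ x) / np.linalg.norm(x))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 500  # illustrative width
    x = rng.standard_normal(n)
    print("orthogonal     :", norm_ratio(orthogonal_init(n, rng), x))
    print("row-normalized :", norm_ratio(row_normalized_init(n, rng), x))
```

Row normalization avoids the cost of a QR decomposition per layer, at the price of preserving norms only approximately, with an error that shrinks as the width grows.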

7. Broader Implications and Distinctions

BSNNs address the vanishing/exploding gradient problem in deep, wide multilayer architectures through a rigorously grounded approach that combines orthogonal weights with moment-matched activation functions. They do not require batch normalization or pseudo-linear initializations, and the required width grows slowly as depth increases. The reported experiments indicate that GPN activations and orthogonal weights were both needed, and together sufficient, for sustained gradient flow and effective end-to-end training in very deep nonlinear networks (Lu et al., 2020).
