Bidirectional Self-Normalization in Deep Networks

Updated 11 May 2026

The paper introduces BSNNs that integrate orthogonal weight matrices and Gaussian–Poincaré normalized activations to maintain constant norms in both forward and backward passes.
It demonstrates that enforcing moment constraints on activations prevents vanishing/exploding gradients, enabling provable stability even in very deep networks.
Empirical results on synthetic and real datasets confirm that BSNNs achieve high training accuracy without batch normalization, thanks to their robust norm-preservation properties.

Bidirectionally Self-Normalizing Neural Networks (BSNNs) constitute a class of multilayer feedforward networks that, under mild structural constraints on width and activation functions, achieve high-probability preservation of both forward activations and backward gradients across arbitrarily deep architectures. BSNNs integrate orthogonal weight matrices and a new class of scalar activation functions—termed Gaussian–Poincaré normalized (GPN) functions—to eliminate vanishing and exploding gradient pathologies in wide, deep nonlinear networks, enabling provable stability of signal propagation without batch normalization or specialized initializations (Lu et al., 2020).

1. Gaussian–Poincaré Normalized Activation Functions

Let $\phi: \mathbb{R} \rightarrow \mathbb{R}$ denote a differentiable scalar activation function. $\phi$ is said to be Gaussian–Poincaré normalized (GPN) if, for a standard normal input $x \sim \mathcal{N}(0,1)$, the following moment constraints are satisfied:

$\mathbb{E}_{x \sim \mathcal{N}(0,1)}[\phi(x)^2] = 1,\qquad \mathbb{E}_{x \sim \mathcal{N}(0,1)}[\phi'(x)^2] = 1.$

The first constraint fixes the expected energy of the activations and the second that of their derivatives, so that $\phi$ neither amplifies nor attenuates signal energy in expectation. The two moments are linked by the Gaussian–Poincaré inequality (Bogachev, 1998),

$\mathrm{Var}_{x \sim \mathcal{N}(0,1)}[\phi(x)] \le \mathbb{E}_{x \sim \mathcal{N}(0,1)}[\phi'(x)^2].$

Given any standard activation function $\phi$ with finite second moments and nonzero $\mathbb{E}[\phi'(x)^2]$, scalars $a, b$ can be efficiently computed to construct a GPN variant $\psi(x) = a \phi(x) + b$ that meets both normalization conditions. The derivative constraint fixes $a = 1/\sqrt{\mathbb{E}[\phi'(x)^2]}$, and the energy constraint then becomes a quadratic in $b$ whose discriminant, $4\,(1 - a^2\,\mathrm{Var}[\phi(x)])$, is nonnegative by the inequality above, so a real solution always exists. Explicit coefficients for common activation functions are tabulated below:

Activation	$a$	$b$
ReLU-GPN	1.4142	0
LeakyReLU-GPN	1.4141	0
ELU-GPN	1.2234	0.0742
SELU-GPN	0.9660	0.2585
GELU-GPN	1.4915	-0.9097
Tanh-GPN	1.4674	0.3885
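As a rough check on the table, the coefficients can be recomputed numerically. The sketch below is not the authors' procedure: it assumes a simple Riemann sum over a fine grid in place of Monte-Carlo integration, and the helper name `gpn_coefficients` and its parameters are hypothetical. Because the energy constraint is a quadratic in $b$, the function returns both roots; the table lists one of them for each activation.

```python
import numpy as np

def gpn_coefficients(phi, dphi, half_width=10.0, num=200_001):
    """Return a and both candidate b values so that a*phi(x) + b is GPN."""
    # Riemann-sum weights approximating the N(0, 1) density on a fine grid.
    x = np.linspace(-half_width, half_width, num)
    w = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi) * (x[1] - x[0])

    mean = np.sum(w * phi(x))            # E[phi(x)]
    second = np.sum(w * phi(x) ** 2)     # E[phi(x)^2]
    dsecond = np.sum(w * dphi(x) ** 2)   # E[phi'(x)^2]

    # Derivative constraint: a^2 * E[phi'^2] = 1 (positive root taken here).
    a = 1.0 / np.sqrt(dsecond)
    # Energy constraint: b^2 + 2*a*mean*b + a^2*second - 1 = 0.
    disc = max(1.0 - a**2 * (second - mean**2), 0.0)
    root = np.sqrt(disc)
    return float(a), (float(-a * mean - root), float(-a * mean + root))

a_tanh, b_tanh = gpn_coefficients(np.tanh, lambda x: 1.0 / np.cosh(x) ** 2)
a_relu, b_relu = gpn_coefficients(lambda x: np.maximum(x, 0.0),
                                  lambda x: (x > 0).astype(float))
print(f"tanh: a = {a_tanh:.4f}, candidate b values = {b_tanh}")
print(f"ReLU: a = {a_relu:.4f}, candidate b values = {b_relu}")
```

For tanh and ReLU, the larger of the two roots should agree with the tabulated $b$ to about three decimal places; other rows of the table may correspond to the other root.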

2. Network Construction and Bidirectional Self-Normalization

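BSNNs are fully-connected, bias-free networks of $L$ layers. Each layer $l = 1, \dots, L$ computes

$h^{(l)} = W^{(l)} x^{(l-1)},\qquad x^{(l)} = \phi(h^{(l)}),$

where $W^{(l)} \in \mathbb{R}^{n \times n}$ is strictly or approximately orthogonal. For back-propagation, let $d^{(l)} = D^{(l)} (W^{(l+1)})^\top d^{(l+1)}$, with $d^{(l)} = \partial E / \partial h^{(l)}$ for a loss $E$ and $D^{(l)} = \mathrm{diag}\big(\phi'(h^{(l)}_1), \dots, \phi'(h^{(l)}_n)\big)$. Bidirectional self-normalization is defined as the property that, across all layers $l = 1, \dots, L$,

$\|x^{(1)}\|_2 = \cdots = \|x^{(L)}\|_2,\qquad \|d^{(1)}\|_2 = \cdots = \|d^{(L)}\|_2.$

This ensures constant $\ell_2$ norm for both forward activations and backward error signals, eliminating depth-dependent signal decay or explosion. A key proposition follows: under bidirectional self-normalization, all weight-gradient Frobenius norms are equal, i.e.,

$\left\|\frac{\partial E}{\partial W^{(1)}}\right\|_F = \cdots = \left\|\frac{\partial E}{\partial W^{(L)}}\right\|_F.$

The reason is that $\partial E / \partial W^{(l)} = d^{(l)} (x^{(l-1)})^\top$ is a rank-one matrix with Frobenius norm $\|d^{(l)}\|_2 \|x^{(l-1)}\|_2$ (for $l = 1$ this additionally uses an input $x^{(0)}$ of the same norm). Orthogonal weight matrices preserve input norms exactly ($\|W^{(l)} x\|_2 = \|x\|_2$), and GPN activations ensure no expected scaling distortion, collectively enforcing bidirectional norm preservation.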
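The effect of the two ingredients can be illustrated with a small untrained simulation. The sketch below is an assumption-laden illustration, not the paper's experiment: the width (512), depth (30), seed, function names, and the use of a random vector as a stand-in for the loss gradient at the last layer are all choices made here. It compares plain ReLU with ReLU-GPN (scale $a = 1.4142$, offset $0$, taken from the table) under Haar-distributed orthogonal weights.

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Sample an n x n orthogonal matrix from the Haar measure via QR."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # fix column signs so Q is Haar-distributed

def norm_ratios(act, dact, n=512, depth=30, seed=0):
    """Return (||x_last||^2 / ||x_input||^2, ||d_first||^2 / ||d_last||^2)."""
    rng = np.random.default_rng(seed)
    weights = [haar_orthogonal(n, rng) for _ in range(depth)]

    # Forward pass: h = W x, x = act(h), layer by layer.
    x = rng.standard_normal(n)
    x0_sq = x @ x
    pre_acts = []
    for w in weights:
        h = w @ x
        pre_acts.append(h)
        x = act(h)
    forward = (x @ x) / x0_sq

    # Backward pass: d_k = act'(h_k) * (W_{k+1}^T d_{k+1}).
    # Assumption: the last-layer d is a random vector standing in for dE/dh.
    d = rng.standard_normal(n)
    d_last_sq = d @ d
    for k in range(depth - 2, -1, -1):
        d = dact(pre_acts[k]) * (weights[k + 1].T @ d)
    backward = (d @ d) / d_last_sq
    return float(forward), float(backward)

A_RELU = 1.4142  # ReLU-GPN scale from the table; its offset b is 0

def relu(h): return np.maximum(h, 0.0)
def drelu(h): return (h > 0).astype(float)
def relu_gpn(h): return A_RELU * relu(h)
def drelu_gpn(h): return A_RELU * drelu(h)

if __name__ == "__main__":
    print("ReLU     (forward, backward):", norm_ratios(relu, drelu))
    print("ReLU-GPN (forward, backward):", norm_ratios(relu_gpn, drelu_gpn))
```

With plain ReLU both squared-norm ratios should shrink by roughly a factor of two per layer, while the GPN version should stay within a modest factor of 1; the residual spread at finite width is expected to narrow as the width grows.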

3. High-Dimensional Probabilistic Guarantees

Under the following assumptions:

Inputs $x^{(0)}$ exhibit thin-shell concentration,
Each $W^{(l)}$ is Haar-distributed in the orthogonal group $O(n)$,
$\phi$ is GPN, and $\phi$ and $\phi'$ are Lipschitz,
The back-propagation vector has bounded norm,

the forward and backward norms are preserved with overwhelming probability as the width $n \to \infty$. Theorem 2 (forward norm preservation) and Theorem 3 (backward norm preservation) state this for the activations and for the back-propagated vectors, respectively, at every layer. The proof leverages the sphere-concentration of high-dimensional Gaussians, sub-Gaussianity of GPN activations, and union bounds to control deviations across all $L$ layers.
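Thin-shell concentration is easiest to see for a standard Gaussian vector, whose normalized squared norm $\|x\|_2^2 / n$ has mean 1 and standard deviation $\sqrt{2/n}$. The following sketch (the function name, trial count, and dimensions are illustrative choices) estimates that spread empirically for a few dimensions.

```python
import numpy as np

def shell_deviation(n, trials=500, seed=0):
    """Empirical std of ||x||^2 / n over `trials` draws of x ~ N(0, I_n)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((trials, n))
    return float(np.std(np.sum(x**2, axis=1) / n))

for n in (64, 512, 4096):
    # For a standard Gaussian the exact value is sqrt(2 / n).
    print(n, shell_deviation(n), np.sqrt(2.0 / n))
```

The estimated spread should track $\sqrt{2/n}$ and shrink as the dimension increases, which is the behavior the first assumption asks of the network input.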

4. Scaling Law: Network Width versus Depth

To ensure joint forward and backward norm preservation over all $L$ layers with a prescribed total failure probability, the width must scale with the depth and with that probability. In practice, this implies that the required network width $n$ grows only logarithmically with depth $L$ and the inverse error probabilities, making extremely deep yet stable BSNNs feasible with widths merely in the hundreds.
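To see where a logarithmic dependence on depth can come from, suppose, purely as an illustrative assumption and not as the paper's exact statement, that at each layer the forward or the backward norm deviates by more than $\epsilon$ with probability at most $2\exp(-c\,n\,\epsilon^2)$ for some constant $c > 0$. A union bound over the $2L$ forward and backward events then gives

$\Pr[\text{some layer deviates by more than } \epsilon] \le 4L \exp(-c\,n\,\epsilon^2) \le \delta \quad\text{whenever}\quad n \ge \frac{1}{c\,\epsilon^2} \log\frac{4L}{\delta}.$

Here $\epsilon$, $\delta$, $c$, and the form of the per-layer tail are hypothetical; the point is only that an exponentially small per-layer failure probability, combined with a union bound, makes the width requirement logarithmic in $L$ and in $1/\delta$.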

5. Empirical Evaluations

Experiments on both synthetic and real-world datasets corroborate the theoretical results:

Synthetic (untrained): In deep networks with random orthogonal weight matrices $W^{(l)}$, non-GPN activations yield vanishing or exploding gradient norms, whereas their GPN counterparts keep both forward and backward norms near unity. As the width $n$ increases from 100 to 1500, the gradient-norm ratio approaches 1, consistent with the scaling law.
MNIST and CIFAR-10: In a fully-connected architecture with 200 layers of width 500, trained using SGD with momentum, standard activations (ReLU, LeakyReLU, GELU) typically fail to train, showing vanishing gradients and near-random accuracy. Their GPN-converted versions are trainable and achieve substantially higher accuracies (e.g., SELU-GPN reaches 99% train accuracy on MNIST and 98% train/46% test accuracy on CIFAR-10). Relaxing strict orthogonality by using row normalization still maintains the self-normalization property.
Batch normalization (BN): Incorporating BN typically exacerbates gradient explosion, indicating that BN is not needed for stability in BSNNs.

6. Implementation Methodology and Practical Recommendations

Major implementation steps are as follows:

Select the desired base activation $\phi$ and determine $a, b$ for $\psi(x) = a\phi(x) + b$ such that $\mathbb{E}[\psi(x)^2] = \mathbb{E}[\psi'(x)^2] = 1$ under $x \sim \mathcal{N}(0,1)$. Monte-Carlo integration followed by solving two quadratic constraints suffices.
Initialize each $W^{(l)}$ as exactly orthogonal (e.g., via QR decomposition of a Gaussian matrix) or as approximately orthogonal via row normalization for computational ease.
Ensure (empirically typical) Lipschitz properties for $\phi$ and $\phi'$.
Choose the width $n$ for the target depth $L$ and distortion tolerance. In the experiments reported above, a width of a few times the depth (500 for 200 layers) was sufficient.
Batch normalization is not required; BSNNs are stable without BN even for networks with 200+ hidden layers.
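The weight-initialization step above can be sketched as follows. The function names are hypothetical, and "row normalization" is interpreted here, as an assumption, as drawing a Gaussian matrix and rescaling each row to unit $\ell_2$ norm; the source does not spell out the exact procedure.

```python
import numpy as np

def exact_orthogonal(n, rng):
    """Exactly orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

def row_normalized(n, rng):
    """Approximately orthogonal matrix: Gaussian rows rescaled to unit l2 norm."""
    g = rng.standard_normal((n, n))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

def init_bsnn_weights(n, depth, exact=True, seed=0):
    """Return a list of `depth` weight matrices of size n x n."""
    rng = np.random.default_rng(seed)
    sampler = exact_orthogonal if exact else row_normalized
    return [sampler(n, rng) for _ in range(depth)]

def orthogonality_error(w):
    """Largest absolute entry of W W^T - I."""
    return float(np.max(np.abs(w @ w.T - np.eye(w.shape[0]))))

if __name__ == "__main__":
    for exact in (True, False):
        ws = init_bsnn_weights(256, depth=3, exact=exact)
        print("exact" if exact else "row-normalized",
              [orthogonality_error(w) for w in ws])
```

The QR variant is orthogonal up to floating-point error, while the row-normalized variant has unit-norm rows that are only nearly orthogonal to one another, with the off-diagonal overlap shrinking as the width grows. It avoids the QR factorization, which is the computational convenience mentioned in the list.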

7. Broader Implications and Distinctions

BSNNs address the vanishing/exploding gradient problem in deep, wide multilayer architectures through a rigorously grounded approach that combines orthogonal weights with moment-matched activation redesign. They do not require batch normalization or pseudo-linear initializations, and their guarantees scale with depth in both theory and practice. The empirical results indicate that GPN activations combined with orthogonal (or approximately orthogonal) weights sustain gradient flow and permit end-to-end training of very deep nonlinear networks in settings where the standard activations tested fail (Lu et al., 2020).

References

Lu et al. (2020). Bidirectionally Self-Normalizing Neural Networks.
