NFNets: Normalizer-Free CNN Architecture
- NFNets are convolutional networks that eliminate batch normalization by using architectural innovations like scaled weight standardization and SkipInit to ensure variance preservation.
- They employ Adaptive Gradient Clipping to bound each unit's gradient norm relative to its weight norm, enabling stable training even with large batch sizes and aggressive data augmentations.
- Architectural enhancements and explicit regularization yield superior ImageNet performance and robust transfer learning benefits compared to traditional BN-based models.
Normalizer-Free Networks (NFNets) are convolutional neural network (CNN) architectures designed for large-scale image recognition that dispense with normalization layers such as Batch Normalization (BN). Through architectural modifications, careful initialization strategies, and the introduction of Adaptive Gradient Clipping (AGC), NFNets achieve state-of-the-art accuracy and training stability across a range of depths and widths, matching or exceeding the performance of the strongest batch-normalized baselines on ImageNet and in transfer learning settings.
1. Motivation: Limitations of Batch Normalization in Large-Scale CNNs
Batch normalization (BN) is a ubiquitous component of modern image classification models such as ResNets but has notable drawbacks, including dependence on batch size, inter-example interaction, increased memory/compute overhead, and discrepancies between training and inference. These issues result in reproducibility challenges, inefficiency at small scales, and compromised model performance under large batch sizes or aggressive augmentations. Prior attempts at training deep ResNets without normalization failed to match the accuracies of BN-equipped models and were often unstable at large learning rates or data augmentation strengths.
2. Architectural Principles and Structural Innovations
The base configuration of an NFNet is a highly optimized SE-ResNeXt-D variant, with substantial modifications to render the network "normalizer-free":
- Suppression of Residual Branch Scale: Each residual block computes $h_{i+1} = h_i + \alpha f_i(h_i / \beta_i)$, where $h_i$ is the input to block $i$, $f_i$ is the residual function, $\alpha$ is a small scalar (approximately 0.2), and $\beta_i = \sqrt{\mathrm{Var}[h_i]}$ is the predicted standard deviation of the input to block $i$ (this block structure is sketched in code at the end of this section).
- Variance Preservation at Initialization: The variance of the residual function is initialized to match that of its input, i.e., $\mathrm{Var}[f_i(z)] = \mathrm{Var}[z]$.
- SkipInit: A learnable scalar (initialized to zero) at the end of each residual branch, improving optimizer stability and facilitating adaptation in very deep stacks.
- Scaled Weight Standardization: Weight tensors are reparameterized per output channel as $\hat{W}_{ij} = \gamma \frac{W_{ij} - \mu_i}{\sigma_i \sqrt{N}}$, where $\mu_i$ and $\sigma_i$ are, respectively, the mean and standard deviation of the weights in output channel $i$, $N$ is the fan-in, and $\gamma$ is a fixed gain chosen for the nonlinearity.
- No Activation Normalization: BN, LayerNorm, or similar normalization is absent from all residual paths and the network at large.
- Explicit Regularization: Dropout and Stochastic Depth replace the implicit regularization provided by normalization layers.
In addition, Squeeze-and-Excite (SE) blocks are retained, with their output multiplied by 2 to maintain variance consistency, and group convolutions follow a ResNeXt-style design (fixed group width of 128 channels per convolution).
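As an illustration of how these pieces fit together, the following is a minimal PyTorch sketch, assuming rather than reproducing the paper's implementation, of a normalizer-free residual block combining the $\alpha$/$\beta$ scaling, Scaled Weight Standardization, and SkipInit described above; the class names `WSConv2d` and `NFBlock`, the ReLU nonlinearity, and the plain two-convolution branch are simplifications introduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Conv2d whose weights are reparameterized with Scaled Weight Standardization."""
    def forward(self, x):
        w = self.weight                                   # (out_ch, in_ch/groups, kH, kW)
        mean = w.mean(dim=(1, 2, 3), keepdim=True)        # per-output-channel mean
        std = w.std(dim=(1, 2, 3), keepdim=True, unbiased=False)
        fan_in = w[0].numel()                             # N = in_ch/groups * kH * kW
        gamma = 1.71                                      # fixed gain, ~sqrt(2 / (1 - 1/pi)) for ReLU
        w_hat = gamma * (w - mean) / (std * fan_in ** 0.5 + 1e-8)
        return F.conv2d(x, w_hat, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

class NFBlock(nn.Module):
    """Simplified normalizer-free residual block: h_{i+1} = h_i + alpha * f(h_i / beta_i)."""
    def __init__(self, channels, alpha=0.2, beta=1.0):
        super().__init__()
        self.alpha, self.beta = alpha, beta
        self.conv1 = WSConv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = WSConv2d(channels, channels, kernel_size=3, padding=1)
        self.skip_gain = nn.Parameter(torch.zeros(()))    # SkipInit: zero-initialized scalar

    def forward(self, h):
        out = F.relu(h / self.beta)                       # downscale by predicted input std
        out = self.conv2(F.relu(self.conv1(out)))
        return h + self.alpha * self.skip_gain * out      # suppressed residual branch

x = torch.randn(8, 64, 32, 32)
print(NFBlock(64)(x).shape)                               # torch.Size([8, 64, 32, 32])
```

In the full NFNet block the residual branch is a bottleneck with two grouped 3×3 convolutions and an SE block whose output is multiplied by 2; those details are omitted here for brevity.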
3. Adaptive Gradient Clipping and Training Stability
Training instabilities in normalization-free architectures are principally attributable to uncontrolled gradient scaling. While conventional global norm clipping is brittle, NFNets employ Adaptive Gradient Clipping (AGC), which operates per-filter/channel/unit:
For each unit $i$ of a layer $\ell$ (e.g., each output channel of a weight tensor), let $G^{\ell}_i$ denote its gradient and $W^{\ell}_i$ its weight vector. AGC rescales the gradient according to
$G^{\ell}_i \rightarrow \lambda \frac{\|W^{\ell}_i\|^{\star}_F}{\|G^{\ell}_i\|_F} G^{\ell}_i \ \text{ if } \ \frac{\|G^{\ell}_i\|_F}{\|W^{\ell}_i\|^{\star}_F} > \lambda,$
and leaves it unchanged otherwise, where $\|\cdot\|_F$ denotes the Frobenius norm, $\|W^{\ell}_i\|^{\star}_F = \max(\|W^{\ell}_i\|_F, \epsilon)$ with a small $\epsilon$ that prevents zero-initialized parameters from having their gradients clipped to zero, and $\lambda$ is a scalar hyperparameter.
AGC directly enables stable training with large batch sizes (up to 4096) and under strong data augmentations (e.g., RandAugment), making feasible the scale and regularization regimes that defeat prior normalization-free methods.
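A minimal PyTorch sketch of AGC, applied in place to the gradients before the optimizer step; the unit-wise norm is taken per output channel (the leading dimension of each weight tensor), and the function name `adaptive_grad_clip_` and the default values are illustrative assumptions rather than the paper's code.

```python
import torch

def unitwise_norm(t: torch.Tensor) -> torch.Tensor:
    """Frobenius norm per unit: per output channel for weights, per element for biases/gains."""
    if t.ndim <= 1:
        return t.abs()
    return t.norm(p=2, dim=tuple(range(1, t.ndim)), keepdim=True)

@torch.no_grad()
def adaptive_grad_clip_(parameters, clipping=0.01, eps=1e-3):
    """Rescale each unit's gradient so that ||G_i||_F / max(||W_i||_F, eps) <= clipping (lambda)."""
    for p in parameters:
        if p.grad is None:
            continue
        w_norm = unitwise_norm(p).clamp_(min=eps)          # ||W_i||*_F = max(||W_i||_F, eps)
        g_norm = unitwise_norm(p.grad)
        max_norm = clipping * w_norm                       # lambda * ||W_i||*_F
        scale = (max_norm / g_norm.clamp(min=1e-6)).clamp(max=1.0)
        p.grad.mul_(scale)                                 # only shrinks gradients that exceed the ratio

# Typical use inside a training step:
#   loss.backward()
#   adaptive_grad_clip_(model.parameters(), clipping=0.01)
#   optimizer.step()
```

The paper reports that it is beneficial not to clip the final linear (classifier) layer and that the best $\lambda$ shrinks as the batch size grows.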
4. Detailed NFNet Architecture: Structural and Hyperparameter Regimes
NFNets, designated F0–F6, are scaled variants of the SE-ResNeXt-D backbone, further optimized for training efficiency, accuracy, and normalizer-free compatibility:
- Backbone Depth and Width: A modified stagewise depth pattern (e.g., [1, 2, 6, 3] in F0) and width configuration ([256, 512, 1536, 1536]), with depth scaled up for the larger variants (see the configuration sketch at the end of this section).
- Two Grouped Convolutions per Block: Enhanced expressivity and feature propagation within each residual block.
- Downsampling Strategy: Each transition block begins a new stage with average pooling, a stride increase, and channel expansion (sketched in code after this list).
- Absence of BN or Normalization: Scaled Weight Standardization and the $\alpha$, $\beta_i$ scaling of residual branches replace normalization layers for maintaining controlled signal propagation.
- Explicit Regularization: Dropout and Stochastic Depth rates increase with model size, compensating for the removal of normalization-induced regularization.
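The downsampling path can be sketched as follows; this is a plausible reading of the ResNeXt-D-style transition described above (average pooling followed by a 1×1 convolution on the shortcut), and the exact placement of the $\beta$ rescaling, the kernel sizes, and the class name `NFTransitionShortcut` are assumptions of this sketch rather than details confirmed by the summary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NFTransitionShortcut(nn.Module):
    """Shortcut of a transition block: downsample with average pooling, then expand
    channels with a 1x1 convolution (ResNet-D style), on the beta-rescaled input."""
    def __init__(self, in_ch, out_ch, beta=1.0):
        super().__init__()
        self.beta = beta
        # In the full model this convolution would also use Scaled Weight Standardization.
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, h):
        out = F.avg_pool2d(h / self.beta, kernel_size=2, stride=2)  # spatial downsampling
        return self.conv(out)                                       # channel expansion

shortcut = NFTransitionShortcut(in_ch=256, out_ch=512)
print(shortcut(torch.randn(8, 256, 32, 32)).shape)  # torch.Size([8, 512, 16, 16])
```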
Table summarizing selected architecture and training characteristics:
| Feature | Value/Description | Remarks |
|---|---|---|
| Group Convolution Width | 128 (per conv) | ResNeXt-style |
| Downsampling | AvgPool + channel increase | Transition block |
| Squeeze-and-Excite Block | Output multiplied by 2 | Maintains variance |
| Regularization | Dropout, Stochastic Depth | Scaled by model size |
| No normalization layers | — | BN, LayerNorm, etc., absent |
| AGC | $\lambda$ hyperparameter controls clipping strength | Required for stability |
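To make the family scaling concrete, here is a small Python sketch of the F0–F6 stage configurations; the depth rule (variant F$N$ multiplies each F0 stage depth by $N+1$) follows the NFNets paper's construction, while the stage widths and the fixed group width of 128 follow the description above, and the helper names are illustrative.

```python
# Illustrative stagewise configuration for the NFNet family (F0-F6). Variant FN multiplies
# each F0 stage depth by N + 1; stage widths and the ResNeXt-style group width of 128
# channels are fixed across variants.
F0_DEPTHS = [1, 2, 6, 3]
STAGE_WIDTHS = [256, 512, 1536, 1536]
GROUP_WIDTH = 128

def nfnet_depths(variant: int) -> list:
    """Stage depths for NFNet-F<variant>, e.g. F0 -> [1, 2, 6, 3], F1 -> [2, 4, 12, 6]."""
    return [d * (variant + 1) for d in F0_DEPTHS]

def num_groups(channels: int) -> int:
    """Number of groups for a grouped convolution with a fixed group width of 128 channels."""
    return max(1, channels // GROUP_WIDTH)

for n in range(7):
    print(f"NFNet-F{n}: depths={nfnet_depths(n)}, widths={STAGE_WIDTHS}")
```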
5. Empirical Performance: Accuracy, Speed, and Robustness
NFNets achieve leading performance on large-scale recognition tasks:
- ImageNet Top-1 Accuracy:
  - NFNet-F1 matches EfficientNet-B7's 84.7% Top-1 but trains 8.7× faster (158.5 ms vs. 1397 ms on TPUv3).
  - NFNet-F5+SAM reaches 86.3%; NFNet-F6+SAM reaches 86.5% (state of the art without extra data).
- Batch-normalized models of similar design are slower (20–40%), less accurate, and less stable at scale.
- Large-scale Transfer Learning: Following pre-training on 300M images, NFNet-F4+ attains 89.2% Top-1 (second only to specialized or semi-supervised models).
- Training Stability: NFNets sustain stability at large batch sizes and under strong data augmentations; AGC is the key enabler.
Performance summary (abridged from paper):
| Model | Top-1 (%) | Train Latency (ms, TPUv3) | Notes |
|---|---|---|---|
| EffNet-B7 | 84.7 | 1397 | EfficientNet baseline |
| NFNet-F1 | 84.7 | 158.5 | 8.7× faster than B7 |
| NFNet-F6+SAM | 86.5 | 2774 | SOTA, no extra data |
A plausible implication is that, for a given compute budget, NFNets yield superior accuracy and faster training compared to both batch-normalized CNNs and EfficientNet models.
6. Theoretical Formulations and Mathematical Framework
Core mathematical contributions underlying stability and expressivity in NFNets are:
- Residual Block Without Normalization: $h_{i+1} = h_i + \alpha f_i(h_i / \beta_i)$, with $\mathrm{Var}[f_i(z)] = \mathrm{Var}[z]$ at initialization and $\beta_i = \sqrt{\mathrm{Var}[h_i]}$.
- Scaled Weight Standardization: $\hat{W}_{ij} = \gamma \frac{W_{ij} - \mu_i}{\sigma_i \sqrt{N}}$, where $\mu_i$ is the output channel mean, $\sigma_i$ the output channel standard deviation, and $N$ the fan-in.
- AGC Update Rule: $G^{\ell}_i \rightarrow \lambda \frac{\|W^{\ell}_i\|^{\star}_F}{\|G^{\ell}_i\|_F} G^{\ell}_i$ whenever $\|G^{\ell}_i\|_F / \|W^{\ell}_i\|^{\star}_F > \lambda$, with $\|W^{\ell}_i\|^{\star}_F = \max(\|W^{\ell}_i\|_F, \epsilon)$.
These strategies jointly address the optimization difficulty introduced by the removal of normalization, enabling deep and wide CNNs to propagate signal and gradients effectively during training.
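To make the signal-propagation bookkeeping concrete, the short sketch below computes the $\beta_i$ values analytically, assuming, as in the normalizer-free ResNet scheme this summary builds on, that the expected variance grows by $\alpha^2$ per block within a stage and is reset at each transition block; the unit-variance stem output and the exact reset value are assumptions of this illustration.

```python
# Analytic variance bookkeeping used to set beta_i in a normalizer-free network.
# Within a stage, Var(h_{i+1}) = Var(h_i) + alpha^2; at a transition block the shortcut
# also sees the rescaled input, so the expected variance is assumed to reset to 1 + alpha^2.
def betas_per_block(stage_depths, alpha=0.2):
    """Predicted input std beta_i for each residual block under the analytic variance rule."""
    betas, var = [], 1.0                    # assume the stem output has unit variance
    for depth in stage_depths:
        for block in range(depth):
            betas.append(var ** 0.5)        # beta_i = sqrt(Var(h_i))
            if block == 0:                  # transition block: shortcut operates on h_i / beta_i,
                var = 1.0 + alpha ** 2      # so the expected variance resets to 1 + alpha^2
            else:
                var += alpha ** 2           # Var(h_{i+1}) = Var(h_i) + alpha^2
    return betas

# F0 depth pattern [1, 2, 6, 3] as an example:
print([round(b, 3) for b in betas_per_block([1, 2, 6, 3])])
```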
7. Contributions, Significance, and Implications
The principal contributions of the NFNets architecture and methodology are:
- Adaptive Gradient Clipping (AGC): Introduces per-unit gradient clipping, promoting training stability without BN, especially at large batch sizes or with heavy augmentation.
- State-of-the-art Normalizer-Free ResNets: A suite of architectural modifications—including residual suppression, weight standardization, explicit regularization, and initialization enhancements—yields models that match or exceed BN-based and EfficientNet baselines on standard benchmarks and at faster training speeds.
- Superiority in Transfer Learning: Normalizer-free models outperform batch-normalized counterparts by up to 1% Top-1 accuracy following large-scale pretraining, demonstrating robustness and adaptability.
- Disadvantages of BN Highlighted: NFNets avoid BN's issues of increased memory usage, non-reproducibility, and impaired transferability, while achieving similar or better results.
NFNets establish that normalization is not a necessary requirement for state-of-the-art performance and scalability in deep network architectures. The combination of variance control, scaled weight standardization, strategic initialization, explicit regularization, and AGC provides a comprehensive methodology for constructing and training normalization-free deep CNNs suited for large and data-rich learning environments.