Shallower Neural Networks
- Shallower neural networks are architectures with fewer sequential layers that maintain competitive performance through parallelization and specialized transformation techniques.
- Techniques such as layer folding, MSE-optimal fusion, and truncation convert deep models into shallow forms with minimal accuracy loss and faster convergence.
- Despite their benefits in efficiency and hardware parallelism, shallow networks face expressivity limits and require exponential width for certain complex tasks compared to deep architectures.
Shallower neural networks are architectures characterized by a reduced number of sequential layers, typically adopted for optimization efficiency, hardware constraints, or task specialization. Theoretical and empirical studies have clarified both the capabilities and limitations of shallow architectures, sometimes revealing equivalence to deeper models under key architectural and initialization conditions, while also demonstrating fundamental boundaries in expressivity and optimization. Shallower variants span single-layer perceptrons, shallow residual networks, shallow graph convolutional networks (GCNs), and algorithms that transform deep models into shallower forms through truncation, folding, or fusion.
1. Theoretical Foundations and Equivalence to Deep Architectures
The performance equivalence of certain shallow architectures, notably those derived from deep residual networks, was established through a Taylor-like expansion of stacked residual blocks. A single residual block can be formalized as $x_{l+1} = x_l + f_l(x_l)$. For a stack of $n$ blocks, recursion yields
$$x_n = x_0 + \sum_{l=0}^{n-1} f_l(x_l),$$
which, when each $f_l$ is linearized around the initial input $x_0$, is expressible as a series:
$$x_n = x_0 + \sum_{l=0}^{n-1} f_l(x_0) + \sum_{l<m} J_{f_m}(x_0)\, f_l(x_0) + \cdots,$$
where $J_{f_m}$ denotes the Jacobian of $f_m$. Truncating the higher-order terms (keeping only first-order) leads to
$$x_n \approx x_0 + \sum_{l=0}^{n-1} f_l(x_0).$$
This implies that the representational power of a deep stack is preserved by a single layer composed of parallel branches, each replicating one residual module. Extensive experiments on MNIST and CIFAR-10, spanning 6912 architectural variants, demonstrated near-identical accuracy between deep sequential and wide shallow (parallel) designs when the parameter count was held fixed. Generalization and robustness sometimes even favored the shallow form at moderate scale. Thus, for tasks where higher-order feature interactions are negligible, shallow architectures can be direct substitutes for deeper stacks, yielding substantial simplification of optimization and training (Bermeitinger et al., 2023).
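The correspondence is easy to see in code. Below is a minimal PyTorch sketch (module names and dimensions are illustrative, not taken from the cited work) contrasting a deep stack of residual blocks with the single wide layer of parallel branches obtained by the first-order truncation above; the two models have identical parameter counts and differ only in whether branches are composed sequentially or evaluated in parallel on the same input.

```python
import torch
import torch.nn as nn

class DeepResidualStack(nn.Module):
    """Sequential residual blocks: x_{l+1} = x_l + f_l(x_l)."""
    def __init__(self, dim, n_blocks, hidden):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
             for _ in range(n_blocks)]
        )

    def forward(self, x):
        for f in self.branches:
            x = x + f(x)          # each block sees the *running* representation x_l
        return x

class WideParallelLayer(nn.Module):
    """First-order (truncated) form: x_n ≈ x_0 + sum_l f_l(x_0)."""
    def __init__(self, dim, n_blocks, hidden):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
             for _ in range(n_blocks)]
        )

    def forward(self, x):
        return x + sum(f(x) for f in self.branches)   # all branches see the input x_0

deep = DeepResidualStack(dim=64, n_blocks=8, hidden=128)
shallow = WideParallelLayer(dim=64, n_blocks=8, hidden=128)
x = torch.randn(32, 64)
print(deep(x).shape, shallow(x).shape)   # torch.Size([32, 64]) for both
```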
2. Architectural Transformations: Folding, Truncation, and Fusion
Several transformation algorithms have been proposed to reduce network depth post hoc without reinitializing from scratch:
- Layer Folding learns an interpolation coefficient $\alpha$ for each activation, mixing the original nonlinearity with the identity. When $\alpha$ exceeds a threshold (usually $0.9$), the nonlinearity is replaced by the identity and the adjacent linear layers are collapsed algebraically ($W' = W_2 W_1$, $b' = W_2 b_1 + b_2$); this collapse step is sketched in the first code example after this list. Deep pretrained CNNs such as ResNet-20/32/44/56 and VGG-16/19 on CIFAR can be folded down to $8$–$10$ nonlinear layers with minimal accuracy drop, and folded MobileNetV2 and EfficientNet-Lite variants on ImageNet show noticeably lower latency at negligible accuracy cost (Dror et al., 2021).
- MSE-Optimal Layer Fusion (FuseInit): Instead of retraining from scratch, FuseInit replaces pairs of trained layers in a deep network by single layers that minimize the mean-squared error between the joint two-layer response and the fused layer's response; closed-form solutions are provided for the dense-dense, convolutional-dense, and convolutional-convolutional cases (a simplified, data-driven dense-dense version is sketched in the second code example after this list). Shallow architectures initialized by fusion can match or exceed their deeper “parent” in validation accuracy, and convergence is typically an order of magnitude faster (Ghods et al., 2020).
- Shallowing via Truncation and Advanced Supervised PCA (ASPCA): Deep networks can be truncated at an intermediate layer $k$, extracting the features computed up to that layer and projecting them via ASPCA, which maximizes between-class over within-class scatter in the feature space (a truncation-plus-supervised-projection pipeline is sketched in the third code example after this list). Empirically, $5$–$17$ layer truncated nets plus ASPCA preserve recognition functionality for small-group identification tasks (the “backyard dog problem”) at low error rates, with lower parameter count and inference latency on low-resource devices (Gorban et al., 2018).
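For the folding item above, the core algebraic step, collapsing two adjacent linear layers once the activation between them has been replaced by the identity, is straightforward. The following is a minimal PyTorch sketch of that step only (function name and layer sizes are illustrative); it is not the full fine-tuning procedure of Dror et al. (2021).

```python
import torch
import torch.nn as nn

def fold_linear_pair(l1: nn.Linear, l2: nn.Linear) -> nn.Linear:
    """Collapse l2(l1(x)) into one affine layer once the activation between
    them is the identity: W' = W2 @ W1, b' = W2 @ b1 + b2."""
    fused = nn.Linear(l1.in_features, l2.out_features)
    with torch.no_grad():
        fused.weight.copy_(l2.weight @ l1.weight)
        fused.bias.copy_(l2.weight @ l1.bias + l2.bias)
    return fused

# sanity check: identical outputs up to floating-point error
l1, l2 = nn.Linear(16, 32), nn.Linear(32, 8)
x = torch.randn(4, 16)
fused = fold_linear_pair(l1, l2)
print(torch.allclose(l2(l1(x)), fused(x), atol=1e-5))  # True
```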
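For the fusion item above, the sketch below illustrates the dense-dense case in a simplified, data-driven form: a single affine layer is fit by least squares to minimize the empirical MSE against the two-layer response on sampled inputs. This is only a stand-in for FuseInit's closed-form solutions; all weights and shapes here are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "trained" dense layers with a ReLU in between (random stand-in weights).
W1, b1 = rng.standard_normal((32, 16)), rng.standard_normal(32)
W2, b2 = rng.standard_normal((8, 32)), rng.standard_normal(8)
two_layer = lambda X: np.maximum(X @ W1.T + b1, 0.0) @ W2.T + b2

# Empirical MSE-optimal single affine layer: least squares against the joint
# response on sample inputs (a data-driven stand-in for the closed form).
X = rng.standard_normal((4096, 16))
Y = two_layer(X)
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])       # absorb the bias term
theta, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
W_fused, b_fused = theta[:-1].T, theta[-1]

fused = lambda Xnew: Xnew @ W_fused.T + b_fused
print(np.mean((fused(X) - Y) ** 2))   # residual MSE of the fused layer
```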
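For the truncation item above, the pipeline can be sketched as: cut a network at an intermediate layer, extract its features, and apply a supervised projection that favors between-class over within-class scatter. The code below uses scikit-learn's LinearDiscriminantAnalysis as a stand-in for ASPCA (it optimizes a related scatter criterion but is not the authors' algorithm), with a synthetic network and synthetic labels purely to show the shape of the pipeline.

```python
import torch
import torch.nn as nn
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Stand-in "deep" feature extractor; in practice this would be a pretrained net.
deep = nn.Sequential(*[nn.Sequential(nn.Linear(64, 64), nn.ReLU()) for _ in range(10)])
truncate_at = 5                               # keep features up to layer k = 5
trunk = deep[:truncate_at]

# Synthetic labelled data standing in for a small-group recognition task.
X = torch.randn(600, 64)
y = np.repeat(np.arange(3), 200)

with torch.no_grad():
    feats = trunk(X).numpy()

# Supervised projection favoring between-class over within-class scatter;
# LDA is used here as a stand-in for ASPCA.
proj = LinearDiscriminantAnalysis(n_components=2).fit(feats, y)
low_dim = proj.transform(feats)
print(low_dim.shape)                          # (600, 2)
```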
3. Expressivity and Optimization Limits of Shallow Networks
The expressive power of shallow models is sharply constrained for certain tasks:
- Depth Separation: For $d$-dimensional ball-indicator functions under radial distributions, shallower (single-nonlinearity) networks incur an irreducible approximation error unless their width is exponential in $d$. Deeper two-nonlinearity architectures (random-features style) can be learned to arbitrary accuracy in polynomial time via gradient descent, providing the first optimization-based separation result (Safran et al., 2021).
- Fractals and Hierarchical Targets: For self-similar (fractal) distributions, deep networks efficiently realize coarse-to-fine decision boundaries, whereas shallow nets require width exponential in the fractal depth to match the same error. Moreover, if the data distribution is concentrated on fine details, gradient-based optimization of both shallow and deep networks fails to escape high-error plateaus; shallow approximability of the coarse features is necessary for SGD trainability (Malach et al., 2019).
- Complex Boolean and Compositional Functions: Rosenblatt’s first theorem guarantees that any finite Boolean classification problem can be solved by a shallow three-layer network, but with exponential hidden layer size. Deep architectures exploit compositional structure, collapsing exponential shallow complexity into polynomial size and resource requirements (see travel-maze problem) (Kirdin et al., 2022).
4. Convergence Properties and Optimization Theory
Recent work establishes the robust trainability of shallow (two-layer) ReLU networks:
- Under Gaussian data and adversarial label regimes, with the two layers' weights trained by gradient descent at different rates, global convergence to zero loss is achieved for sample sizes bounded in terms of the hidden width $m$. The guarantee covers multi-rate scenarios (first layer fast, second layer fast, or balanced), with an exponential convergence rate, notably surpassing previous NTK-regime results. Empirical evidence finds that kernel eigenvalue stability persists even outside strict NTK bounds (Razborov, 2022).
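A minimal sketch of the analyzed setting, a two-layer ReLU network with both layers trained by plain (full-batch) gradient descent at different rates, is shown below using PyTorch parameter groups. The data, learning rates, and widths are placeholders, not the constants from the cited analysis.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, m, n = 20, 512, 200                       # input dim, hidden width, sample size
X = torch.randn(n, d)
y = torch.sign(torch.randn(n, 1))            # arbitrary (even adversarial) labels

model = nn.Sequential(nn.Linear(d, m), nn.ReLU(), nn.Linear(m, 1))
first, second = model[0], model[2]

# Full-batch gradient descent with a different rate per layer,
# mirroring the multi-rate regimes discussed above.
opt = torch.optim.SGD([
    {"params": first.parameters(),  "lr": 1e-2},   # first layer: faster
    {"params": second.parameters(), "lr": 1e-3},   # second layer: slower
])
loss_fn = nn.MSELoss()

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
print(float(loss))   # drives the training loss toward zero when m is large enough
```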
5. Practical Advantages, Hardware, and Generalization Trade-offs
Shallow neural architectures offer distinct deployment and training advantages:
- Optimization and Robustness: Shallow models avoid vanishing/exploding gradients inherent to deep stacks, and parallel-branch architectures exploit hardware parallelism, yielding lower single-batch latency and improved robustness via branchwise independent initializations (Bermeitinger et al., 2023).
- Compactness and Efficiency: Auxiliary-output and multi-way backpropagation methods enable considerably shallower models (e.g., the 44-layer “MwResNet-44” vs. the 110-layer ResNet-110) to match or outperform much deeper counterparts, also outstripping pruned and NAS-optimized models in test error and parameter efficiency (Guo et al., 2016); a generic auxiliary-head training sketch appears after this list.
- Graph and Non-Euclidean Domains: Multipath shallow GCNs aggregate multiple short branches rather than a deep stack, avoiding trainability issues and over-smoothing while outperforming deep and residual GCNs in benchmark node-classification accuracy, typically with faster convergence (Das et al., 2021); a schematic multipath forward pass is sketched after this list.
- Knowledge Distillation without Deep Teachers: Free-direction peer distillation frameworks train multiple shallow networks collaboratively using instance-wise RL-based direction selectors, yielding accuracy gains over both standalone shallow models and classical deep-teacher distillation strategies (Feng et al., 2023); a simplified collaborative-training skeleton (without the RL selector) is sketched after this list.
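The auxiliary-output idea referenced above can be illustrated generically: attach a classifier head after each stage and backpropagate the sum of all head losses, so that early layers receive supervision through short paths. The sketch below is a generic illustration of that pattern (class and dimension names are invented), not the exact MwResNet construction.

```python
import torch
import torch.nn as nn

class ShallowNetWithAuxHeads(nn.Module):
    """Backbone with an auxiliary classifier after each stage; every head
    receives supervision, so gradients reach early layers through short paths."""
    def __init__(self, dim=64, n_stages=3, n_classes=10):
        super().__init__()
        self.stages = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(n_stages)]
        )
        self.heads = nn.ModuleList([nn.Linear(dim, n_classes) for _ in range(n_stages)])

    def forward(self, x):
        logits = []
        for stage, head in zip(self.stages, self.heads):
            x = stage(x)
            logits.append(head(x))
        return logits                      # one prediction per stage

model = ShallowNetWithAuxHeads()
x, y = torch.randn(32, 64), torch.randint(0, 10, (32,))
loss = sum(nn.functional.cross_entropy(l, y) for l in model(x))   # multi-way loss
loss.backward()
```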
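For the multipath GCN item above, the following sketch aggregates several short propagation branches (1-hop, 2-hop, ...) over a symmetrically normalized adjacency and classifies from their sum. It is a schematic forward pass under standard GCN propagation, not the cited paper's exact architecture.

```python
import torch
import torch.nn as nn

def normalized_adjacency(A: torch.Tensor) -> torch.Tensor:
    """Symmetrically normalized adjacency with self-loops: D^{-1/2}(A+I)D^{-1/2}."""
    A_hat = A + torch.eye(A.shape[0])
    d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
    return d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]

class MultipathShallowGCN(nn.Module):
    """Aggregates several short branches (1-hop, 2-hop, ...) instead of one deep stack."""
    def __init__(self, in_dim, hidden, n_classes, n_paths=3):
        super().__init__()
        self.branch_weights = nn.ModuleList(
            [nn.Linear(in_dim, hidden) for _ in range(n_paths)]
        )
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, X, A_norm):
        outs = []
        for k, lin in enumerate(self.branch_weights, start=1):
            H = lin(X)
            for _ in range(k):            # k-hop propagation for the k-th branch
                H = A_norm @ H
            outs.append(torch.relu(H))
        return self.out(sum(outs))        # aggregate branches, then classify

# toy undirected graph with 5 nodes
A = (torch.rand(5, 5) > 0.5).float()
A = ((A + A.T) > 0).float()
model = MultipathShallowGCN(in_dim=8, hidden=16, n_classes=3)
print(model(torch.randn(5, 8), normalized_adjacency(A)).shape)   # torch.Size([5, 3])
```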
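For the peer-distillation item above, the sketch below trains two shallow peers jointly with cross-entropy plus bidirectional KL distillation on each other's detached soft targets. The instance-wise RL-based direction selector of the cited framework is deliberately omitted, so this is only a simplified collaborative-training skeleton on placeholder data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_peer():
    return nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

peer_a, peer_b = make_peer(), make_peer()
opt = torch.optim.Adam(list(peer_a.parameters()) + list(peer_b.parameters()), lr=1e-3)

x, y = torch.randn(128, 32), torch.randint(0, 10, (128,))
for step in range(100):
    la, lb = peer_a(x), peer_b(x)
    ce = F.cross_entropy(la, y) + F.cross_entropy(lb, y)
    # Bidirectional distillation: each peer matches the other's (detached) soft targets.
    kd = F.kl_div(F.log_softmax(la, -1), F.softmax(lb, -1).detach(), reduction="batchmean") \
       + F.kl_div(F.log_softmax(lb, -1), F.softmax(la, -1).detach(), reduction="batchmean")
    loss = ce + kd
    opt.zero_grad()
    loss.backward()
    opt.step()
```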
6. Limitations, Loss Landscape, and Depth-Embedding
Structural and optimization analysis reveals intrinsic limitations for shallow nets:
- Critical Points and Saddles: By the "embedding principle," all critical points (minima, saddles, plateaus) of shallow networks are embedded as critical manifolds in the loss landscapes of deeper ones. Training on deeper architectures often lifts shallow local minima into escapable strict saddles, which explains the empirically easier optimization of deep nets and the stagnation risk in truly shallow ones. More data shrinks these manifolds, further accelerating training (Bai et al., 2022).
- Width Explosion in Flattening: Any deep ReLU net can be rewritten as a single hidden-layer “max-out” network, but the required number of units grows exponentially with the depth of the original network (An et al., 2017). Constructive proofs further show that any deep ReLU model is functionally representable by a three-layer network with extended-real weights (allowing $\pm\infty$), but this entails an exponential width blow-up and untrainable weights, making such conversions intractable except for analytic or explanatory purposes (Villani et al., 2023).
7. Practical Construction: Greedy Architectures and Initialization
Greedy algorithms and deterministic network construction further leverage shallower forms:
- Greedy Shallow Networks use ridgelet-transform-based dictionaries and pursuit algorithms to select only the most active neurons, producing highly compact single-hidden-layer ReLU models with theoretical approximation-rate guarantees and observed exponential decay of the residual error. Greedy initializations consistently outperform random starts for small datasets and structured targets (Dereventsov et al., 2019).
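A simplified version of the greedy construction can be sketched with NumPy: build a dictionary of ReLU ridge atoms (here with random directions rather than the paper's ridgelet-based dictionary), repeatedly pick the atom most correlated with the current residual, and refit all selected coefficients by least squares (an orthogonal greedy step). Problem sizes and the target function are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: 1-D regression samples.
n, dict_size, n_select = 200, 2000, 15
x = np.linspace(-1.0, 1.0, n)[:, None]
y = np.sin(3.0 * x[:, 0]) + 0.5 * x[:, 0]

# Dictionary of ReLU ridge atoms relu(w*x + b) with random (w, b),
# a simplified stand-in for the ridgelet-based dictionary.
w = rng.uniform(-4.0, 4.0, dict_size)
b = rng.uniform(-4.0, 4.0, dict_size)
D = np.maximum(x * w + b, 0.0)                    # shape (n, dict_size)
D_norm = D / (np.linalg.norm(D, axis=0, keepdims=True) + 1e-12)

selected, residual = [], y.copy()
for _ in range(n_select):
    # Greedy step: atom most correlated with the current residual.
    idx = int(np.argmax(np.abs(D_norm.T @ residual)))
    if idx not in selected:
        selected.append(idx)
    # Orthogonal step: refit coefficients of all selected atoms by least squares.
    coef, *_ = np.linalg.lstsq(D[:, selected], y, rcond=None)
    residual = y - D[:, selected] @ coef

print(len(selected), float(np.linalg.norm(residual)))   # compact net, shrinking residual
```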
In summary, the body of research defines both the reach and the boundaries of shallower neural networks. Shallow architectures, when properly constructed via parallelization, folding, truncation, or collaborative training, match deep models under specific representational constraints and task types, yielding substantial efficiency and hardware benefits. However, depth separation results, critical landscapes, and complexity-theoretic analyses establish unavoidable limits to expressive power, learnability, and practical flattening. The choice between shallow and deep neural networks should therefore be guided by task complexity, resource constraints, data structure, and optimization regimen.