Network Depth and Width

Updated 1 April 2026

Network depth and width are fundamental parameters that define neural network architecture, with depth indicating sequential transformations and width representing the number of neurons per layer.
Trade-offs between depth and width determine the network's ability to approximate functions, optimize training landscapes, and achieve universal approximation under specific minimal width bounds.
A balanced allocation of depth and width is crucial for enhancing training performance, memorization capacity, and robustness while mitigating issues like representation collapse in overparameterized models.

Neural network depth and width are the principal axes of architectural scaling in feedforward and deep learning models, controlling the arrangement and capacity of artificial neural networks. Depth refers to the number of sequential transformations (layers), while width indicates the maximal number of units (neurons, channels, or waveguides) in any layer. These two parameters jointly determine expressivity, optimization landscape, memorization capacity, generalization regime, and robustness to perturbations.

1. Universal Approximation, Minimal Width, and Depth–Width Substitution

Foundational universal approximation theorems establish that increasing either depth or width enables a network to approximate arbitrary functions from suitable classes. The classic result for feedforward ReLU networks of input dimension $n$ states that a single hidden layer ($d=1$) of unbounded width ($w\rightarrow\infty$) suffices for uniform approximation of any continuous function $f:[0,1]^n\to\mathbb{R}$ in the $\|\cdot\|_\infty$ norm (Nosirov et al., 2019).
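As a minimal sketch of the single-hidden-layer statement in one input dimension, the block below builds a piecewise-linear interpolant of a target function as a sum of shifted ReLU units and measures how the uniform error shrinks as the width grows. The helper names (`build_one_hidden_layer`, `evaluate`, `max_error`), the uniform knot placement, and the test function are illustrative assumptions, not a construction taken from the cited works.

```python
def relu(z):
    """Rectified linear unit."""
    return z if z > 0.0 else 0.0


def build_one_hidden_layer(f, width):
    """Return (bias, knots, coeffs) for a 1-hidden-layer ReLU net on [0, 1].

    The net is bias + sum_k coeffs[k] * relu(x - knots[k]); it interpolates f
    at the uniformly spaced points k / width (an assumed knot placement).
    """
    h = 1.0 / width
    knots = [k * h for k in range(width)]
    # Slope of the interpolant on each interval [k*h, (k+1)*h].
    slopes = [(f((k + 1) * h) - f(k * h)) / h for k in range(width)]
    # Each hidden unit contributes the change of slope at its knot.
    coeffs = [slopes[0]] + [slopes[k] - slopes[k - 1] for k in range(1, width)]
    return f(0.0), knots, coeffs


def evaluate(net, x):
    """Evaluate the network returned by build_one_hidden_layer at x."""
    bias, knots, coeffs = net
    return bias + sum(c * relu(x - t) for t, c in zip(knots, coeffs))


def max_error(f, width, samples=2001):
    """Largest absolute error between f and the network on a uniform grid."""
    net = build_one_hidden_layer(f, width)
    grid = [i / (samples - 1) for i in range(samples)]
    return max(abs(f(x) - evaluate(net, x)) for x in grid)
```

For a smooth target such as `lambda x: x * x`, calling `max_error` with widths 4, 16, and 64 shows the error dropping as the hidden layer widens, which is the one-dimensional picture behind the width-driven approximation result.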

However, the minimal width for universality is tightly constrained. Hanin (2017) proved that for ReLU networks, width $w\geq n+2$ suffices for universal approximation as depth $d\to\infty$, while $w\leq n$ precludes even $L^1$ approximation for generic functions (Lu et al., 2017). There is a phase transition: networks of width $w\leq n$ can only represent degenerate functions (constant along some direction), and width $w\geq n+4$ universally suffices in $L^1$ (Lu et al., 2017).

Empirically, Nosirov and Hokanson confirmed this minimal width bound: once the width crosses the threshold, error decay with increasing depth shows a “phase transition,” matching Hanin's theoretical threshold (Nosirov et al., 2019).

Regime	Minimal Width for Universal Approximation
ReLU, $d\to\infty$	$w\geq n+2$
Depth-bounded, arbitrary width	No bound; width $w\to\infty$ required
Width-bounded, arbitrary depth ($L^1$)	$w\geq n+4$
Width $w\leq n$	No universal approximation possible
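The width thresholds quoted in this section can be collected into a small lookup, sketched below. The function name `width_regime` and its return labels are hypothetical, and the case $w=n+1$ is deliberately reported as not covered, because the bounds stated here ($w\geq n+2$ suffices, $w\leq n$ fails) do not settle it.

```python
def width_regime(width, input_dim):
    """Classify a ReLU network width against the bounds quoted in the text.

    Returns one of three illustrative labels:
      "degenerate"  : width <= n, no universal approximation
      "universal"   : width >= n + 2, universal as depth grows
      "not covered" : width == n + 1, not decided by the quoted bounds
    """
    if width <= input_dim:
        return "degenerate"
    if width >= input_dim + 2:
        return "universal"
    return "not covered"
```

For example, with input dimension 3, `width_regime(3, 3)` falls in the degenerate regime while `width_regime(5, 3)` meets the sufficient width.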

Within these constraints, depth and width are, to some extent, interchangeable for expressive power: increased depth in sufficiently wide networks can recover the full function class, and vice versa. This quasi-equivalence is formalized via explicit network transforms that construct, up to a prescribed approximation error, a wide network for any deep network and vice versa, at the price of a sometimes substantial (poly-exponential) increase in width or depth (Fan et al., 2020).

2. Depth–Width Trade-offs and Expressivity Bounds

The depth–width trade-off quantifies how much of one resource can be substituted for the other. For ReLU networks and classical semi-algebraic architectures, the number of affine (linear) regions produced in the input space is bounded above by a quantity that grows polynomially in the width $w$ and exponentially in the depth $d$ (Fan et al., 2023). This combinatorial explosion underlies the depth efficiency phenomenon: for certain function classes, deep narrow networks can represent high-frequency (or oscillatory, or compositionally structured) functions that no shallow wide network can, unless its width (or parameter count) is exponential in the target accuracy (Safran et al., 2016, Chatziafratis et al., 2020, Bu et al., 2020).
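The growth of linear regions with depth can be illustrated with a standard textbook construction, sketched here under simplifying assumptions: a one-dimensional input, a tent map realized by two ReLU units per layer, and piece counting on a dyadic grid. The names `tent`, `deep_narrow`, and `count_linear_pieces` are illustrative and do not come from the cited papers.

```python
def relu(z):
    """Rectified linear unit."""
    return z if z > 0.0 else 0.0


def tent(x):
    """Tent map on [0, 1] written as one ReLU layer of width 2.

    Equals 2x on [0, 0.5] and 2 - 2x on [0.5, 1].
    """
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def deep_narrow(x, depth):
    """Compose the width-2 tent layer `depth` times."""
    for _ in range(depth):
        x = tent(x)
    return x


def count_linear_pieces(depth):
    """Count linear pieces of the depth-fold composition on [0, 1].

    The grid step 2**-(depth + 1) places two grid cells inside every linear
    piece, so counting slope changes between neighbouring cells is exact
    (all values involved are dyadic rationals, hence exact in floating point).
    """
    steps = 2 ** (depth + 1)
    h = 1.0 / steps
    ys = [deep_narrow(i * h, depth) for i in range(steps + 1)]
    slopes = [(ys[i + 1] - ys[i]) / h for i in range(steps)]
    return 1 + sum(1 for a, b in zip(slopes, slopes[1:]) if a != b)
```

Each added layer doubles the number of pieces, so a network with 2 units per layer and depth $d$ reaches $2^d$ pieces using only $2d$ hidden units. A single hidden layer of $w$ ReLU units on a scalar input produces at most $w+1$ pieces, so matching the same function in one layer needs on the order of $2^d$ units.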

Depth-separation results prove that for some explicit functions (ball indicators, $L_1$-radial functions, smooth $C^2$ functions), shallow networks cannot achieve nontrivial approximation error unless their width is exponentially large, whereas networks of moderate depth and only polynomial width suffice (Safran et al., 2016). These results are further refined via dynamical systems: the minimal width needed to fit a function of a given topological entropy at a given depth is bounded below by a quantity that increases with the entropy and decreases with the depth (Bu et al., 2020). Iteration of maps with odd-prime period forces an exponential separation in the minimal required width versus allowed depth (Chatziafratis et al., 2020).

Width-efficiency, by contrast, is at most polynomial: for any wide shallow ReLU network, matching it with a deep narrow network only requires a polynomial increase in depth (and total parameter count). Conversely, replicating a deep network’s function with a shallow network can require exponential width (Lu et al., 2017, Vardi et al., 2022).

Trade-off Type	Lower Bound on Overhead	Scaling Regime
Depth-efficiency	Exponential in depth	Functionally deep
Width-efficiency	Polynomial in width	Functionally wide

A constructive demonstration is provided by Vardi et al. (2022): any target network on $n$-dimensional input can be approximated up to error $\epsilon$ by a width-$O(n)$ network of polynomially larger depth (relative to the original width and depth), whereas no such result holds if the roles of width and depth are swapped.

3. Impact on Optimization, Training Landscape, and Memorization Capacity

Increasing depth and width affects both the geometry of the optimization landscape and the network's capacity to memorize finite datasets. Kawaguchi et al. show that, without any overparameterization assumption, increasing either depth or width systematically improves the worst-case training loss value at differentiable local minima for squared error, as each added neuron or layer increases the dimensionality of the network's local linear span (Kawaguchi et al., 2018). In the limit, as depth or width grows sufficiently, all local minima are globally optimal (interpolation regime).

Memorization capacity is characterized by a relation that ties width and depth to the number of data points and their minimal separation (Yang et al., 2026). This balance can be allocated flexibly: fixing the width determines the depth required, and fixing the depth determines the width required. Adding depth or width beyond this optimal allocation yields no additional memorization benefit.

4. Statistical and Generalization Properties in the Large-Width and Depth Limits

In the regime of very large width and/or depth, neural networks often exhibit emergent statistical properties, most notably the convergence to Gaussian processes and related kernel regimes.

For plain fully-connected networks, the infinite-width limit at fixed depth leads to neural network Gaussian processes (NNGPs), with fixed, data-independent kernels. However, keeping depth fixed and sending width to infinity eliminates feature learning: variability in representations and adaptability to the dataset vanish (Pleiss et al., 2021, Zhang et al., 2021). Increasing width beyond the necessary capacity smooths out model non-Gaussianity, flattening adaptive tails in the posterior.
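The data-independent kernel of the infinite-width limit can be made concrete with the standard arc-cosine recursion for ReLU activations, sketched below. The function name, the default weight variance `sigma_w2=2.0`, the zero bias variance, and the convention that `depth` counts hidden ReLU layers are all assumptions chosen for illustration; inputs are assumed to be nonzero vectors of equal length.

```python
import math


def nngp_relu_kernel(x, y, depth, sigma_w2=2.0, sigma_b2=0.0):
    """Return (K(x,x), K(x,y), K(y,y)) after `depth` hidden ReLU layers.

    Uses the closed-form Gaussian expectation E[relu(u) relu(v)] for
    jointly Gaussian (u, v), applied layer by layer. The result depends
    only on the inputs and the variances, never on trained weights.
    """
    n = len(x)

    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))

    # Kernel of the first pre-activations.
    kxx = sigma_b2 + sigma_w2 * dot(x, x) / n
    kyy = sigma_b2 + sigma_w2 * dot(y, y) / n
    kxy = sigma_b2 + sigma_w2 * dot(x, y) / n

    for _ in range(depth):
        norm = math.sqrt(kxx * kyy)
        # Clamp to guard against rounding slightly outside [-1, 1].
        cos_theta = max(-1.0, min(1.0, kxy / norm))
        theta = math.acos(cos_theta)
        expect = norm * (math.sin(theta) + (math.pi - theta) * cos_theta) / (2.0 * math.pi)
        kxy = sigma_b2 + sigma_w2 * expect
        # On the diagonal theta = 0, so E[relu(u)^2] = K / 2.
        kxx = sigma_b2 + sigma_w2 * kxx / 2.0
        kyy = sigma_b2 + sigma_w2 * kyy / 2.0
    return kxx, kxy, kyy
```

With these defaults the diagonal entries are preserved from layer to layer, while the correlation between two distinct inputs increases with depth. No training data enters the recursion, which is the sense in which the fixed-depth infinite-width kernel cannot adapt to the dataset.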

Extending to the double limit, in which depth and width tend to infinity together, yields different behaviors depending on architectural scaling and skip-connection structure. In fully-connected networks, the variance of the neural tangent kernel grows exponentially in the ratio $d/w$ (depth/width), and its dynamics during training become nontrivial in this regime, allowing data-dependent feature learning even in the “lazy” kernel regime (Hanin et al., 2019).

For residual networks (ResNets) with branch scaling $1/\sqrt{d}$, both theoretical analysis and numerical evidence establish that the limits of infinite depth and infinite width commute: the final covariance/kernel structure is independent of the order in which width and depth are sent to infinity (Hayou et al., 2023, Hayou, 2023). The neural outputs become Gaussian with variance and covariance given by the closed-form solution of an ODE, and empirical Wasserstein distances to this limit shrink as width and depth grow. This principle greatly simplifies the analysis and scaling of modern deep architectures.
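Why the size of the branch scaling matters can be seen in a deliberately simplified model, sketched below. It assumes a linear residual branch $x_{l+1} = x_l + \alpha W_l x_l$ with independent zero-mean weights of variance $1/w$, for which the expected squared norm is multiplied by $1+\alpha^2$ at every layer; the choice $\alpha = 1/\sqrt{d}$ is taken here as the depth-dependent scaling. The function name `variance_growth` is hypothetical, and the model ignores nonlinearities.

```python
def variance_growth(depth, branch_scale):
    """Expected squared-norm growth factor of a linear residual stack.

    Assumes x_{l+1} = x_l + branch_scale * W_l x_l with i.i.d. zero-mean
    weights of variance 1/width, so each layer multiplies the expected
    squared norm by (1 + branch_scale**2).
    """
    variance = 1.0
    for _ in range(depth):
        variance *= 1.0 + branch_scale ** 2
    return variance
```

With `branch_scale = 1.0` the factor is $2^d$ and explodes with depth, whereas with `branch_scale = depth ** -0.5` it equals $(1+1/d)^d$, which stays below $e$ for every depth. A bounded, depth-independent limit of this kind is a prerequisite for a well-defined joint limit in depth and width.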

5. Task-Dependent Behavior and Practical Implications

Depth and width influence not only capacity and optimization, but also the learned representations and task performance. In self-attention architectures such as transformers, theoretically predicted transitions exist between depth-efficiency and width-efficiency, controlled by a depth threshold that grows logarithmically with the model hidden size (Levine et al., 2020). For budgets up to the scale of LLMs, optimal performance is achieved by carefully balancing depth and width: architectures that are too deep and narrow, or too shallow and wide, are suboptimal.

In conventional convolutional or residual networks, as depth or width increases beyond a critical capacity threshold, the internal representations enter a collapsed “block structure” where successive layers' outputs align nearly perfectly (as revealed by centered kernel alignment, CKA) (Nguyen et al., 2020). This overparameterization saturates expressivity, and further increases in depth or width become redundant. Wide networks tend to propagate and accentuate global, scene-level principal components, while deep networks better distinguish fine-grained structure. Overparameterized models trained on limited data collapse representations along the principal direction, substantiating the need to tune network scaling both to task complexity and dataset size.
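The similarity measure behind the block-structure observation can be sketched with the commonly used linear form of centered kernel alignment, $\mathrm{CKA}(X,Y)=\|Y^\top X\|_F^2/(\|X^\top X\|_F\,\|Y^\top Y\|_F)$ on column-centered activation matrices. The block below is a plain-Python illustration on small nested lists; the function names are hypothetical, and the cited study's exact estimator may differ (this is an assumption).

```python
import math


def center_columns(m):
    """Subtract the column mean from every entry (rows = examples)."""
    rows, cols = len(m), len(m[0])
    means = [sum(r[j] for r in m) / rows for j in range(cols)]
    return [[r[j] - means[j] for j in range(cols)] for r in m]


def cross_gram_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for row-aligned matrices a and b."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            s = sum(ra[i] * rb[j] for ra, rb in zip(a, b))
            total += s * s
    return total


def linear_cka(x, y):
    """Linear CKA between two activation matrices with the same row count."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_gram_frobenius_sq(yc, xc)
    denominator = math.sqrt(
        cross_gram_frobenius_sq(xc, xc) * cross_gram_frobenius_sq(yc, yc)
    )
    return numerator / denominator
```

Applied to the activations of two layers on the same batch of examples, a value near 1 indicates that the layers encode nearly the same representation up to rotation and isotropic scaling. A run of consecutive layers with pairwise values near 1 is what appears as a block in a layer-by-layer similarity heatmap.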

From a robustness perspective, width and depth have nuanced, regime-dependent effects. Width improves average-case robustness in overparameterized regimes but worsens it in underparameterized ones. The effect of depth on robustness varies strongly with initialization scheme and training regime: under certain initializations (LeCun), depth exponentially increases robustness, while under others (He, NTK), deeper models actually perform worse for robustness (Zhu et al., 2022).

6. Special Network Classes and Engineering Trade-offs

In programmable photonic networks, the degrees of freedom required for universal matrix transformation can be allocated as either width (number of waveguides) or depth (number of programmable layers), with scaling laws that relate the target matrix size to the number of active layers in the unitary case (Markowitz et al., 2025). A minimal depth paired with a suitably scaled width suffices, underscoring the generality of depth–width trade-offs beyond electronic neural network models.

In feedforward ReLU nets, architectural hierarchies ("height," via intra-layer links among neurons) further enhance expressivity: a network of given width, depth, and height achieves a larger linear region count than the standard bound for networks without intra-layer links (Fan et al., 2023). This offers another axis for model design: serializing nonlinearity not just across layers (depth), but within them as intra-layer hierarchy.

7. Concluding Synthesis

Depth and width are, in a precise sense, quasi-equivalent for expressivity at the cost of potentially large parameter overhead (exponential for simulating depth with width, only polynomial for width with depth) (Fan et al., 2020, Vardi et al., 2022). For memorization, training loss minima, and function class approximation, a flexible allocation of resources to either dimension is possible, but with inherent architectural and computational implications. However, depth is structurally more effective than width in improving expressivity due to the possibility of depth efficiency theorems with exponential separation, while width efficiency is at best polynomial (Lu et al., 2017, Vardi et al., 2022).

Optimal network design, under any fixed parameter or compute budget, requires tuning depth and width jointly, guided by problem structure, data size, and desired trade-offs in expressivity, generalization, and robustness. Newer directions include understanding how engineering trade-offs exploit architectural symmetries, commutativity of double scaling limits, and higher-dimensional extensions such as intra-layer hierarchies.
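Joint tuning under a fixed parameter budget can be made concrete by enumerating the (width, depth) pairs that fit. The sketch below assumes a plain fully connected network with biases, `depth` hidden layers of equal width, and a scalar output by default; the function names and the candidate grids are illustrative only.

```python
def mlp_param_count(input_dim, width, depth, output_dim=1):
    """Weights plus biases of an MLP with `depth` hidden layers of `width`."""
    first = input_dim * width + width              # input -> first hidden layer
    hidden = (depth - 1) * (width * width + width) # hidden -> hidden layers
    last = width * output_dim + output_dim         # last hidden -> output
    return first + hidden + last


def configs_within_budget(input_dim, budget, widths, depths):
    """List (width, depth, params) for every candidate that fits the budget."""
    fits = []
    for w in widths:
        for d in depths:
            params = mlp_param_count(input_dim, w, d)
            if params <= budget:
                fits.append((w, d, params))
    return fits
```

Because the hidden-to-hidden term grows quadratically in width but only linearly in depth, doubling the width costs roughly four times the parameters while doubling the depth costs roughly twice. This asymmetry is one reason a fixed budget forces an explicit choice between the two axes.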

Depth and width remain central axes of architectural and theoretical advances, and their interplay sets the stage for emergent behaviors from universal approximation limits to representation collapse and double-scaling criticality across deep learning models.