Width vs. Depth in Neural Networks

Updated 4 March 2026

Width vs. depth in neural networks is a trade-off defining how unit count per layer and the number of layers affect expressivity and approximation efficiency.
Recent studies show that increased depth can yield exponential gains in representational power and efficiency, while width enhances optimization stability and supports large-batch training.
Optimal design balances depth and width based on task demands, with guidelines advocating deep, moderately wide networks to avoid issues like representation collapse.

Width versus depth describes how the architectural parameters of neural networks—number of units per layer (width) and number of stacked layers (depth)—impact expressive power, approximation error, learning dynamics, and practical deployment. While classical universal approximation results show that both extreme width and extreme depth can, in principle, approximate arbitrary functions, contemporary theoretical and empirical studies reveal strong asymmetries and nuanced interactions between these two scaling axes. Depth often confers exponential gains in efficiency and expressivity that are unattainable by merely increasing width, while width governs aspects like optimization stability, parallelizability, and large-batch training. The optimal balance is highly task- and architecture-dependent, with precise quantitative guidance contextualized by recent advances across feedforward, convolutional, self-attention, and graph-based models.

1. Theoretical Frameworks: Approximation, Expressivity, and Depth-Width Scaling

The expressive power of neural networks arises from the ability to represent complex target functions with finite resources. Foundational results for ReLU networks establish that depth-2 networks of sufficient width are universal approximators. Width-$(d+4)$ ReLU networks (where $d$ is the input dimension) are also universal approximators given sufficient depth, but a sharp phase transition occurs below this threshold: networks of width $d$ or less cannot, in general, approximate arbitrary functions (Lu et al., 2017). However, the efficiency of approximation (how network size must scale with the desired error) is much more strongly affected by depth.

Key results quantify approximation error as a function of both width ($N$) and depth ($L$). For Hölder functions, recent constructions achieve

$E(N,L) = O(N^{-\alpha\sqrt{L}})$

with width $\max\{d,5N+13\}$ and depth $64dL+3$, where $\alpha$ is the Hölder exponent. Here, depth $L$ appears under a square root in the exponent, leading to a "root-exponential" improvement with moderate increases in depth. For fixed width, quadrupling $L$ squares the error bound (doubling $L$ raises it to the power $\sqrt{2}$), whereas at fixed depth the same gain requires squaring the width $N$, a pronounced asymmetry (Shen et al., 2020). Polynomial dependence on the input dimension $d$ enters only as a prefactor.
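To make the rate concrete, the following minimal sketch evaluates only the rate term $N^{-\alpha\sqrt{L}}$ from the bound above. The constant and the $d$-dependent prefactor are omitted, and the function names `error_bound` and `width_to_match` are illustrative, not taken from the cited work.

```python
import math


def error_bound(width_n: float, depth_l: float, alpha: float) -> float:
    """Rate term N^(-alpha * sqrt(L)); constants and the d prefactor are omitted."""
    return width_n ** (-alpha * math.sqrt(depth_l))


def width_to_match(width_n: float, depth_l: float, new_depth_l: float) -> float:
    """Width N' that, at the original depth parameter L, reproduces the rate
    reached by width N at the larger depth parameter new_depth_l."""
    return width_n ** math.sqrt(new_depth_l / depth_l)


if __name__ == "__main__":
    n, l, alpha = 10, 4, 1.0
    base = error_bound(n, l, alpha)
    deeper = error_bound(n, 4 * l, alpha)   # quadrupled depth parameter
    wider = width_to_match(n, l, 4 * l)     # width needed instead of depth
    print(f"base={base:.2e} deeper={deeper:.2e} equivalent width={wider:.0f}")
```

Under this expression, quadrupling $L$ squares the rate term, and reaching the same value at fixed $L$ requires replacing $N$ by $N^2$.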

Depth separation results reinforce this finding: certain "natural" functions (e.g., indicators of high-dimensional balls) provably require exponential width to approximate using shallow networks, but can be captured by small-width, moderate-depth networks (Safran et al., 2016, Vardi et al., 2022). Conversely, any wide, shallow network can be simulated to arbitrary precision by a deep, narrow network, with only a polynomial increase in parameter count and depth, but not vice versa (Vardi et al., 2022).
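A simple way to see depth separation in one dimension is the classic sawtooth construction, used here as a textbook illustration and not as the ball-indicator construction from the cited papers. A two-unit ReLU layer implements a tent map on $[0,1]$; composing it $k$ times gives a function with $2^k$ monotone linear pieces using only $2k$ hidden units, whereas a one-hidden-layer ReLU network with $m$ units on a scalar input has at most $m+1$ linear pieces. The helper names below are hypothetical.

```python
def relu(x: float) -> float:
    return max(0.0, x)


def tent(x: float) -> float:
    """Tent map on [0, 1] built from two ReLU units: 2*relu(x) - 4*relu(x - 0.5)."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def deep_sawtooth(x: float, depth: int) -> float:
    """Apply the width-2 tent layer `depth` times in sequence."""
    for _ in range(depth):
        x = tent(x)
    return x


def count_monotone_pieces(depth: int, grid: int = 1 << 12) -> int:
    """Count maximal monotone segments of the composed map on a dyadic grid over [0, 1].

    The grid is dyadic so that all breakpoints (multiples of 2**-depth) fall on
    grid points for depth <= 12.
    """
    ys = [deep_sawtooth(i / grid, depth) for i in range(grid + 1)]
    pieces, prev_sign = 0, 0
    for a, b in zip(ys, ys[1:]):
        sign = (b > a) - (b < a)
        if sign != 0 and sign != prev_sign:
            pieces += 1
            prev_sign = sign
    return pieces


if __name__ == "__main__":
    for k in (1, 3, 5):
        print(k, count_monotone_pieces(k))
```

Matching the depth-$k$ sawtooth with a single hidden layer would therefore need on the order of $2^k$ units, which is the kind of exponential gap the separation results formalize.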

Topological entropy arguments further formalize that the "complexity" (topological entropy) $h$ a network can realize satisfies $h \le O(L \log N)$ for a depth-$L$, width-$N$ network, and matching exponential lower bounds show that width must scale as $\exp(\Omega(h/L))$ to realize functions of entropy $h$ with fixed depth, so depth is exponentially more efficient for representing high-entropy functions (Bu et al., 2020).

2. Empirical Findings: Representation Diversity, Block Structure, and Task-Dependent Effects

Empirical studies using modern deep convolutional architectures reveal that both greater width and greater depth increase raw accuracy, but they yield distinct representational effects. As model capacity is increased, "block structures" appear in hidden representations: large contiguous sequences of layers propagate an almost identical dominant principal component, as detected by Centered Kernel Alignment (CKA) and PCA metrics (Nguyen et al., 2020). This behavior emerges for both wide and deep overparameterized networks but is tied to the capacity/data ratio—suggesting that redundancy or "representation collapse" can be triggered by excessive scaling along either axis.
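For readers unfamiliar with CKA, the sketch below computes plain linear CKA between two activation matrices (rows are examples, columns are features). It is a dependency-free illustration and an assumption about the simplest variant; the cited study may use a different estimator (for example, a minibatch or debiased one). The function names are hypothetical.

```python
import math


def center_columns(m):
    """Subtract the per-column mean from a row-major list of lists."""
    rows, cols = len(m), len(m[0])
    means = [sum(r[j] for r in m) / rows for j in range(cols)]
    return [[r[j] - means[j] for j in range(cols)] for r in m]


def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for two matrices with the same number of rows."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            dot = sum(ra[i] * rb[j] for ra, rb in zip(a, b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered features."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_gram_sq_norm(yc, xc)
    denominator = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    return numerator / denominator


if __name__ == "__main__":
    layer_a = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
    layer_b = [[2.0 * u, 2.0 * v] for u, v in layer_a]  # rescaled copy of layer_a
    print(linear_cka(layer_a, layer_b))
```

A block structure would show up as a contiguous group of layers whose pairwise CKA values are all close to 1.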

Systematic differences are observed in per-class accuracy distribution: for matched-parameter, matched-accuracy networks, wide and deep models often make different errors on different data subsets. For example, wide models outperform on "scenes" classes, while deep models excel in "consumer goods" classes. This indicates that width and depth can induce functionally distinct solution modes, especially in the late layers, even when global metrics (top-1 accuracy) are indistinguishable.

A practical implication is that model compression is facilitated by depth-induced redundancy: layers contained within block structures can be pruned with little accuracy loss, providing a rationale for automated slimming via structural analyses (Nguyen et al., 2020).

3. Optimization, Training Dynamics, and Large-Batch Regimes

Depth and width exert different effects on optimization and the training process. Theoretical and experimental analyses show that, for a fixed parameter budget, wider networks are more amenable to large-batch stochastic gradient descent (SGD). This is attributed to higher gradient diversity—wider layers produce more decorrelated sample gradients, enabling larger batch sizes before convergence slow-down occurs. In contrast, deeper networks with narrow layers experience rapid collapse of gradient diversity, limiting scalable training (Chen et al., 2018).
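One common definition of gradient diversity is the ratio of the summed squared norms of per-sample gradients to the squared norm of their sum. The sketch below assumes that definition (the cited analysis may use a variant) and takes per-sample gradients as plain lists of floats; `gradient_diversity` is an illustrative name.

```python
def gradient_diversity(per_sample_grads):
    """Return sum_i ||g_i||^2 / ||sum_i g_i||^2 for a list of equal-length gradient vectors.

    Identical gradients give 1/n (lowest diversity for n samples); mutually
    orthogonal gradients give 1.
    """
    sum_of_sq_norms = sum(sum(c * c for c in g) for g in per_sample_grads)
    summed = [sum(components) for components in zip(*per_sample_grads)]
    sq_norm_of_sum = sum(c * c for c in summed)
    if sq_norm_of_sum == 0.0:
        raise ValueError("per-sample gradients sum to zero; diversity is undefined")
    return sum_of_sq_norms / sq_norm_of_sum


if __name__ == "__main__":
    aligned = [[1.0, 0.0], [1.0, 0.0]]
    orthogonal = [[1.0, 0.0], [0.0, 1.0]]
    print(gradient_diversity(aligned), gradient_diversity(orthogonal))
```

Higher values mean the samples in a batch contribute less redundant information, which is the property linked above to tolerating larger batch sizes.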

In overparameterized linear and nonlinear regimes, mean-field (μP) and NTK-based analyses further clarify these distinctions. Under μP scaling, "wider is better" holds: increasing width allows stable and fast training, with learning rates that remain stable across architectural scales. Under NTK scaling, wider networks enter a "lazy" regime where feature learning is frozen and training reduces to regression with a fixed, width-independent kernel (Bordelon et al., 2025). For deep-and-wide ReLU networks, the neural tangent kernel (NTK) exhibits variance and feature evolution exponential in the depth-to-width ratio ($L/N$), allowing feature learning even in parameter regimes considered "lazy" in infinite-width theory (Hanin et al., 2019).

Concerning degeneracy, deep, narrow networks risk "angle collapse": layerwise compression of activation space leads to nearly constant output, impeding learning entirely. Finite-width corrections reveal that to avoid such degeneracy, width should scale at least linearly with depth (rule of thumb: $N \gtrsim L$) (Jakub et al., 2023).
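As a rough planning aid, the sketch below enumerates fully connected shapes under a fixed parameter budget and discards those whose width falls below a chosen multiple of the depth. The threshold `min_ratio=1.0` is an assumption standing in for "width at least linear in depth"; the cited work does not prescribe this helper, and all names are hypothetical.

```python
def mlp_param_count(input_dim: int, width: int, depth: int, output_dim: int) -> int:
    """Weights plus biases for an MLP with `depth` hidden layers of equal `width`."""
    first = input_dim * width + width
    hidden = (depth - 1) * (width * width + width)
    last = width * output_dim + output_dim
    return first + hidden + last


def widest_for_budget(budget: int, input_dim: int, depth: int, output_dim: int) -> int:
    """Largest width whose parameter count stays within `budget` (0 if none fits)."""
    width = 0
    while mlp_param_count(input_dim, width + 1, depth, output_dim) <= budget:
        width += 1
    return width


def candidate_shapes(budget, input_dim, output_dim, depths, min_ratio=1.0):
    """Return (depth, width) pairs within budget that satisfy width >= min_ratio * depth."""
    shapes = []
    for depth in depths:
        width = widest_for_budget(budget, input_dim, depth, output_dim)
        if width >= min_ratio * depth:
            shapes.append((depth, width))
    return shapes


if __name__ == "__main__":
    # With a 5000-parameter budget, the depth-64 option is too narrow and is dropped.
    print(candidate_shapes(5000, input_dim=10, output_dim=1, depths=[2, 8, 64]))
```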

4. Architecture-Specific Scaling: Self-Attention, GNNs, and Lightweight Designs

Self-attention (Transformer) architectures exhibit qualitatively different width–depth scaling laws. While depth is small relative to width ($L < \log_3 d_x$ for depth $L$ and hidden size $d_x$), depth is exponentially more effective: adding layers drastically increases separation rank (expressivity). Beyond this threshold, a "depth-inefficiency" transition arises: additional depth confers only a linear benefit, and width becomes equally valuable. Quantitative guidelines relate the optimal $(L, d_x)$ allocation to the total parameter budget: for trillion-parameter LLMs, width in excess of 30,000 is predicted to be optimal, not extreme depth (Levine et al., 2020).
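The arithmetic behind such guidelines can be sketched as follows. Two assumptions are made here and are not stated in the text above: the depth-efficiency threshold is taken to have the form $L < \log_3 d_x$ for hidden size $d_x$, and each Transformer layer is approximated as holding about $12 d_x^2$ parameters (attention projections plus a 4x feed-forward block, ignoring embeddings, biases, and normalization). Function names are illustrative.

```python
import math


def depth_efficiency_threshold(hidden_size: int) -> float:
    """Assumed threshold log_3(d_x) below which extra depth is most effective."""
    return math.log(hidden_size, 3)


def approx_depth_for_budget(param_budget: float, hidden_size: int) -> float:
    """Depth implied by a parameter budget at a given width, using ~12 * d_x^2 per layer."""
    return param_budget / (12 * hidden_size ** 2)


if __name__ == "__main__":
    width = 30_000
    depth = approx_depth_for_budget(1e12, width)
    threshold = depth_efficiency_threshold(width)
    print(f"width={width} implied depth={depth:.1f} log3(width)={threshold:.2f}")
```

Under these assumptions, a trillion-parameter model at width 30,000 has on the order of 90 layers, far above $\log_3 d_x \approx 9.4$, so it sits in the regime where width and depth contribute comparably.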

For message-passing GNNs, expressive capacity is controlled by the product of depth and width: $L \cdot N$ (depth $L$ times width $N$) must scale polynomially in the graph size to solve even basic tasks (e.g., cycle detection, diameter estimation) (Loukas, 2019). Shallow–wide and narrow–deep networks are largely interchangeable in worst-case complexity, but neither alone is sufficient when strict limits are imposed.

In resource-constrained or real-time settings, the depth–width trade-off must also account for the efficiency–accuracy balance. For lightweight vision tasks, tailored designs (e.g., CTD-Net, TinyNet) show that maintaining moderate depth and input resolution while aggressively pruning width yields substantially better accuracy-parameter and accuracy-latency trade-offs than compound scaling along all axes or depth-only/width-only strategies (Li et al., 2023, Han et al., 2020). In these regimes, width can often be reduced to minimal values with only minor loss in accuracy, while depth and resolution dominate (Han et al., 2020).
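A back-of-the-envelope version of this trade-off uses the usual first-order approximation for convolutional networks: FLOPs grow linearly with the depth multiplier and quadratically with the width and resolution multipliers. This is an assumption for illustration, not the fitted scaling formulas from the cited TinyNet or CTD-Net work, and the function names are hypothetical.

```python
import math


def relative_flops(depth_mult: float, width_mult: float, res_mult: float) -> float:
    """Approximate FLOPs relative to a baseline: depth * width^2 * resolution^2."""
    return depth_mult * width_mult ** 2 * res_mult ** 2


def width_mult_for_budget(flops_ratio: float, depth_mult: float = 1.0,
                          res_mult: float = 1.0) -> float:
    """Width multiplier that meets a target FLOPs ratio once depth and resolution are fixed."""
    return math.sqrt(flops_ratio / (depth_mult * res_mult ** 2))


if __name__ == "__main__":
    # Keep depth and resolution near the baseline and absorb the cut in width.
    w = width_mult_for_budget(0.3, depth_mult=0.9, res_mult=0.95)
    print(f"width multiplier={w:.3f} check={relative_flops(0.9, w, 0.95):.3f}")
```

Because width enters quadratically, a modest reduction in channel count frees a large share of the compute budget, which is one reason width is the natural axis to shrink first.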

5. Hybrid and Quasi-Equivalence Phenomena: When Width and Depth Are Exchangeable

Under some conditions, width and depth are "quasi-equivalent": for ReLU networks, any function realized by a wide network can be approximated to arbitrary accuracy by a deep network of moderate, fixed width, and vice versa, though polynomial increases in depth or width are typically required (Fan et al., 2020). In quadratic-neuron networks, factorization (wide) and continued-fraction (deep) representations further support the equivalence in polynomial function approximation.

However, this symmetry is fundamentally limited in standard architectures. Worst-case separation theorems demonstrate exponential costs to interchange depth and width indiscriminately (Lu et al., 2017, Vardi et al., 2022). Thus, while hybrid or balanced approaches are often practical, and depth/width can be traded off to a degree, the regimes where one can fully substitute for the other are sharply delimited.

6. Design Guidelines and Practical Implications

The following synthesized rules emerge:

For function approximation and expressivity, increasing depth (beyond a minimal width) is exponentially more cost-effective than increasing width alone (Safran et al., 2016, Lu et al., 2017, Vardi et al., 2022, Bu et al., 2020).
Architectural design should avoid deep, narrow networks below the $N \gtrsim L$ limit (width $N$ at least linear in depth $L$) to preclude degeneracy and poor training (Jakub et al., 2023).
In large-batch distributed training, favor wider–shallower designs to maximize gradient diversity and scaling efficiency under a fixed parameter budget (Chen et al., 2018).
For self-attention models, leverage the depth-efficient regime (before depth-inefficiency transition) via moderate-to-high width, especially at scale (Levine et al., 2020).
Lightweight and real-time models benefit more from preserving depth and resolution, reducing width only as needed to satisfy FLOPs/latency constraints (Li et al., 2023, Han et al., 2020).
Hybrid paradigms—parameter sharing along depth, MoE-based width expansion, and sharing normalization statistics—enable parameter-efficient architectures that outperform strictly deeper models of equivalent parameter count in practical domains (Xue et al., 2021).
For GNNs, budget the product $L \cdot N$ (depth times width) according to global versus local task requirements; both dimensions are essential when capturing complex graph properties (Loukas, 2019).

7. Outstanding Questions and Future Directions

Despite substantial advances, several open challenges remain:

Characterizing and preventing representation collapse and other forms of degeneracy in extremely wide/deep overparameterized regimes (Nguyen et al., 2020, Jakub et al., 2023).
Extending asymptotic results to convolutional, residual, and attention-based architectures in realistic, finite-data contexts.
Developing regularization and architecture search strategies that explicitly optimize depth–width allocation for given hardware and data constraints.
Clarifying the diminishing returns and transitions between depth- and width-dominated regimes, particularly in transformer-based and GNN architectures at multi-billion parameter scales (Levine et al., 2020).
Exploring the impact of depth–width scaling on generalization, compression, and robustness beyond empirical accuracy.

The corpus of contemporary research demonstrates that while width and depth can each independently enable universality, their roles are fundamentally distinct across approximation theory, optimization, representation learning, and pragmatic deployment. The optimal depth–width configuration is highly context-dependent, but robust design principles have now emerged to guide architecture selection in both theoretical and practical regimes.