Overparameterized Two-Layer ReLU Networks

Updated 22 October 2025
  • Overparameterized two-layer ReLU networks are neural architectures with a single hidden layer and many more neurons than data samples, enabling efficient yet expressive representations.
  • The analysis reveals that early alignment of neuron weights promotes a simplicity bias, which drives implicit regularization and mitigates overfitting despite high capacity.
  • The article demonstrates that benign loss landscapes, controlled memorization capacity, and optimal generalization emerge in these networks, challenging traditional bias–variance trade-offs.

Overparameterized two-layer ReLU networks are neural architectures with a single hidden layer—typically containing many more neurons than data samples—that employ the Rectified Linear Unit (ReLU) activation. This configuration has been the subject of extensive theoretical analysis, revealing deep connections between expressiveness, optimization, implicit regularization, memorization, and generalization in highly parameterized settings. Despite their apparent capacity to interpolate arbitrary data, such networks often display a strong simplicity bias, converging to solutions that generalize remarkably well even in the presence of significant noise. This article presents a detailed examination of the mathematics, dynamics, and consequences of this phenomenon based on recent theoretical and empirical advances.

1. Architecture, Expressive Power, and Functional Equivalence

A two-layer ReLU network computes a function of the form

$$h_\theta(x) = \sum_{j=1}^{m} a_j\,\sigma(w_j^\top x),$$

with $m \gg n$, where $n$ is the number of training samples, $\sigma$ is the ReLU activation, and $a_j$ and $w_j$ are the output-layer and input-layer parameters, respectively.
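
For concreteness, a minimal NumPy sketch of this forward pass is given below; the shapes, the Gaussian data, and the small initialization scale are illustrative assumptions rather than choices taken from the cited analyses.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def two_layer_relu(X, W, a):
    """h_theta(x) = sum_j a_j * relu(w_j^T x), evaluated for every row x of X.

    X: (n, d) data matrix, W: (m, d) input-layer weights, a: (m,) output weights.
    """
    return relu(X @ W.T) @ a             # (n, m) hidden activations -> (n,) outputs

# Illustrative overparameterized regime: hidden width m far exceeds sample size n.
rng = np.random.default_rng(0)
n, d, m = 20, 5, 500
X = rng.standard_normal((n, d))
W = 0.01 * rng.standard_normal((m, d))   # small initialization scale (see Section 2)
a = 0.01 * rng.standard_normal(m)
print(two_layer_relu(X, W, a).shape)     # -> (20,)
```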

The expressive power of such a network is characterized, via logical analysis, as a disjunctive normal form (DNF) over linear threshold inequalities. Specifically, the decision boundary of a two-layer ReLU network can be unfolded into a disjunction of conjunctions of hyperplane threshold functions. Formally, the classifier

$$y = \operatorname{sgn}\!\left[w_0 + \sum_{k \in \mathcal{P}} R(a_k(x)) - \sum_{k \in \mathcal{N}} R(a_k(x))\right]$$

(where $a_k(x) = u_k \cdot x + b_k$ and $R$ is the ReLU function) is equivalent in expressiveness to a threshold network possessing exponentially more units—specifically, $2^{|\mathcal{P}|} + 2^{|\mathcal{N}|}$ hidden threshold units spanning all activation patterns (Pan et al., 2015).

This exponential efficiency implies that even a relatively small two-layer ReLU network captures highly intricate decision boundaries, which, if represented in a threshold network, would require massively more units. Notably, under certain algebraic sufficient conditions, it is possible to compress threshold networks with $2^n$ units into ReLU networks with only $n$ units, showing that ReLU networks enable highly efficient, yet expressive, representations.
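
To make the classifier form above concrete, the following sketch evaluates the sign decision rule for an arbitrary split of the hidden units into the sets $\mathcal{P}$ and $\mathcal{N}$; the hyperplanes, the bias $w_0$, and the index sets are placeholder values, not taken from Pan et al. (2015).

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_sign_classifier(x, U, b, w0, pos_idx, neg_idx):
    """y = sgn[w0 + sum_{k in P} ReLU(a_k(x)) - sum_{k in N} ReLU(a_k(x))],
    with a_k(x) = u_k . x + b_k."""
    a = U @ x + b                                        # all hyperplane functions a_k(x)
    score = w0 + relu(a[pos_idx]).sum() - relu(a[neg_idx]).sum()
    return np.sign(score)

rng = np.random.default_rng(1)
d, K = 3, 6
U = rng.standard_normal((K, d))                          # hyperplane normals u_k
b = rng.standard_normal(K)                               # offsets b_k
pos_idx, neg_idx = np.arange(3), np.arange(3, 6)         # partition into P and N
print(relu_sign_classifier(rng.standard_normal(d), U, b,
                           w0=0.1, pos_idx=pos_idx, neg_idx=neg_idx))
```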

2. Optimization, Simplicity Bias, and Early Alignment Dynamics

Despite their immense parameter space, overparameterized ReLU networks often do not fit the training data exactly when the number of samples is large. Instead, they exhibit a pronounced "simplicity bias": the training dynamics, particularly with small initialization scales, cause neurons' weight vectors to quickly align toward a small discrete set of "extremal" directions, i.e., vectors maximizing the piecewise-linear correlation function

$$G_n(w) = \langle w, D_n(w,0) \rangle, \qquad D_n(w,0) = \frac{1}{n}\sum_{k=1}^{n} \mathbb{1}_{\{w^\top x_k > 0\}}\, y_k x_k,$$

during the early phase of training (Boursier et al., 3 Oct 2024).
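
The correlation function above is straightforward to evaluate numerically. The sketch below computes $D_n(w, 0)$ and $G_n(w)$ for a candidate direction $w$ on synthetic data from a noisy linear teacher; the data-generating model is an illustrative assumption, not the setup of the cited experiments.

```python
import numpy as np

def D_n(w, X, y):
    """D_n(w, 0) = (1/n) * sum_k 1{w^T x_k > 0} * y_k * x_k."""
    active = (X @ w > 0).astype(float)       # indicator of the neuron's active set
    return (active * y) @ X / len(y)         # (d,) vector

def G_n(w, X, y):
    """G_n(w) = <w, D_n(w, 0)>, the piecewise-linear correlation with the data."""
    return w @ D_n(w, X, y)

rng = np.random.default_rng(2)
n, d = 100, 4
X = rng.standard_normal((n, d))
beta = rng.standard_normal(d)
y = X @ beta + 0.1 * rng.standard_normal(n)  # noisy linear teacher

w = rng.standard_normal(d)
w /= np.linalg.norm(w)                       # alignment concerns directions, so normalize
print(G_n(w, X, y))
```

Extremal directions are then (local) maximizers of $G_n$ over the unit sphere, toward which neurons with small initialization rotate before their norms grow.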

This "early alignment phase" induces directional quantization: most neurons eventually point only at a handful of these extremal directions, grouping themselves (by direction) to capture the principal axes of the data's true generative structure. As a consequence,

  • The effective number of free parameters is drastically reduced.
  • The network leaves much of its capacity for fitting high-frequency noise unused, instead converging to an estimator that closely approximates the population minimizer (e.g., an ordinary least squares estimator in regression contexts).

The onset of an "optimization threshold" is thereby observed: as the number of training samples increases beyond a critical regime (often scaling polynomially in input dimension), the network loses the capacity to interpolate all training samples but, instead, settles into these simplicity-biased, non-interpolating local minima, which correspond to near-optimal generalization (e.g., the OLS solution in regression). This transition is a beneficial effect, not a pathology; above the optimization threshold, generalization improves even as training loss plateaus above zero (Boursier et al., 3 Oct 2024).

3. Memorization, Overparameterization, and Capacity

The memorization capacity of two-layer ReLU networks is tightly linked to the number of weights (connections). With sufficient overparameterization—that is, when the product of the hidden and input dimensions exceeds the number of samples up to logarithmic factors—the network can "memorize" any arbitrary labeling (i.e., fit any assignment of binary labels) (Vershynin, 2020). A two-phase probabilistic construction—an enrichment phase (using sparse, random projections with ReLU activations and large bias) followed by a perceptron phase—yields an explicit weight allocation achieving perfect fitting. The memory capacity, $K_{\mathrm{max}}$, scales as

$$K_{\mathrm{max}} \sim \frac{W}{\log^5 W},$$

where $W$ is the number of deep connections.
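
A quick numerical illustration of memorization under overparameterization is given below. It is not the two-phase enrichment-plus-perceptron construction of the cited work, but a simpler random-features sketch making the same point: once the number of connections comfortably exceeds the number of samples, arbitrary binary labels can be fitted exactly.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, m = 200, 20, 1000                     # hidden width m well above sample size n
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)         # arbitrary (random) binary labels

W = rng.standard_normal((m, d))             # random first-layer weights, kept fixed here
H = np.maximum(X @ W.T, 0.0)                # (n, m) hidden ReLU features

a, *_ = np.linalg.lstsq(H, y, rcond=None)   # fit output weights by least squares
y_hat = np.sign(H @ a)
print((y_hat == y).mean())                  # 1.0: every random label is memorized
```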

This theoretical result provides a rigorous basis for the empirical observation that overparameterized networks can fit even random labels (Pan et al., 2015, Vershynin, 2020). Conversely, when in the regime above the optimization threshold and subject to early alignment, the network may not fully exploit this memorization capacity, aligning instead with the low-complexity structure of the data.

4. Loss Landscape, Critical Points, and Spurious Minima

The loss landscape of an overparameterized two-layer ReLU network is partitioned into a vast number of "activation regions," each determined by the binary activation pattern $A$ of all neurons across all datapoints. In any such region, the loss is smooth and, with mild overparameterization (e.g., width $m \gtrsim n/d_0$, where $d_0$ is the input dimension), a generic dataset ensures that the Jacobian of the network output with respect to the parameters is full-rank almost everywhere. This property eliminates spurious differentiable local minima within most regions, confining the global minima to affine subspaces that interpolate the data (Karhadkar et al., 2023).
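
The full-rank property is easy to probe on random instances: for a two-layer ReLU network, the Jacobian of the outputs with respect to the parameters has the closed form $\partial h(x_i)/\partial w_j = a_j\,\mathbb{1}_{\{w_j^\top x_i > 0\}}\, x_i$ and $\partial h(x_i)/\partial a_j = \sigma(w_j^\top x_i)$. The sketch below, on generic Gaussian data with width of order $n/d_0$, assembles this matrix and checks that its rank equals $n$; it is a numerical illustration of the statement, not a proof.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d0 = 40, 10
m = 2 * (n // d0 + 1)                      # mild overparameterization, m on the order of n/d0
X = rng.standard_normal((n, d0))
W = rng.standard_normal((m, d0))           # first-layer weights
a = rng.standard_normal(m)                 # output weights

# Activation pattern A: A[i, j] = 1{w_j^T x_i > 0}, which labels the region.
A = (X @ W.T > 0).astype(float)            # (n, m)

# Jacobian block w.r.t. first-layer weights: d h(x_i)/d w_j = a_j * A[i, j] * x_i.
J_W = ((A * a)[:, :, None] * X[:, None, :]).reshape(n, m * d0)

# Jacobian block w.r.t. output weights: d h(x_i)/d a_j = relu(w_j^T x_i).
J_a = A * (X @ W.T)

J = np.hstack([J_W, J_a])                  # (n, m*d0 + m) full Jacobian
print(np.linalg.matrix_rank(J) == n)       # full row rank almost surely on generic data
```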

Moreover, results from algebraic geometry and symmetry analysis reveal that overparameterization "annihilates" symmetric spurious minima: adding even a single neuron to a two-layer ReLU network can turn previously existing non-global minima into saddle points through the addition of descent directions in the parameter space, as exposed via the block decomposition of the Hessian operator (Arjevani et al., 2022). This enhances the favorability of the optimization landscape for gradient-based methods, making convergence to robust (possibly non-interpolating) solutions typical in practice.

5. Implicit Regularization, Generalization, and Statistical Behavior

The deviation of overparameterized two-layer ReLU networks from pure interpolation is not random. Instead, a pronounced implicit regularization effect operates:

  • The early alignment phase clusters neurons along few directions, biasing learning toward low-complexity (often linear or low-multivariate-index) models (Boursier et al., 3 Oct 2024, Parkinson et al., 2023).
  • When trained with weight decay, the induced regularizer becomes equivalent to a group sparsity penalty, enforcing solutions nearly as simple as possible (often sparse, or low-rank in weight matrices) (Wang et al., 2022, Parkinson et al., 2023); a short derivation is sketched after this list.
  • Empirically, networks with very large widths and standard regularization generalize as well or better than their less overparameterized counterparts, even with massively more parameters than samples (Wang et al., 2022, Boursier et al., 3 Oct 2024).
  • The simplicity bias yields solutions with test risk comparable to an optimal population estimator (e.g., OLS), and transitions away from "harmful" interpolation as soon as the number of training points exceeds the optimization threshold (Boursier et al., 3 Oct 2024).
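
A short sketch of the weight-decay equivalence referenced in the list above, using the notation of Section 1 and the standard rescaling argument (rather than the exact derivations of the cited papers): since $\sigma(\alpha z) = \alpha\,\sigma(z)$ for $\alpha > 0$, the reparameterization $(w_j, a_j) \mapsto (\alpha_j w_j, a_j/\alpha_j)$ leaves $h_\theta$ unchanged, so the weight-decay penalty can be minimized over these rescalings:

$$\min_{\alpha_j > 0}\ \frac{\lambda}{2} \sum_{j=1}^{m} \left( \alpha_j^2 \|w_j\|_2^2 + \frac{a_j^2}{\alpha_j^2} \right) = \lambda \sum_{j=1}^{m} |a_j|\,\|w_j\|_2,$$

by the AM-GM inequality (with equality at $\alpha_j^2 = |a_j|/\|w_j\|_2$). The right-hand side is a group-sparsity penalty over neurons, which is why weight decay drives many neurons to zero and favors the simple solutions described above.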

This behavior can be analytically described by the decomposition of the alignment function $G_n(w)$ and the dynamics of the associated gradient flow. The resulting estimator, particularly in regression with a linear teacher, converges to a combination of one-sided OLS estimators,

$$h_{\theta_\infty}(x) = \left(\beta_{n,+}^\top x\right)_+ - \left(-\beta_{n,-}^\top x\right)_+,$$

where $\beta_{n,+}$ and $\beta_{n,-}$ are least squares estimators over the positively and negatively labeled data (Boursier et al., 3 Oct 2024).
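
The limit estimator above is simple to instantiate. The sketch below fits the two one-sided least squares estimators on synthetic data from a linear teacher and composes the prediction according to the displayed formula; the sign-based split of the samples and the data-generating model are illustrative assumptions.

```python
import numpy as np

def one_sided_ols_predictor(X, y):
    """Fit beta_{n,+} on positively labeled samples and beta_{n,-} on negatively
    labeled ones, then return h(x) = (beta_+^T x)_+ - (-beta_-^T x)_+."""
    pos, neg = y > 0, y < 0
    beta_pos, *_ = np.linalg.lstsq(X[pos], y[pos], rcond=None)
    beta_neg, *_ = np.linalg.lstsq(X[neg], y[neg], rcond=None)

    def h(x):
        return np.maximum(x @ beta_pos, 0.0) - np.maximum(-(x @ beta_neg), 0.0)

    return h

rng = np.random.default_rng(4)
n, d = 200, 3
X = rng.standard_normal((n, d))
beta_star = rng.standard_normal(d)
y = X @ beta_star + 0.1 * rng.standard_normal(n)   # linear teacher plus noise

h = one_sided_ols_predictor(X, y)
X_test = rng.standard_normal((5, d))
print(h(X_test))                                    # predictions of the limit estimator
```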

6. Implications and Relevance for Learning Theory and Practice

The synthesis provided by this body of work implies that the behavior of overparameterized, shallow (two-layer) ReLU networks challenges the classical bias–variance dilemma:

  • Networks perform optimally when allowed to enter the simplicity-biased regime, above the optimization threshold, even at the cost of non-interpolation.
  • The geometry of the loss landscape, via overparameterization, is highly benign: spurious minima are rare or easily escaped, and most regions are dominated by global minima with favorable statistical properties (Karhadkar et al., 2023, Arjevani et al., 2022).
  • The absence of full interpolation with large datasets is not a failure but, rather, a manifestation of the network identifying the underlying, generalizable signal.
  • The practical design of such networks should emphasize initial scaling, learning rates, and regimes conducive to feature learning and early alignment, rather than mere interpolation performance (Boursier et al., 3 Oct 2024).

This comprehensive framework, unifying analysis of expressivity, optimization, loss landscape topology, memorization, and statistical regularity, provides a rigorous basis for understanding the generalization and favorable learning behavior of overparameterized two-layer ReLU networks in high-dimensional problems and realistic data settings.
