
Memorization Capacity in Deep ReLU Networks

Updated 5 February 2026
  • The paper demonstrates that deep ReLU networks can memorize arbitrary data with fewer parameters by leveraging depth-width trade-offs and advanced bit-complexity strategies.
  • It establishes precise scaling laws, showing that, under separation assumptions, networks can achieve memorization with sub-linear parameters, such as width-3 models using $O(N^{2/3})$ parameters.
  • The work employs geometric and combinatorial methods—including random orthogonalization and activation region analysis—to quantify expressivity limits and architect optimal neural structures.

Memorization in deep ReLU networks refers to their capacity to interpolate arbitrary labelings of finite datasets—realizing any mapping from a set of given inputs to specified outputs. This property is a canonical measure of expressivity, distinct from generalization, and has played a central role in recent theoretical analyses of deep learning's empirical success. Recent work has established sharp quantitative bounds on memorization capacity, parameter-count trade-offs, and associated activation complexity, exposing deep links between architectural design, depth–width scaling, bit complexity, and the structure of learned decision boundaries.

1. Theoretical Limits of Memorization Capacity

Classical constructions show that arbitrarily labeled data of size $N$ in $\mathbb{R}^{d}$ can be memorized by a fully connected ReLU network with $\mathcal{O}(N)$ parameters; this corresponds to using a single hidden layer with $N$ neurons, each isolating one datapoint (Yun et al., 2018). However, by exploiting increased depth, recent work has established much sharper upper and lower bounds:

  • Tight Bounds (General Position/Separation): The minimal number of required parameters is $\tilde{\Theta}(\sqrt{N})$, where $\tilde{\Theta}(\cdot)$ suppresses logarithmic factors (Vardi et al., 2021). This is achievable under a mild pairwise separation assumption: $\|x_i - x_j\| \geq \delta > 0$ for all $i \neq j$.
  • Width–Depth Trade-off: For an $L$-layer network, the parameter requirement reduces to $\tilde{O}(N/L)$, optimal up to logarithmic factors. Depth can thus be traded for width.
  • Bit Complexity and Precision: Achieving these sub-linear parameter counts requires weights of large bit-complexity, packing $\sim\sqrt{N}$ bits into single parameters; with limited precision $B$, the parameter count scales as $\tilde{O}(N/B)$ (Vardi et al., 2021).
  • Necessary Condition: These results are tight due to VC-dimension lower bounds, specifically $\mathrm{VCdim} = O(W^2)$ for networks with $W$ parameters, meaning memorizing $N$ patterns requires $W = \Omega(\sqrt{N})$.

The constructive method uses projections, interval bucketing, and bit-extraction networks—implementable with layered, narrow ReLU architectures (Park et al., 2020, Vardi et al., 2021).
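The baseline $\mathcal{O}(N)$ construction is easiest to make concrete in one dimension: sort the points, place one ReLU kink at each, and choose the slope changes so the resulting piecewise-linear function interpolates every label. Below is a minimal NumPy sketch of this one-neuron-per-point scheme (illustrative only; not the exact construction from the cited papers, which handle general dimension via projection):

```python
import numpy as np

def memorize_1d(x, y):
    """One-hidden-layer ReLU net f(t) = a0 + sum_i a_i * relu(t - x_i)
    interpolating labels y at distinct 1-D points x: one hidden neuron
    per datapoint, i.e. O(N) parameters."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    a0 = y[0]                                  # value to the left of x[0]
    a = np.zeros(len(x))
    slope = 0.0
    for i in range(len(x) - 1):
        new_slope = (y[i + 1] - y[i]) / (x[i + 1] - x[i])
        a[i] = new_slope - slope               # kink at x[i] adjusts the slope
        slope = new_slope
    def f(t):
        return a0 + np.maximum(0.0, np.asarray(t)[:, None] - x[None, :]) @ a
    return f

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=16)
y = rng.normal(size=16)                        # arbitrary real labels
f = memorize_1d(x, y)
print(np.max(np.abs(f(x) - y)))                # numerically ~0: all 16 points fitted
```

Each hidden neuron contributes one kink, so the network exactly matches any labeling of the $N$ points; the depth-based constructions above replace this width-$N$ layer with narrow bit-extraction stages.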

2. Depth, Width, and Overparameterization: Fine-Grained Scaling Laws

Explicit constructions and lower bounds clarify how depth and width interplay in determining memorization:

| Model Class | Parameter Complexity | Key Architectural Conditions |
|---|---|---|
| 2-layer ReLU FNN | $\Theta(N)$ | Width $N$ |
| 3-layer ReLU FNN | $\Theta(\sqrt{N})$ | Width $\sim\sqrt{N}$ per hidden layer |
| General $L$-layer ReLU | $\tilde{O}(N/L)$ | Total hidden-layer parameters $W=\Omega(N)$ |
| Narrow, deep (width-2) ReLU (Hernández et al., 2024) | Depth $2N + 4M - 1$ | Width $2$, $M$ classes, all parameters explicit |

Depth enables parameter efficiency; for any $\Delta$-separated dataset, width-3 ReLU nets suffice with $O(N^{2/3})$ parameters and $O(N^{1/3})$ depth (Park et al., 2020). The extreme depth–width trade-off is realized in (Hernández et al., 2024), which provides a constructive scheme achieving memorization with width $2$ and depth $2N + 4M - 1$ for $N$ samples and $M$ classes. Conversely, with fixed depth, width must increase: three-layer networks require at least $\Theta(\sqrt{N})$ width to achieve full memorization, and this is both necessary and sufficient (Yun et al., 2018, Vardi et al., 2021).
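These trade-offs can be sanity-checked with a generic parameter counter for fully connected ReLU networks (the specific widths and depths below are arbitrary illustrations, not the constructions from the cited papers):

```python
def relu_mlp_params(d_in, widths, d_out=1):
    """Total parameter count (weights + biases) of a fully connected ReLU net
    with hidden-layer widths `widths`."""
    dims = [d_in] + list(widths) + [d_out]
    return sum(dims[i] * dims[i + 1] + dims[i + 1] for i in range(len(dims) - 1))

N, d = 10_000, 32
shallow = relu_mlp_params(d, [N])        # one hidden layer of width N
deep = relu_mlp_params(d, [3] * 220)     # a narrow width-3 stack of depth 220
print(shallow, deep)                     # the deep narrow net is far cheaper
```

The shallow width-$N$ net costs $\approx N(d+2)$ parameters, while a width-3 stack grows only linearly in depth, which is why depth can be traded against parameter count in the constructions above.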

3. Geometric, Combinatorial, and Architectural Mechanisms

Memorization constructions leverage geometric and combinatorial mechanisms:

  • Random Orthogonalization: Early layers map sufficiently separated input points to nearly orthogonal hidden representations using random Gaussian weights and sparsifying biases (Vershynin, 2020).
  • Bit-Packing and Extraction: For optimal parameter efficiency, label information and input identity are packed into large-weight parameters, and depth is used for bit-extraction and range folding (Vardi et al., 2021, Park et al., 2020).
  • Sparsity via Large Biases: Sparse activation patterns, achieved by high biases, greatly aid the decorrelation and separability of activations, reducing the sample complexity per hidden connection (Vershynin, 2020).
  • Convex Carving via ReLU Dynamics: Explicit, width-2 constructions use sequential “carving”—using the geometry of ReLU-induced halfspaces to sequentially isolate, collapse, and assign target outputs to each sample (Hernández et al., 2024).
  • ResNet and General Position: For data in general position, residual connections allow memorization with $O(N/d_x)$ hidden nodes rather than $O(N)$, via hyperplane carving and class-region separation (Yun et al., 2018).
  • Ensemble Controllability View: Memorizing networks can be interpreted as discrete control systems that synchronously steer each sample’s trajectory through the network to its assigned label (Hernández et al., 2024).

For random labelings or maximally challenging memorization tasks, these constructions remain near-optimal under known bounds.
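The random-orthogonalization mechanism is easy to observe numerically. The sketch below (illustrative parameters, not Vershynin's exact construction) passes unit-norm inputs through random Gaussian weights with a large positive activation threshold, then checks that the resulting sparse hidden codes are closer to pairwise orthogonal than the raw inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m = 50, 20, 4000             # samples, input dim, hidden width (arbitrary)
X = rng.normal(size=(N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs

W = rng.normal(size=(d, m))        # random Gaussian first-layer weights
b = 1.0                            # large threshold -> sparse activations
H = np.maximum(0.0, X @ W - b)     # hidden representations

def max_offdiag_cosine(A):
    """Largest |cosine similarity| between distinct rows of A."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    G = np.abs(A @ A.T)
    np.fill_diagonal(G, 0.0)
    return G.max()

print(max_offdiag_cosine(X), max_offdiag_cosine(H))
```

The thresholded random features decorrelate even moderately similar inputs, which is the property the early layers of the memorization constructions rely on.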

4. Activation Region Complexity and Expressivity Gap

A key intrinsic limitation of deep ReLU networks is the scaling of their activation-region (or pattern) complexity during memorization:

  • Depth-Independence of Region Count: For a ReLU FNN with $N_\mathrm{neurons}$ total hidden neurons and input dimension $n_0$, the expected number of activation regions intersecting a fixed input cube satisfies $E[\#\,\mathrm{regions}] \leq T \cdot (N_\mathrm{neurons})^{n_0}/n_0!$ for $N_\mathrm{neurons} \geq n_0$, with $T \approx 1$ for standard random initializations (Hanin et al., 2019). Critically, this bound is polynomial in the total number of neurons, not exponential in depth.
  • Empirical Upper Bounds: Even under full data memorization or random-label training, region counts grow at most by a small constant factor above initialization and never approach the exponential combinatorial bound (Hanin et al., 2019). For example, with $N_\mathrm{neurons} = 96$, $n_0 = 2$, observed region counts cluster around theoretical predictions ($\sim 4000$–$5500$ out of a theoretical max of $2^{96}$).
  • Optimization Bias: Gradient-based training, even for pure memorization, exploits only a polynomially sized subspace of the network's full combinatorial region capacity.

This highlights a sharp “expressivity gap”: deep ReLU nets can, in principle, realize exponentially many activation patterns, but memorization tasks—regardless of label or input structure—remain confined to a much smaller pattern subset in practice.
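The gap is simple to probe empirically: enumerate the activation patterns of a small randomly initialized network on a dense input grid. The sketch below (arbitrary widths and initialization; it counts only patterns hit by grid samples, a lower bound on the true region count) uses 96 hidden neurons in two input dimensions, as in the example above:

```python
import numpy as np

rng = np.random.default_rng(1)
n0, widths = 2, [32, 32, 32]        # 96 hidden neurons total
Ws, bs, fan_in = [], [], n0
for w in widths:                    # He-style weights, small random biases
    Ws.append(rng.normal(size=(fan_in, w)) * np.sqrt(2.0 / fan_in))
    bs.append(0.1 * rng.normal(size=w))
    fan_in = w

def patterns(X):
    """Concatenated sign bits (one per hidden neuron) for each input row."""
    h, bits = X, []
    for W, b in zip(Ws, bs):
        z = h @ W + b
        bits.append(z > 0)
        h = np.maximum(z, 0.0)
    return np.concatenate(bits, axis=1)

g = np.linspace(-1.0, 1.0, 300)     # dense grid on the square [-1, 1]^2
X = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
n_patterns = len(np.unique(patterns(X), axis=0))
print(n_patterns, "distinct patterns observed; 2**96 combinatorially possible")
```

The observed count sits in the polynomial regime of the Hanin et al. bound, many orders of magnitude below the $2^{96}$ combinatorial ceiling.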

5. Robust Memorization and Parameter Complexity under Perturbations

Memorization under adversarial or robust requirements fundamentally increases parameter complexity:

  • Robustness Ratio and Complexity Scaling: For a prescribed robustness radius $\mu$ (where $\mu = \rho\,\epsilon_D$, with $\epsilon_D$ the minimum across-class separation), the required parameter count $P$ transitions through three regimes (Kim et al., 28 Oct 2025):
    • For $\rho \leq 1/(5N\sqrt{d})$: $P = \tilde{O}(\sqrt{N})$, matching non-robust memorization bounds.
    • For $1/(5N\sqrt{d}) < \rho \leq 1/(5\sqrt{d})$: $P = \tilde{O}(N\,d^{1/4}\,\rho^{1/2})$.
    • For $\rho > 1/(5\sqrt{d})$: $P = \tilde{O}(N\,d^2\,\rho^4)$, reaching the dense network regime as $\rho \to 1$.

Robust memorization thus interpolates smoothly from the standard regime to the worst-case $Nd^2$ scaling with increasing robustness requirements.
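Dropping constants and logarithmic factors, the three regimes can be written as a simple piecewise function (an illustrative paraphrase of the dominant terms, not code from the paper):

```python
import math

def robust_param_scaling(N, d, rho):
    """Dominant term of the robust-memorization parameter bound
    (Kim et al., 2025), with constants and log factors dropped."""
    if rho <= 1 / (5 * N * math.sqrt(d)):
        return math.sqrt(N)                  # non-robust regime
    elif rho <= 1 / (5 * math.sqrt(d)):
        return N * d**0.25 * rho**0.5        # intermediate regime
    else:
        return N * d**2 * rho**4             # dense regime

N, d = 10_000, 64
for rho in (1e-7, 1e-3, 0.5):
    print(f"rho={rho:>7}: P ~ {robust_param_scaling(N, d, rho):.3g}")
```

For these sample values, the bound climbs from the $\sqrt{N}$ regime through the intermediate regime to near-dense scaling as $\rho$ grows.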

6. Activation Nonlinearity, Internal Phases, and Detection of Memorization

The explicit detection and characterization of memorization in deep ReLU nets can be approached via internal measures of activation nonlinearity:

  • Non-Negative Rank and Memorization: The non-negative rank of an activation matrix induced by a batch of similar samples measures proximity to linearity; high non-negative rank is strongly indicative of memorization (Collins et al., 2018). Networks generalizing well show increased linearity (low rank) across deep layers, while networks forced to memorize (e.g., random-label training) retain high nonlinearity.
  • Three Distinct Phases:
    • Feature Extraction: Early layers are robust to low-rank compression, capturing generic features.
    • Memorization: Intermediate layers exhibit label-noise/proportion-dependent nonlinearity increases.
    • Clustering: Late layers focus on bringing like-labeled activations together for final classification.
  • Practical Application: Tracking the area under the accuracy-vs-rank curve (NMF AuC) over training enables principled early-stopping criteria aligned with the onset of memorization and overfitting.

Such internal diagnostic tools refine the understanding of how, when, and why a network transitions from feature learning to memorization during training.
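A rough version of the rank diagnostic can be sketched with plain multiplicative-update NMF on a synthetic activation matrix (this is not the exact NMF AuC procedure of Collins et al.; the batch construction, dimensions, and rank schedule below are arbitrary):

```python
import numpy as np

def nmf(V, k, iters=300, seed=0):
    """Rank-k non-negative factorization via Lee-Seung multiplicative
    updates under Frobenius loss; V must be entrywise non-negative."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + 1e-3
    H = rng.random((k, m)) + 1e-3
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

rng = np.random.default_rng(0)
# synthetic ReLU activation matrix: 64 samples x 32 hidden units
A = np.maximum(0.0, rng.normal(size=(64, 10)) @ rng.normal(size=(10, 32)))

errs = []
for k in (1, 2, 4, 8, 16):
    W, H = nmf(A, k)
    errs.append(np.linalg.norm(A - W @ H) / np.linalg.norm(A))
print([round(e, 3) for e in errs])   # reconstruction improves as allowed rank grows
```

Tracking how quickly such a curve flattens as a function of rank, layer by layer, is the kind of signal the NMF AuC criterion aggregates: layers that stay highly nonlinear (high effective non-negative rank) are the candidates for memorization.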

7. Implications for Optimization, Training Dynamics, and Practical Regimes

Analysis of memorization has several concrete implications for training and model selection:

  • Gradient Descent and NTK: A positive lower bound on the smallest eigenvalue of the neural tangent kernel (NTK) ensures that gradient flow achieves exact memorization for networks with a single wide layer ($n_k = \widetilde{\Omega}(N)$) (Nguyen et al., 2020).
  • SGD Around Global Minima: SGD, initialized near an exact memorizing solution, rapidly and exponentially reduces empirical risk in the tangent space of memorized gradients (Yun et al., 2018).
  • Width–vs–Depth Trade-offs in Practice: Networks employed in contemporary practice are typically wide and overparameterized, far exceeding the theoretical minimum required for memorization, but optimization dynamics (implicit bias, path smoothness) and data structure favor such architectures for generalization.
  • Design Guidelines: Minimal overparameterization for memorization is $\#\text{connections} \gtrsim N \log^5 N$; architectural sparsity induced via large biases and careful scaling allows orthogonalization and efficient pattern isolation (Vershynin, 2020).
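The NTK-regime point is easy to reproduce at toy scale: a single wide hidden layer ($m \gg N$) trained by full-batch gradient descent drives the empirical risk on arbitrary labels toward zero. A minimal NumPy sketch (dimensions, learning rate, and step count are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m, lr = 16, 8, 512, 0.005     # m >> N: heavily overparameterized
X = rng.normal(size=(N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs
y = rng.choice([-1.0, 1.0], size=N)             # arbitrary binary labels

W1 = rng.normal(size=(d, m))        # NTK-style scaling: output layer ~ 1/sqrt(m)
w2 = rng.normal(size=m) / np.sqrt(m)

def loss():
    return 0.5 * np.mean((np.maximum(X @ W1, 0.0) @ w2 - y) ** 2)

loss0 = loss()
for _ in range(3000):               # full-batch gradient descent on both layers
    Z = X @ W1
    H = np.maximum(Z, 0.0)
    r = H @ w2 - y                  # residuals
    w2 -= lr * H.T @ r / N
    W1 -= lr * X.T @ ((r[:, None] * w2[None, :]) * (Z > 0)) / N
print(loss0, loss())                # empirical risk driven toward zero
```

With the hidden width far exceeding $N$, the empirical NTK Gram matrix stays well-conditioned along the trajectory and plain gradient descent memorizes the random labels, in line with the eigenvalue condition above.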

These findings collectively reveal that memorization efficiency, activation-space geometry, and training dynamics are tightly interwoven in deep ReLU networks. Depth provides parameter efficiency given sufficient width and precision, but the achievable complexity of learned functions in practice is substantially lower than the maximum combinatorial capacity, fundamentally constrained by optimization and input structure.


References:

  • "Deep ReLU Networks Have Surprisingly Few Activation Patterns" (Hanin et al., 2019)
  • "Provable Memorization via Deep Neural Networks using Sub-linear Parameters" (Park et al., 2020)
  • "On the Optimal Memorization Power of ReLU Neural Networks" (Vardi et al., 2021)
  • "Detecting Memorization in ReLU Networks" (Collins et al., 2018)
  • "The Cost of Robustness: Tighter Bounds on Parameter Complexity for Robust Memorization in ReLU Nets" (Kim et al., 28 Oct 2025)
  • "Tight Bounds on the Smallest Eigenvalue of the Neural Tangent Kernel for Deep ReLU Networks" (Nguyen et al., 2020)
  • "Memory capacity of neural networks with threshold and ReLU activations" (Vershynin, 2020)
  • "Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity" (Yun et al., 2018)
  • "Constructive Universal Approximation and Finite Sample Memorization by Narrow Deep ReLU Networks" (Hernández et al., 2024)
