
Logarithmic Over-Parameterization in Neural Networks

Updated 11 November 2025
  • Logarithmic over-parameterization is a regime in which network width exceeds that of a minimal target network by only a logarithmic factor in quantities such as depth, sample size, and approximation accuracy, providing the required expressivity with minimal excess capacity.
  • Under suitable spectral-norm and margin conditions, this logarithmic factor suffices for guarantees on pruning, gradient descent optimization, and tensor decomposition.
  • Empirical validations and combinatorial constructions such as SubsetSum confirm that this modest over-parameterization suffices for high-accuracy learning and effective network pruning in practice.

Logarithmic over-parameterization denotes a regime in neural network design and analysis wherein the network's width (or number of parameters) exceeds that of the minimal target network by only a logarithmic factor in problem-specific parameters, such as layer width, network depth, training sample size, or approximation precision. Recent theoretical work has established that, under suitable conditions, this mild form of over-parameterization is not only sufficient but in certain cases also necessary for achieving expressivity, optimization, and generalization guarantees that previously required polynomial or higher over-parameterization. This has led to a paradigm shift in understanding the structural and algorithmic requirements for the success of neural network pruning, learning, and tensor decomposition.

1. Fundamental Definitions and Problem Setup

The concept of logarithmic over-parameterization is formally instantiated in recent analyses of the lottery ticket hypothesis (LTH), optimization of over-parameterized neural networks via gradient descent, and over-parameterized tensor decomposition. The prototypical definitions and constructions are as follows:

  • Target Network: Given a fully-connected ReLU network $f:\mathbb{R}^{d_0}\to\mathbb{R}^{d_l}$ of depth $l$, with layer widths $d_i$ and weight matrices $W_i$ bounded in spectral norm, the target network is denoted:

$$f(\mathbf{x}) = W_l\,\sigma(W_{l-1}\cdots\sigma(W_1\mathbf{x})\cdots)$$

where $\sigma$ is the ReLU activation and $d = \max_i d_i$.

  • Over-parameterized Random Network: A random network $g$ of depth $2l$ is constructed such that each layer width is expanded by a factor $O(\log(dl/\epsilon))$ compared to the target, with weights initialized i.i.d. from uniform or Gaussian distributions:

$$g(\mathbf{x}) = M_{2l}\,\sigma(M_{2l-1}\cdots\sigma(M_1\mathbf{x})\cdots)$$

  • Pruning and Subnetworks: Pruning is formalized as masking the over-parameterized network with binary masks $(S_j)$, yielding a subnetwork $\hat{g}$; the uniform approximation error is measured as

$$\sup_{\|\mathbf{x}\|\le 1} \|f(\mathbf{x}) - \hat{g}(\mathbf{x})\| \le \epsilon.$$

Logarithmic over-parameterization thus requires that the number of neurons/parameters in the over-parameterized network is only a logarithmic factor larger than in the target network or sample size, up to constants depending on depth, accuracy, and statistical margin (Pensia et al., 2020, Chen et al., 2019).
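
As a concrete illustration of these definitions, the following is a minimal NumPy sketch of a target network $f$, a twice-as-deep random network $g$ whose layers are wider by a logarithmic factor, and a masked subnetwork $\hat{g}$. The constant in the width formula is illustrative, and the mask shown is random: the theorems assert only the existence of a good mask, not that this one achieves a small error.

```python
# Minimal NumPy sketch of the objects defined above: a depth-l target ReLU net f,
# a depth-2l random net g whose layers are wider by a log factor, and a binary-
# masked subnetwork g_hat. The mask here is random, purely to show the mechanics;
# the theory asserts the *existence* of a good mask, not that this one is good.
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def forward(weights, x):
    """Compute W_L relu(W_{L-1} ... relu(W_1 x))."""
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h

d, l, eps = 8, 3, 0.1
target = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(l)]

# Over-parameterized random network: depth 2l, each layer wider by a factor
# proportional to log(d*l/eps) (the constant 1 here is illustrative).
width = int(d * np.ceil(np.log(d * l / eps)))
g = [rng.uniform(-1, 1, size=(width, d))]
g += [rng.uniform(-1, 1, size=(width, width)) for _ in range(2 * l - 2)]
g += [rng.uniform(-1, 1, size=(d, width))]

# Pruning = elementwise binary masks S_j applied to each layer.
masks = [rng.integers(0, 2, size=W.shape) for W in g]
g_hat = [S * W for S, W in zip(masks, g)]

# Empirical proxy for sup_{||x|| <= 1} ||f(x) - g_hat(x)||.
xs = rng.standard_normal((1000, d))
xs /= np.maximum(np.linalg.norm(xs, axis=1, keepdims=True), 1.0)
gap = max(np.linalg.norm(forward(target, x) - forward(g_hat, x)) for x in xs)
print(f"width factor: {width // d}, empirical sup-norm gap: {gap:.3f}")
```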

2. Theoretical Guarantees and Sufficiency

The core advance is rigorous proof that logarithmic over-parameterization suffices for several key objectives:

  • Strong Lottery Ticket Hypothesis (LTH): Any target ReLU network of width $d$ and depth $l$ can be $\epsilon$-approximated, in the uniform norm on the unit ball, by pruning a random network that is a factor $O(\log(dl/\epsilon))$ wider and twice as deep. Specifically, with high probability over the draw of $g$, every target $f$ in a class of spectral-norm-bounded $l$-layer networks admits binary masks $(S_j)$ such that

$$\sup_{\|\mathbf{x}\|\le 1} \|f(\mathbf{x}) - \hat{g}(\mathbf{x})\| < \epsilon.$$

  • Gradient Descent Optimization with Logarithmic Over-Parameterization: In supervised learning with two-layer or deep ReLU nets, once the network width satisfies $m = \Omega(n \log n)$ (for $n$ data samples), gradient descent can achieve empirical risk $O(1/\sqrt{n})$ in the regime where the target function admits a low-rank approximation in the eigenspaces of the associated integral operator (see the training sketch after this list). For deep nets, under margin/separability assumptions, width $m = \widetilde{\Omega}(\mathrm{poly}(L, 1/\gamma) \cdot \mathrm{polylog}(n, 1/\epsilon))$ suffices to guarantee trainability and generalization (Chen et al., 2019, Su et al., 2019).
  • Tensor Decomposition Beyond the Kernel Regime: In $l$-th order tensor decomposition of rank $r$ in $d$ dimensions, a gradient descent algorithm can find an $\epsilon$-approximation with $m = \widetilde{O}(r^{2.5l} \log d)$ components, showing only logarithmic dependence on the dimension $d$, as opposed to the polynomial requirements of kernel or mean-field analyses (Wang et al., 2020).
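
As a rough illustration of the two-layer gradient descent result above, the sketch below trains only the first layer of a ReLU network whose width follows the $m = \lceil n \log n \rceil$ prescription, with a fixed random output layer as in typical NTK-style analyses. The synthetic data, constants, and learning rate are illustrative and not taken from the cited papers.

```python
# Hedged sketch (synthetic data, illustrative constants and learning rate) of
# full-batch gradient descent on a two-layer ReLU net whose width follows the
# m = Omega(n log n) prescription; only the first layer is trained, as in
# typical NTK-style analyses.
import numpy as np

rng = np.random.default_rng(2)

n, d = 200, 10
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # unit-norm inputs
y = np.sign(X[:, 0] + 0.5 * X[:, 1])                 # a simple synthetic target

m = int(np.ceil(n * np.log(n)))                      # width ~ n log n
W = rng.standard_normal((m, d))                      # trainable first layer
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)     # fixed random output layer

lr = 1.0
for step in range(201):
    H = np.maximum(X @ W.T, 0.0)                     # (n, m) hidden activations
    resid = H @ a - y                                # (n,) residuals
    if step % 50 == 0:
        print(f"step {step:3d}  empirical risk {0.5 * np.mean(resid ** 2):.4f}")
    # Gradient of (1/2n) * ||f(X) - y||^2 with respect to W.
    grad = ((resid[:, None] * (H > 0)) * a).T @ X / n
    W -= lr * grad
```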

Table: Logarithmic Over-Parameterization Results

| Problem Class | Sufficient Width | Additional Conditions |
|---|---|---|
| Pruned ReLU net (LTH) | $O(d \log(dl/\epsilon))$ | Spectral-norm bound, depth $l$ |
| Two-layer ReLU net (GD) | $\Omega(n \log n)$ | Zero low-rank residual |
| Deep ReLU net (GD/SGD) | $\widetilde{O}(\mathrm{poly}(L, 1/\gamma)\,\mathrm{polylog}(n, 1/\epsilon))$ | Margin assumption |
| Over-parameterized tensor decomposition | $\widetilde{O}(r^{2.5l} \log d)$ | Algorithmic changes |

The theoretical results are generally nonconstructive with respect to which subnetworks or initializations admit these properties; they guarantee existence with high probability.
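
To make the size of these factors concrete, a quick back-of-the-envelope computation with hypothetical problem sizes (and all constants set to 1) shows that the purely logarithmic factors stay in the tens even at large scale, while the $n \log n$ requirement exceeds the sample size only by a $\log n$ factor:

```python
# Back-of-the-envelope check (constants of 1, hypothetical problem sizes) of how
# mild the logarithmic factors in the table above are for realistic scales.
import math

d, l, eps = 1000, 10, 0.01      # hypothetical target width, depth, accuracy
n = 50_000                      # hypothetical sample size

print("LTH width factor  log(d*l/eps)      ~", round(math.log(d * l / eps), 1))
print("Two-layer GD      n*log(n)          ~", int(n * math.log(n)))
print("polylog overhead  log(n)*log(1/eps) ~", round(math.log(n) * math.log(1 / eps), 1))
```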

3. Proof Strategies Leveraging SubsetSum and Concentration

The achievability of logarithmic over-parameterization is established via a modular proof architecture unifying combinatorial, functional-analytic, and probabilistic arguments:

  • Combinatorial Embedding via SubsetSum: For each target weight $w \in [-1,1]$, $n = O(\log(1/\epsilon))$ i.i.d. random features suffice so that any $w$ can be approximated by a sum over a subset of these features with error $\le \epsilon$ (Lueker, 1998); a toy illustration appears after this list. This is mapped to a small two-layer ReLU subnetwork by carefully splitting positive and negative parts, enabling explicit construction of prunings that implement arbitrary weights.
  • Hierarchical Block Construction: Weights are grouped to efficiently approximate neurons, then layers, then the entire network. The key is that each level of the hierarchy contributes only a logarithmic overhead factor, resulting in overall network widths scaling as $O(d \log(dl/\epsilon))$ (Pensia et al., 2020).
  • Spectral/Kernel Control for Learning: In over-parameterized learning via GD, the analysis relies on concentration inequalities for empirical NTK kernels and their proximity to population-level integral operators. Ensuring that the spectrum of the empirical kernel matrix remains sufficiently separated over $T = O(\log n)$ iterations forces a $\log n$-factor overhead in width to guarantee uniform control over all iterations with high probability (Su et al., 2019).
  • Regularization and Low-Rank Structure for Tensors: Escaping the kernel or “lazy training” regime in over-parameterized tensor decomposition requires explicit algorithmic interventions: a 2-homogeneous reparameterization, an $\ell_2$-regularizer on the components $u_i$, periodic reinitialization, and controlled step sizes. These enable gradient descent to exploit the low-rank structure and make the convergence depend only logarithmically on the ambient dimension (Wang et al., 2020).
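
As promised in the SubsetSum item above, here is a toy, brute-force illustration of the combinatorial embedding step. It is not the papers' construction, and the constant in front of the logarithm is illustrative; it simply checks that with $n = O(\log(1/\epsilon))$ i.i.d. uniform features, a grid of target weights in $[-1,1]$ is typically approximated by subset sums to the desired accuracy.

```python
# Toy, brute-force illustration (not the papers' construction; the constant in
# front of the logarithm is illustrative) of the SubsetSum step: with
# n = O(log(1/eps)) i.i.d. Uniform[-1, 1] features, any target weight w in
# [-1, 1] can typically be matched by some subset sum to error <= eps.
import itertools
import numpy as np

rng = np.random.default_rng(1)

def best_subset_sum_error(w, features):
    """Exhaustive search over all subsets (exponential, fine for small n)."""
    best = abs(w)  # the empty subset
    for k in range(1, len(features) + 1):
        for subset in itertools.combinations(features, k):
            best = min(best, abs(w - sum(subset)))
    return best

eps = 0.01
n = int(2 * np.ceil(np.log2(1 / eps)))    # n ~ log(1/eps); constant 2 illustrative
features = rng.uniform(-1, 1, size=n)

worst = max(best_subset_sum_error(w, features) for w in np.linspace(-1, 1, 41))
print(f"n = {n} features, worst grid error = {worst:.4f} (target eps = {eps})")
```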

4. Optimality and Lower Bounds

Analysis demonstrates that logarithmic over-parameterization is not only sufficient but, up to constants, necessary:

  • Lower Bound for Pruning: Any depth-2 random ReLU network $g:\mathbb{R}^d\to\mathbb{R}^d$ of width $m$ that can, via pruning, uniformly $\epsilon$-approximate every $d\times d$ norm-1 linear operator with probability $\ge 1/2$ must have $m = \Omega(d \log(1/\epsilon))$ (Pensia et al., 2020). This follows from a covering-number argument, summarized after this list: the number of $\epsilon$-balls needed to cover all such operators is $\exp(\Omega(d^2\log(1/\epsilon)))$, far exceeding the at most $2^{O(dm)}$ distinct subnetworks obtainable by pruning unless $m$ scales at least logarithmically in $1/\epsilon$.
  • Necessity in Optimization and Approximation: In learning theory, the requirement $m = \Omega(n \log n)$ is driven by the need to control kernel-matrix concentration over $O(\log n)$ iterations, with union bounds again imposing logarithmic slack on top of the linear scaling in sample size (Su et al., 2019). In tensor decomposition, the kernel-regime lower bound $m = \Omega(d^{l-1})$ is dramatically improved to $m = \widetilde{O}(r^{2.5l} \log d)$ only by leveraging the low-rank structure and appropriate regularization (Wang et al., 2020).
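
In compressed form, the counting argument behind the pruning lower bound (constants suppressed, following the description above) compares the number of distinct prunings of $g$ with the covering number of the class of norm-1 linear operators:

$$\underbrace{2^{O(dm)}}_{\#\text{ subnetworks}} \;\ge\; \underbrace{(1/\epsilon)^{\Omega(d^2)}}_{\#\ \epsilon\text{-balls}} \quad\Longrightarrow\quad dm = \Omega\!\left(d^2 \log(1/\epsilon)\right) \quad\Longrightarrow\quad m = \Omega\!\left(d \log(1/\epsilon)\right).$$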

A plausible implication is that further reductions below the logarithmic threshold would require either fundamentally new algorithmic insights or stronger assumptions about the data or labels.

5. Experimental Validation and Practical Implications

Empirical results corroborate the theoretical predictions that logarithmic over-parameterization is both sufficient and, within tested regimes, necessary:

  • Weight Pruning via SubsetSum: On a two-layer MLP (500 hidden units, ~400K weights) trained on MNIST, replacing each weight by the sum of 21 random features selected via SubsetSum minimization does not degrade test accuracy (97.19%), despite the combinatorial search. This process verifies that the SubsetSum embedding is algorithmically and practically viable at network scale, albeit with substantial computational cost (~21.5h on 36 CPU cores for all weights) (Pensia et al., 2020).
  • Pruning Random Overparameterized Networks: In 2-layer and 4-layer MLPs and LeNet5 architectures, “structured” networks constructed with SubsetSum-inspired wide hidden layers, when pruned, outperform standard random networks pruned to equivalent budgets, confirming that logarithmic over-parameterization admits efficient approximation post-pruning.
  • Sample Complexity and Practical Widths: For modern network sizes ($m$ in the hundreds, $n \sim 10^4$), the polylogarithmic width guarantees suggest that, under appropriate assumptions (notably the margin condition), practical networks need not be over-parameterized beyond modest multiples of $n$ or $d$.
  • Algorithmic Limitations: The combinatorial cost of finding optimal pruning masks or solving SubsetSum instances precludes naïve approaches for large-scale training. In learning settings, existing guarantees rely on margin/separability assumptions, which may not hold for real data.

6. Extensions, Limitations, and Open Directions

While logarithmic over-parameterization has reshaped understanding of expressivity and trainability, several important caveats and areas for further research remain:

  • Extensions Beyond ReLU and Supervised Settings: The theory to date is almost entirely focused on fully-connected ReLU networks. The extension to architectures incorporating batch normalization, residual connections, or convolutions remains unaddressed.
  • Dependence on Margin and Data Separability: The sharpest polylogarithmic over-parameterization results require explicit margin assumptions in the NTK-induced feature space. For general data distributions without such separability, required widths revert to polynomial regimes, and the closing of this gap is an outstanding open question (Chen et al., 2019).
  • High-Order Tensor and Mean-Field Regimes: For tensor decomposition, breaking the NTK regime and achieving logarithmic dependence requires low-rank structure; for generic high-order tensors or functions without exploitable structure, over-parameterization requirements may remain polynomial or worse (Wang et al., 2020).
  • Algorithmic Feasibility: While existence proofs establish the sufficiency of logarithmic over-parameterization, efficient algorithmic identification of optimal prunings or subnetworks remains intractable for large networks; in practice, proxy heuristics such as edge-popup or magnitude-based pruning are used (a minimal sketch follows this list), but their theoretical properties are not fully understood.
  • Representation Learning and Feature Learning: NTK-regime guarantees, which underpin many logarithmic width results, do not capture the phenomenon of representation learning observed in real-world neural networks. The conditions under which polylogarithmic-width networks can learn useful features beyond those available at random initialization remain unclear (Chen et al., 2019).
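
As a point of reference for the heuristics just mentioned, below is a minimal sketch of magnitude-based pruning. The thresholding rule is a generic illustration of the idea, not a specific published algorithm.

```python
# Minimal sketch of the magnitude-based pruning proxy mentioned above: keep the
# largest-magnitude fraction of weights and zero out the rest. The thresholding
# rule is a generic choice for illustration, not a specific published algorithm.
import numpy as np

def magnitude_prune_mask(W, keep_fraction=0.05):
    """Return a binary mask keeping the top keep_fraction of |W| entries."""
    k = max(1, int(keep_fraction * W.size))
    threshold = np.partition(np.abs(W).ravel(), -k)[-k]
    return (np.abs(W) >= threshold).astype(W.dtype)

rng = np.random.default_rng(3)
W = rng.standard_normal((512, 512))
mask = magnitude_prune_mask(W, keep_fraction=0.05)
print("kept weights:", int(mask.sum()), "of", W.size)
```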

A plausible implication is that closing the gap between existence and efficient algorithmic realizability—or between NTK regime and feature learning—represents a key frontier in the theory of over-parameterized neural networks.
