On the Power of Over-parametrization in Neural Networks with Quadratic Activation (1803.01206v2)

Published 3 Mar 2018 in cs.LG, cs.AI, math.OC, and stat.ML

Abstract: We provide new theoretical insights on why over-parametrization is effective in learning neural networks. For a $k$ hidden node shallow network with quadratic activation and $n$ training data points, we show as long as $ k \ge \sqrt{2n}$, over-parametrization enables local search algorithms to find a \emph{globally} optimal solution for general smooth and convex loss functions. Further, despite that the number of parameters may exceed the sample size, using theory of Rademacher complexity, we show with weight decay, the solution also generalizes well if the data is sampled from a regular distribution such as Gaussian. To prove when $k\ge \sqrt{2n}$, the loss function has benign landscape properties, we adopt an idea from smoothed analysis, which may have other applications in studying loss surfaces of neural networks.

Citations (259)

Summary

  • The paper demonstrates that over-parameterization with quadratic activations leads to a loss landscape where all local minima are global, ensuring effective gradient descent optimization.
  • It establishes two over-parameterization regimes, one requiring width at least the input dimension and another, milder regime (roughly $k \ge \sqrt{2n}$) that adds a small random perturbation to the objective, both of which guarantee that all saddle points are strict and all local minima are global.
  • The study shows that weight decay regularizes the model by controlling the Frobenius norm, which results in strong generalization even in large networks.

This paper, "On the Power of Over-parametrization in Neural Networks with Quadratic Activation" (On the Power of Over-parametrization in Neural Networks with Quadratic Activation, 2018), investigates two key empirical phenomena observed in deep learning: why simple optimization algorithms like gradient descent can effectively train highly non-convex neural networks, and how these over-parameterized models generalize well despite having more parameters than training data points. The paper focuses specifically on a shallow neural network architecture using a quadratic activation function z2z^2 and a fixed second layer averaging the hidden node outputs, trained with an L2 weight decay regularization term.

The paper's main contributions are divided into two parts: optimization landscape analysis and generalization theory.

For optimization, the paper provides theoretical guarantees on the structure of the empirical loss landscape. It shows that under certain conditions of over-parameterization, the loss function exhibits desirable properties: all local minima are also global minima, and all saddle points are "strict" (meaning there's a direction of negative curvature). These properties are crucial because they ensure that standard first-order methods like gradient descent, when initialized randomly, are likely to converge to a global optimum. The paper analyzes two distinct conditions for over-parameterization that lead to this benign landscape:

  1. **Width at least the input dimension:** When the number of hidden nodes $k$ is at least the input dimension $d$ ($k \ge d$), the non-perturbed training loss (with L2 regularization) has the desired landscape properties for any convex, twice-differentiable loss function and any dataset.
  2. **Milder over-parameterization with perturbation:** A more novel finding is that the benign landscape properties can also be achieved under a milder over-parameterization condition, namely $k(k+1)/2 > n$, where $n$ is the number of training data points. This condition can be significantly less stringent than $k \ge d$ or $k \ge n$, especially when $n$ is small compared to $d^2$. To prove this, the paper introduces a perturbed version of the training loss by adding a term $\langle C, W^\top W \rangle$, where $C$ is a small, random positive semidefinite matrix (sketched in code after this list). The core idea, borrowed from smoothed analysis, is that this small perturbation ensures the optimization problem avoids a measure-zero set of ill-behaved landscapes, so that with high probability all local minima are global and all saddle points are strict. Furthermore, the paper shows that the optimal value of the perturbed objective is arbitrarily close to the original objective's minimum when the perturbation magnitude $\delta$ is small.
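Building on the NumPy sketch above, the perturbed objective can be written as follows. The `random_psd` helper is a hypothetical construction of a small PSD matrix $C$, not necessarily the paper's specific perturbation distribution.

```python
def random_psd(d, delta, rng=None):
    # One simple way to draw a random PSD matrix with spectral norm at most delta
    # (assumption: the paper's exact perturbation distribution may differ).
    rng = np.random.default_rng() if rng is None else rng
    A = rng.standard_normal((d, d))
    C = A @ A.T
    return delta * C / np.linalg.norm(C, 2)

def perturbed_loss(W, X, y, lam, C):
    # Adds the smoothing term <C, W^T W> = tr(C W^T W) to the regularized objective.
    return regularized_loss(W, X, y, lam) + np.trace(C @ W.T @ W)
```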

These results imply that for quadratic networks, over-parameterization makes the non-convex optimization problem behave more like a convex one, facilitating training with simple gradient-based methods.
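As a toy illustration of such simple gradient-based methods, the loop below runs plain gradient descent on the regularized objective sketched earlier. The analytic gradient assumes the squared loss used above; the step size, iteration count, and initialization scale are arbitrary choices, not values from the paper.

```python
def grad_loss(W, X, y, lam):
    # Gradient of regularized_loss for the squared-loss case:
    # d/dW [(1/(2n)) sum_i r_i^2] = (2/(n k)) * W @ (sum_i r_i x_i x_i^T), plus lam * W.
    n, k = X.shape[0], W.shape[0]
    r = quadratic_net(W, X) - y            # residuals, shape (n,)
    S = X.T @ (r[:, None] * X)             # sum_i r_i x_i x_i^T, shape (d, d)
    return (2.0 / (n * k)) * (W @ S) + lam * W

def train(X, y, k, lam=1e-3, lr=1e-2, steps=5000, seed=0):
    # Plain gradient descent from a random initialization (no special tricks).
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((k, X.shape[1]))
    for _ in range(steps):
        W -= lr * grad_loss(W, X, y, lam)
    return W
```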

For generalization, the paper leverages the role of weight decay (L2 regularization). The L2 regularization term $\frac{\lambda}{2}\|W\|_F^2$ encourages the learned weight matrix $W$ to have a small Frobenius norm. The key observation is that $\|W\|_F^2$ equals the trace of $W^\top W$, so controlling the Frobenius norm of $W$ controls the nuclear norm of $W^\top W$. The paper then uses Rademacher complexity bounds to quantify the generalization ability of the learned network.
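The link between the two norms is immediate once one notes that $W^\top W$ is positive semidefinite, so its eigenvalues coincide with its singular values: $\|W\|_F^2 = \operatorname{tr}(W^\top W) = \sum_j \lambda_j(W^\top W) = \|W^\top W\|_*$.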

The derived generalization bound depends on an upper bound $M$ on the Frobenius norm of the learned weight matrix $W$ (i.e., $\|W\|_F \le M$) and on properties of the input data distribution.

  • For inputs from a bounded domain ($\|x\|_2 \le b$), the generalization bound is $O\!\left(LM^2 b^2 \sqrt{\frac{\log d}{n}}\right)$, showing a dependence that is quadratic in the data radius $b$.
  • For inputs sampled from specific "benign" distributions such as the isotropic Gaussian ($x_i \sim N(0, I)$), and assuming $n \ge d \log d$, the bound improves to $O\!\left(LM^2 \sqrt{\frac{d}{n}}\right)$. This $\sqrt{d/n}$ dependence matches the parametric rate often seen in linear models and is much more favorable when $d$ is large (a rough numeric comparison is sketched below).
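To get a rough feel for the gap between the two rates, the following back-of-the-envelope computation drops constants and the common $LM^2$ factor, picks arbitrary values of $d$ and $n$, and assumes $b \approx \sqrt{d}$ for roughly isotropic inputs; all of these are illustrative assumptions, not values from the paper.

```python
import math

d, n = 100, 10_000                 # illustrative dimension and sample size (not from the paper)
b = math.sqrt(d)                   # assume ||x||_2 ~ sqrt(d) for roughly isotropic inputs

bounded_rate = b**2 * math.sqrt(math.log(d) / n)   # O(L M^2 b^2 sqrt(log d / n)), constants dropped
gaussian_rate = math.sqrt(d / n)                   # O(L M^2 sqrt(d / n)), constants dropped

print(f"{bounded_rate:.3f} vs {gaussian_rate:.3f}")   # ~2.146 vs 0.100 at this (d, n)
```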

These generalization results suggest that even with over-parameterization ($k$ potentially large), weight decay acts as an effective regularizer by limiting the complexity of the learned function class (implicitly, via controlling the norm of $W$), leading to good out-of-sample performance, especially when the data distribution has controlled properties (like bounded fourth moments for the Gaussian case).

In summary, this theoretical work provides a concrete framework using quadratic activation to demonstrate how over-parameterization can fundamentally change the optimization landscape to be more tractable for local search methods, and how standard weight decay enables learned solutions to generalize well, supporting common practices in training modern neural networks. While the quadratic activation is a simplification, the techniques employed (smoothed analysis, Rademacher complexity bounds tied to matrix norms) offer insights potentially applicable to more complex network architectures and activations.