On the Power of Over-parametrization in Neural Networks with Quadratic Activation (1803.01206v2)

Published 3 Mar 2018 in cs.LG, cs.AI, math.OC, and stat.ML

Abstract: We provide new theoretical insights on why over-parametrization is effective in learning neural networks. For a $k$ hidden node shallow network with quadratic activation and $n$ training data points, we show as long as $ k \ge \sqrt{2n}$, over-parametrization enables local search algorithms to find a \emph{globally} optimal solution for general smooth and convex loss functions. Further, despite that the number of parameters may exceed the sample size, using theory of Rademacher complexity, we show with weight decay, the solution also generalizes well if the data is sampled from a regular distribution such as Gaussian. To prove when $k\ge \sqrt{2n}$, the loss function has benign landscape properties, we adopt an idea from smoothed analysis, which may have other applications in studying loss surfaces of neural networks.

Citations (259)

Summary

  • The paper demonstrates that over-parameterization with quadratic activations leads to a loss landscape where all local minima are global, ensuring effective gradient descent optimization.
  • It establishes two over-parameterization regimes, one requiring width at least the input dimension and another, milder regime (roughly $k \ge \sqrt{2n}$) that adds a small random perturbation to the objective, both of which guarantee that all saddle points are strict and all local minima are global.
  • The study shows that weight decay regularizes the model by controlling the Frobenius norm, which results in strong generalization even in large networks.

This paper, "On the Power of Over-parametrization in Neural Networks with Quadratic Activation" (On the Power of Over-parametrization in Neural Networks with Quadratic Activation, 2018), investigates two key empirical phenomena observed in deep learning: why simple optimization algorithms like gradient descent can effectively train highly non-convex neural networks, and how these over-parameterized models generalize well despite having more parameters than training data points. The paper focuses specifically on a shallow neural network architecture using a quadratic activation function z2z^2 and a fixed second layer averaging the hidden node outputs, trained with an L2 weight decay regularization term.

The paper's main contributions are divided into two parts: optimization landscape analysis and generalization theory.

For optimization, the paper provides theoretical guarantees on the structure of the empirical loss landscape. It shows that under certain conditions of over-parameterization, the loss function exhibits desirable properties: all local minima are also global minima, and all saddle points are "strict" (meaning there's a direction of negative curvature). These properties are crucial because they ensure that standard first-order methods like gradient descent, when initialized randomly, are likely to converge to a global optimum. The paper analyzes two distinct conditions for over-parameterization that lead to this benign landscape:

  1. **Width at least the input dimension:** When the number of hidden nodes $k$ is at least the input dimension $d$ ($k \ge d$), the non-perturbed training loss (with L2 regularization) has the desired landscape properties for any convex, twice-differentiable loss function and any dataset.
  2. **Milder over-parameterization with perturbation:** A more novel finding is that the benign landscape properties can also be achieved under a milder over-parameterization condition, namely $k(k+1)/2 > n$, where $n$ is the number of training data points. This condition can be significantly less stringent than $k \ge d$ or $k \ge n$, especially when $n$ is small compared to $d^2$. To prove this, the paper introduces a perturbed version of the training loss by adding a term $\langle C, W^\top W \rangle$, where $C$ is a small, random positive semidefinite matrix (sketched in code after this list). The core idea, borrowed from smoothed analysis, is that this small perturbation ensures the optimization problem avoids a measure-zero set of ill-behaved landscapes, so that with high probability all local minima are global and all saddle points are strict. Furthermore, the paper shows that the optimal value of the perturbed objective is arbitrarily close to the original objective's minimum when the perturbation magnitude $\delta$ is small.
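Building on the NumPy sketch above, the perturbed objective can be written as follows. The `random_psd` helper is a hypothetical construction of a small PSD matrix $C$, not necessarily the paper's specific perturbation distribution.

```python
def random_psd(d, delta, rng=None):
    # One simple way to draw a random PSD matrix with spectral norm at most delta
    # (assumption: the paper's exact perturbation distribution may differ).
    rng = np.random.default_rng() if rng is None else rng
    A = rng.standard_normal((d, d))
    C = A @ A.T
    return delta * C / np.linalg.norm(C, 2)

def perturbed_loss(W, X, y, lam, C):
    # Adds the smoothing term <C, W^T W> = tr(C W^T W) to the regularized objective.
    return regularized_loss(W, X, y, lam) + np.trace(C @ W.T @ W)
```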

These results imply that for quadratic networks, over-parameterization makes the non-convex optimization problem behave more like a convex one, facilitating training with simple gradient-based methods.
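As a toy illustration of such simple gradient-based methods, the loop below runs plain gradient descent on the regularized objective sketched earlier. The analytic gradient assumes the squared loss used above; the step size, iteration count, and initialization scale are arbitrary choices, not values from the paper.

```python
def grad_loss(W, X, y, lam):
    # Gradient of regularized_loss for the squared-loss case:
    # d/dW [(1/(2n)) sum_i r_i^2] = (2/(n k)) * W @ (sum_i r_i x_i x_i^T), plus lam * W.
    n, k = X.shape[0], W.shape[0]
    r = quadratic_net(W, X) - y            # residuals, shape (n,)
    S = X.T @ (r[:, None] * X)             # sum_i r_i x_i x_i^T, shape (d, d)
    return (2.0 / (n * k)) * (W @ S) + lam * W

def train(X, y, k, lam=1e-3, lr=1e-2, steps=5000, seed=0):
    # Plain gradient descent from a random initialization (no special tricks).
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((k, X.shape[1]))
    for _ in range(steps):
        W -= lr * grad_loss(W, X, y, lam)
    return W
```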

For generalization, the paper leverages the role of weight decay (L2 regularization). The L2 regularization term $\frac{\lambda}{2}\|W\|_F^2$ encourages the learned weight matrix $W$ to have a small Frobenius norm. The key observation is that $\|W\|_F^2$ equals the trace of $W^\top W$, so controlling the Frobenius norm of $W$ controls the nuclear norm of $W^\top W$. The paper then uses Rademacher complexity bounds to quantify the generalization ability of the learned network.
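The link between the two norms is immediate once one notes that $W^\top W$ is positive semidefinite, so its eigenvalues coincide with its singular values: $\|W\|_F^2 = \operatorname{tr}(W^\top W) = \sum_j \lambda_j(W^\top W) = \|W^\top W\|_*$.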

The derived generalization bound depends on an upper bound $M$ on the Frobenius norm of the learned weight matrix $W$ (i.e., $\|W\|_F \le M$) and on properties of the input data distribution.

  • For inputs from a bounded domain ($\|x\|_2 \le b$), the generalization bound is $O\!\left(LM^2 b^2 \sqrt{\frac{\log d}{n}}\right)$, showing a dependence that is quadratic in the data radius $b$.
  • For inputs sampled from specific "benign" distributions such as the isotropic Gaussian ($x_i \sim N(0, I)$), and assuming $n \ge d \log d$, the bound improves to $O\!\left(LM^2 \sqrt{\frac{d}{n}}\right)$. This $\sqrt{d/n}$ dependence matches the parametric rate often seen in linear models and is much more favorable when $d$ is large (a rough numeric comparison is sketched below).
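To get a rough feel for the gap between the two rates, the following back-of-the-envelope computation drops constants and the common $LM^2$ factor, picks arbitrary values of $d$ and $n$, and assumes $b \approx \sqrt{d}$ for roughly isotropic inputs; all of these are illustrative assumptions, not values from the paper.

```python
import math

d, n = 100, 10_000                 # illustrative dimension and sample size (not from the paper)
b = math.sqrt(d)                   # assume ||x||_2 ~ sqrt(d) for roughly isotropic inputs

bounded_rate = b**2 * math.sqrt(math.log(d) / n)   # O(L M^2 b^2 sqrt(log d / n)), constants dropped
gaussian_rate = math.sqrt(d / n)                   # O(L M^2 sqrt(d / n)), constants dropped

print(f"{bounded_rate:.3f} vs {gaussian_rate:.3f}")   # ~2.146 vs 0.100 at this (d, n)
```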

These generalization results suggest that even with over-parameterization ($k$ potentially large), weight decay acts as an effective regularizer by limiting the complexity of the learned function class (implicitly, via controlling the norm of $W$), leading to good out-of-sample performance, especially when the data distribution has controlled properties (like bounded fourth moments for the Gaussian case).

In summary, this theoretical work provides a concrete framework using quadratic activation to demonstrate how over-parameterization can fundamentally change the optimization landscape to be more tractable for local search methods, and how standard weight decay enables learned solutions to generalize well, supporting common practices in training modern neural networks. While the quadratic activation is a simplification, the techniques employed (smoothed analysis, Rademacher complexity bounds tied to matrix norms) offer insights potentially applicable to more complex network architectures and activations.