Globally Optimal Gradient Descent for a ConvNet with Gaussian Inputs (1702.07966v1)

Published 26 Feb 2017 in cs.LG, math.OC, and stat.ML

Abstract: Deep learning models are often successfully trained using gradient descent, despite the worst case hardness of the underlying non-convex optimization problem. The key question is then under what conditions can one prove that optimization will succeed. Here we provide a strong result of this kind. We consider a neural net with one hidden layer and a convolutional structure with no overlap and a ReLU activation function. For this architecture we show that learning is NP-complete in the general case, but that when the input distribution is Gaussian, gradient descent converges to the global optimum in polynomial time. To the best of our knowledge, this is the first global optimality guarantee of gradient descent on a convolutional neural network with ReLU activations.

Authors (2)
  1. Alon Brutzkus (10 papers)
  2. Amir Globerson (87 papers)
Citations (308)

Summary

  • The paper demonstrates that while learning the filter is NP-complete for arbitrary data, Gaussian inputs make the optimization tractable.
  • It derives a closed-form expression for the population risk in a no-overlap ConvNet and proves gradient descent converges in O(1/ε²) iterations.
  • Empirical results confirm GD’s global convergence on Gaussian data and highlight performance gaps when data deviates from these assumptions.

This paper (Brutzkus et al., 2017) investigates the optimization landscape and convergence properties of gradient descent for training a specific type of convolutional neural network (ConvNet) with a single hidden layer and ReLU activation. The core motivation is to theoretically understand why gradient descent is empirically successful for deep learning despite the non-convex nature of the objective functions. The paper focuses on the realizable case, where the training data is generated by the true network, and analyzes the population risk (expected squared error over the data distribution) rather than the empirical risk.

The specific architecture studied is a simplified ConvNet called a "no-overlap network". It consists of:

  1. Applying a single filter $w \in \mathbb{R}^m$ to $k = d/m$ non-overlapping segments of the input vector $x \in \mathbb{R}^d$.
  2. Passing the results through an element-wise ReLU activation function ($\sigma(z) = \max\{0, z\}$).
  3. Averaging the outputs of the $k$ hidden units (average pooling). The network output is $f(x; w) = \frac{1}{k} \sum_{i=1}^{k} \sigma(w \cdot x_i)$, where $x_i$ is the $i$-th non-overlapping segment of $x$.
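Concretely, the forward pass above fits in a few lines of NumPy (a minimal illustrative sketch, not the authors' code):

```python
import numpy as np

def no_overlap_net(x, w):
    """Forward pass of the no-overlap network: split x into k = d/m
    non-overlapping segments, apply the single filter w with ReLU,
    then average-pool the k hidden units."""
    d, m = x.shape[0], w.shape[0]
    assert d % m == 0, "input dimension must be a multiple of the filter size"
    segments = x.reshape(-1, m)             # k rows, one per segment x_i
    hidden = np.maximum(segments @ w, 0.0)  # sigma(w . x_i) for each segment
    return hidden.mean()                    # average pooling over k units
```

For example, with $x = (1, -1, 2, 3)$ and $w = (1, 0)$ the two segment pre-activations are $1$ and $2$, so the output is $1.5$.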

The paper presents two main theoretical results demonstrating a significant gap between the worst-case hardness and distribution-dependent tractability of learning this architecture:

  1. Worst-Case Hardness: The paper shows that learning the weight vector $w$ by minimizing the population risk is NP-complete if the input distribution is arbitrary. This is proven by a reduction from a variant of the Set-Splitting problem (Set-Splitting-by-k-Sets), which is shown to be NP-complete for $k \geq 2$. The practical implication is that without assumptions on the data distribution, finding the globally optimal weights for even this simple network is computationally intractable in the worst case.
  2. Distribution-Dependent Tractability (Gaussian Inputs): If the entries of the input $x$ are independent and identically distributed (IID) Gaussian random variables (zero mean and unit variance for simplicity), the picture changes dramatically. The paper shows that in this case the population risk $\ell(w)$ has a closed-form expression (derived using a known kernel for ReLU activations on Gaussian inputs, Lemmas 3.1 and 3.2). Analyzing this loss function (Lemma 4.1), they find that for $k > 1$ it has a non-differentiable point and a local maximum at $w = 0$, a unique global minimum at $w = w^*$, and a single degenerate saddle point. Critically, they prove (Theorem 4.2) that simple gradient descent, initialized uniformly at random from the unit sphere, converges to an $\epsilon$-accurate solution (in terms of loss) in $O(1/\epsilon^2)$ iterations with high probability. This yields polynomial-time convergence for the k-Non-Overlap-Opt problem (Corollary 4.3), meaning GD can find a near-optimal solution efficiently under the Gaussian assumption.
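The closed form behind Lemma 3.2 rests on the known identity $\mathbb{E}[\sigma(u \cdot x)\,\sigma(v \cdot x)] = \frac{1}{2\pi}\|u\|\|v\|\big(\sin\theta_{u,v} + (\pi - \theta_{u,v})\cos\theta_{u,v}\big)$ for $x \sim \mathcal{N}(0, I)$, together with the independence of the $k$ segments. The sketch below reconstructs the population risk from these facts; it follows the summary above and is not copied from the paper's notation:

```python
import numpy as np

def relu_kernel(u, v):
    """E[ReLU(u.x) ReLU(v.x)] for x ~ N(0, I)."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    theta = np.arccos(np.clip(u @ v / (nu * nv), -1.0, 1.0))
    return nu * nv * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

def population_risk(w, w_star, k):
    """E[(f(x; w) - f(x; w_star))^2] for the no-overlap network with k
    independent Gaussian segments (reconstructed closed form)."""
    # Within-segment terms use the kernel; cross-segment terms factorize
    # because E[ReLU(u.x)] = ||u|| / sqrt(2*pi) for x ~ N(0, I).
    quad = relu_kernel(w, w) - 2 * relu_kernel(w, w_star) + relu_kernel(w_star, w_star)
    cross = (np.linalg.norm(w) - np.linalg.norm(w_star)) ** 2 / (2 * np.pi)
    return quad / k + (k - 1) * cross / k
```

As a sanity check, the risk vanishes at $w = w^*$ and, for $w = -w^*$ with $\|w^*\| = 1$, evaluates to $1/k$.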

Practical Implications for Implementation:

  • Data is Key: The most crucial takeaway is that the nature of the input data distribution fundamentally impacts the feasibility of finding the global optimum using gradient descent. For the specific no-overlap architecture, Gaussian data makes the non-convex optimization problem tractable for GD, while arbitrary data makes it NP-hard.
  • Algorithm Choice: For the no-overlap case with Gaussian-like data, standard gradient descent (or variants like AdaGrad as used in experiments) is theoretically guaranteed to converge to the global minimum. For non-Gaussian data, or for more complex architectures, simple GD may get stuck in local minima.
  • Optimization Landscape: The theoretical analysis provides insight into the loss surface under Gaussian inputs, characterizing the types of critical points. This understanding can potentially inform the development of more sophisticated optimization algorithms if needed, although the paper shows simple GD suffices here.
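A hypothetical end-to-end sketch of the tractable regime (function name and hyperparameters are illustrative; the paper's guarantee is for the population risk, whereas this runs full-batch GD on a large empirical sample in the realizable case):

```python
import numpy as np

def train_no_overlap(w_star, k, lr=0.2, steps=3000, n=2000, seed=0):
    """Full-batch gradient descent on the empirical squared loss of the
    no-overlap network, labels generated by the true filter w_star,
    initialization drawn uniformly from the unit sphere."""
    rng = np.random.default_rng(seed)
    m = w_star.shape[0]
    x = rng.standard_normal((n, k, m))            # n inputs, k Gaussian segments each
    y = np.maximum(x @ w_star, 0.0).mean(axis=1)  # labels from the true network
    w = rng.standard_normal(m)
    w /= np.linalg.norm(w)                        # random init on the unit sphere
    for _ in range(steps):
        h = x @ w                                 # (n, k) pre-activations
        resid = np.maximum(h, 0.0).mean(axis=1) - y
        # gradient of mean squared error w.r.t. w (ReLU derivative taken as 0 at 0)
        grad = 2.0 * (resid[:, None] * ((h > 0)[:, :, None] * x).mean(axis=1)).mean(axis=0)
        w -= lr * grad
    return w
```

With Gaussian inputs the recovered `w` should approach `w_star`, consistent with the theory; on a "hard" non-Gaussian dataset the same loop can stall at a suboptimal point.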

Empirical Evidence:

The paper includes empirical experiments (Section 5) to support the theoretical findings. They construct a "hard" dataset based on the NP-hardness reduction from Set-Splitting and an "easy" dataset using IID Gaussian inputs, both labeled by the same ground-truth filter $w^*$. Training is performed using AdaGrad on the empirical risk for both datasets. The results (Figure 4) show that AdaGrad gets stuck at a suboptimal point on the hard dataset, while it converges to the global minimum ($w^*$) on the Gaussian dataset. This vividly demonstrates the practical difference the distribution makes.

Limitations and Extensions (Overlapping Filters):

The paper then explores what happens when the "no-overlap" restriction is relaxed, considering networks with overlapping filters (Section 6). They show that even for a simple 2D filter with stride 1, the loss function becomes more complex, and crucially, simple gradient descent can get stuck in suboptimal regions (demonstrated analytically for a 2D case and empirically with Figure 5 showing a prominent suboptimal region). This suggests that the strong global convergence guarantee of simple GD is specific to the no-overlap case with Gaussian data.

However, empirical studies on overlapping filters across various dimensions, filter sizes, and strides (Section 6.2) suggest that the basin of attraction of the unique global minimum still occupies a non-negligible portion of the initialization space (estimated at probability $\geq 1/17$). This practical finding suggests that while simple GD isn't guaranteed to find the global optimum from any random initialization, running GD with a small number of random restarts could still find the global minimum with high probability for overlapping filters with Gaussian data.
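The restart strategy follows directly from the basin estimate: if a single unit-sphere initialization lands in the global basin with probability $\geq 1/17$, then $50$ restarts all miss it with probability $\leq (16/17)^{50} \approx 0.05$. A generic wrapper (hypothetical sketch; `train_fn` and `loss_fn` stand in for any GD/AdaGrad routine and its evaluation loss):

```python
import numpy as np

def best_of_restarts(train_fn, loss_fn, m, restarts=50, seed=0):
    """Run training from several fresh unit-sphere initializations and
    keep the result with the smallest loss."""
    rng = np.random.default_rng(seed)
    best_w, best_loss = None, np.inf
    for _ in range(restarts):
        w0 = rng.standard_normal(m)
        w0 /= np.linalg.norm(w0)   # uniform direction on the unit sphere
        w = train_fn(w0)           # placeholder for the actual optimizer
        cur = loss_fn(w)
        if cur < best_loss:
            best_w, best_loss = w, cur
    return best_w, best_loss
```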

Implementation Considerations:

  • The analysis is for population risk. Applying these results directly to empirical risk requires assuming the empirical risk landscape closely approximates the population risk landscape, which is often justified for large datasets (e.g., using concentration bounds, though this is left for future formal proof).
  • The results are for a single hidden layer and average pooling. Extending them to deeper networks or max pooling requires further research.
  • The realizable case assumption simplifies the problem (the global minimum has zero loss). Non-realizable settings (where the true function is not representable by the network) are more complex.
  • The theoretical convergence rate $O(1/\epsilon^2)$ is the standard rate for driving the loss below $\epsilon$. The practical iteration count depends on the constants involved, which are influenced by the network size ($k$, $d$), the filter size ($m$), and properties of $w^*$.

In summary, this paper provides valuable theoretical insights into the factors affecting optimization in simplified ConvNets. It rigorously demonstrates that data distribution is a critical factor in determining whether non-convex training can be guaranteed to succeed with simple gradient-based methods, identifying the no-overlap ConvNet with ReLU and Gaussian inputs as a notable tractable case. It also highlights that even small architectural changes like allowing filter overlap can reintroduce optimization challenges, though empirical results hint that random restarts might offer a practical workaround in such cases.