- The paper demonstrates that overparameterization in deep networks serves as an implicit preconditioning mechanism that accelerates gradient descent.
- It employs linear neural networks to decouple depth from model expressiveness, isolating the pure impact of overparameterization on optimization speed.
- Empirical results show that deeper linear architectures can converge faster than their shallow counterparts, in some cases outperforming adaptive methods such as AdaGrad and AdaDelta.
Implicit Acceleration by Overparameterization in Deep Networks
The paper, "On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization," investigates an intriguing aspect of deep learning concerning the optimization process in overparameterized networks. The paper challenges conventional wisdom which suggests that increasing the depth of neural networks primarily enhances expressiveness while complicating optimization.
Key Insights and Methodology
The authors explore the possibility that additional layers do not merely complicate training but can act as a preconditioning mechanism that accelerates convergence. Here, overparameterization means expanding the network beyond what is minimally necessary to represent the training data, specifically by decomposing a weight matrix into a product of several matrices without changing the model's expressive capacity.
The analysis is built upon linear neural networks (LNNs), where depth is introduced without altering the inherent model expressiveness. This choice ensures that any observed differences in optimization speed can be attributed solely to the overparameterization effect and not to increased expressiveness.
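As a concrete illustration (a minimal NumPy sketch, not code from the paper), the following shows how a shallow linear predictor can be rewritten as a depth-2 linear network with the same expressive capacity; all names and dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 5, 3

# Shallow linear model: y_hat = W @ x
W = rng.standard_normal((d_out, d_in))

# Depth-2 overparameterization of the same model class: y_hat = W2 @ W1 @ x.
# With hidden width >= min(d_in, d_out), the product W2 @ W1 can realize any
# d_out x d_in matrix, so expressiveness is unchanged; only the
# parameterization (and hence the gradient-descent trajectory) differs.
hidden = min(d_in, d_out)
W1 = rng.standard_normal((hidden, d_in))
W2 = rng.standard_normal((d_out, hidden))

x = rng.standard_normal(d_in)
print((W @ x).shape, (W2 @ W1 @ x).shape)  # both are ordinary linear maps
```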
Theoretical Framework
The paper's theoretical contribution lies in formulating the dynamics of gradient descent in deep linear networks. Using continuous-time differential equations, it shows that decomposing a parameter matrix into a product of several matrices yields an update that emulates acceleration techniques such as momentum and adaptive learning rates. Notably, this holds even in simple convex settings such as linear regression with an ℓp loss for p > 2.
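In rough form (restated here under the paper's assumptions of a small learning rate η and balanced, near-zero initialization; constants and conditions are elided), the end-to-end matrix W_e = W_N ⋯ W_1 of a depth-N linear network evolves as:

```latex
\frac{d}{dt} W_e(t) \;=\; -\,\eta \sum_{j=1}^{N}
    \left[ W_e(t)\, W_e(t)^{\top} \right]^{\frac{j-1}{N}}
    \, \nabla L\!\left(W_e(t)\right) \,
    \left[ W_e(t)^{\top} W_e(t) \right]^{\frac{N-j}{N}}
```

The fractional powers of W_e W_e^T and W_e^T W_e act as a preconditioner that depends on the current end-to-end matrix, which is how depth translates into an adaptive rescaling of the gradient.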
The authors argue, and empirically verify, that the effect of increasing depth in these networks amounts to a particular form of preconditioning in which both the learning rate and the preferred update directions adapt to the optimization path already taken. Furthermore, they prove that this preconditioning cannot be reproduced by adding any regularizer to the loss of the original shallow model.
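A discretized version of that preconditioned step can be applied directly to the end-to-end parameters, with no extra layers ever materialized. The sketch below is an illustrative implementation under that reading of the result, not the authors' code; `deep_linear_update` and its arguments are hypothetical names:

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def deep_linear_update(W_e, grad, depth, lr):
    """One discretized preconditioned step on the end-to-end matrix W_e.

    `grad` is the ordinary gradient of the loss at W_e.  It is pre- and
    post-multiplied by fractional powers of W_e W_e^T and W_e^T W_e,
    mimicking the effect that `depth` virtual linear layers would have
    on plain gradient descent.
    """
    left = W_e @ W_e.T
    right = W_e.T @ W_e
    update = np.zeros_like(W_e)
    for j in range(1, depth + 1):
        # .real drops negligible imaginary parts that
        # fractional_matrix_power can introduce numerically.
        update += (
            fractional_matrix_power(left, (j - 1) / depth).real
            @ grad
            @ fractional_matrix_power(right, (depth - j) / depth).real
        )
    return W_e - lr * update
```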
Numerical Results
Experimentally, the paper compares gradient descent on overparameterized (deep linear) models against plain linear models and established acceleration methods in practical settings. The results show that the acceleration effect is competitive with, and in some cases surpasses, AdaGrad and AdaDelta. The strength of the effect, however, varies with the network configuration and the objective, for example when moving from an ℓ2 to an ℓ4 regression loss.
The empirical findings support the theory: networks of different depths converge at different rates, and deeper architectures can converge faster at essentially no extra computational cost, since the effect depends on depth rather than width. A toy comparison along these lines is sketched below.
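The following is a small, self-contained sketch of such a comparison (a toy setup with arbitrary hyperparameters, not the paper's experiments): plain gradient descent on an ℓ4 regression loss, once with the usual linear parameterization and once with the same model overparameterized by extra scalar layers.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10
X = rng.standard_normal((n, d))
y = X @ (rng.standard_normal(d) / np.sqrt(d))   # synthetic linear targets

def l4_loss(w_e):
    return np.mean((X @ w_e - y) ** 4)

def l4_grad(w_e):
    r = X @ w_e - y
    return (4.0 / n) * X.T @ (r ** 3)

def train(depth, steps=5000, lr=1e-3, init_scale=0.5):
    """Gradient descent on a depth-`depth` linear network whose hidden layers
    are scalars: w_e = c_depth * ... * c_2 * w1.  Expressiveness is identical
    to the plain linear model; only the parameterization differs."""
    w1 = init_scale * rng.standard_normal(d)
    scalars = [init_scale] * (depth - 1)
    losses = []
    for _ in range(steps):
        prod = float(np.prod(scalars)) if scalars else 1.0
        w_e = prod * w1
        losses.append(l4_loss(w_e))
        g = l4_grad(w_e)                 # gradient w.r.t. the end-to-end vector
        # Backpropagate through the product parameterization by hand
        # (assumes the scalars stay away from zero; fine for this toy run).
        new_scalars = [c - lr * (prod / c) * (w1 @ g) for c in scalars]
        w1 = w1 - lr * prod * g
        scalars = new_scalars
    return losses

print("final l4 loss, depth 1:", train(depth=1)[-1])
print("final l4 loss, depth 3:", train(depth=3)[-1])
```

With depth=1 the loop reduces to ordinary gradient descent, so the two runs differ only in parameterization; how much the deeper run helps depends on the chosen step size, initialization scale, and number of steps.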
Future Implications and Research Directions
This work invites further inquiry into how reparameterizing neural architectures affects optimization, both in theory and in practice. While the findings pertain to linear networks, extending them to non-linear models is a promising avenue. Characterizing the broader class of problems and conditions under which overparameterization yields acceleration would also benefit the training of large-scale neural systems.
Future research could focus on bridging these theoretical insights with practical tools and frameworks, potentially leading to novel neural network architectures that strategically harness overparameterization for enhanced training efficacy.
The paper adds a piece to the puzzle of how neural network structure shapes optimization dynamics, challenging the assumed trade-off between depth, expressiveness, and trainability.