Stacking as Accelerated Gradient Descent (2403.04978v2)

Published 8 Mar 2024 in cs.LG and stat.ML

Abstract: Stacking, a heuristic technique for training deep residual networks by progressively increasing the number of layers and initializing new layers by copying parameters from older layers, has proven quite successful in improving the efficiency of training deep neural networks. In this paper, we propose a theoretical explanation for the efficacy of stacking: viz., stacking implements a form of Nesterov's accelerated gradient descent. The theory also covers simpler models such as the additive ensembles constructed in boosting methods, and provides an explanation for a similar widely-used practical heuristic for initializing the new classifier in each round of boosting. We also prove that for certain deep linear residual networks, stacking does provide accelerated training, via a new potential function analysis of Nesterov's accelerated gradient method which allows errors in updates. We conduct proof-of-concept experiments to validate our theory as well.


Summary

  • The paper shows that stacking initialization emulates Nesterov’s accelerated gradient descent, yielding faster convergence in deep models.
  • It develops a theoretical framework linking various initialization strategies to their distinct convergence properties.
  • Empirical results on synthetic and real data validate stacking’s superior performance in accelerating training of deep residual networks.

Understanding the Efficacy of Stacking for Training Deep Networks

Introduction to Stacking in Deep Learning

As deep learning models continue to grow, the efficiency of training is a paramount concern. One technique that has proven effective for training deep models, particularly deep transformers, is "stacking": a training strategy in which a deep network is built in stages, progressively adding new layers and initializing them by copying parameters from existing layers. Recent studies have highlighted the potential of stacking to significantly speed up the training of large transformer models. However, a comprehensive theoretical understanding of why stacking works so well has been lacking.
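To make the stagewise procedure concrete, here is a minimal PyTorch-style sketch of growing a residual network with stacking initialization. The model, the helper names (`ResidualMLP`, `grow_by_stacking`), and the training loop are illustrative assumptions, not code from the paper.

```python
import torch
import torch.nn as nn


class ResidualMLP(nn.Module):
    """Toy residual network: x -> x + block(x), applied once per block."""

    def __init__(self, dim):
        super().__init__()
        self.dim = dim
        self.blocks = nn.ModuleList()

    def add_block(self, init_from=None):
        block = nn.Sequential(
            nn.Linear(self.dim, self.dim),
            nn.ReLU(),
            nn.Linear(self.dim, self.dim),
        )
        if init_from is not None:
            # Stacking initialization: copy parameters from an existing block
            # instead of starting from zeros or a fresh random draw.
            block.load_state_dict(init_from.state_dict())
        self.blocks.append(block)

    def forward(self, x):
        for block in self.blocks:
            x = x + block(x)  # residual connection
        return x


def grow_by_stacking(model, train_one_stage, num_stages):
    """Stagewise training: deepen the model, copying the newest block each time."""
    for _ in range(num_stages):
        init_from = model.blocks[-1] if len(model.blocks) > 0 else None
        model.add_block(init_from=init_from)
        train_one_stage(model)  # e.g. a few steps of SGD on the whole model


if __name__ == "__main__":
    model = ResidualMLP(dim=16)
    x, y = torch.randn(32, 16), torch.randn(32, 16)

    def train_one_stage(m, steps=100, lr=1e-2):
        opt = torch.optim.SGD(m.parameters(), lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = ((m(x) - y) ** 2).mean()
            loss.backward()
            opt.step()

    grow_by_stacking(model, train_one_stage, num_stages=4)
```

The key step is that each new block starts from a copy of the most recently trained block rather than from zeros or a random initialization.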

The Accelerated Gradient Perspective on Stacking

Our work explores the theoretical underpinnings of stacking and proposes that its success can be attributed to the manner in which it emulates a form of Nesterov's accelerated gradient descent (AGD) in function space. This perspective not only sheds light on the theoretical foundations of stacking but also unifies it with a fundamental optimization method known for its fast convergence properties.
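For reference, one standard form of Nesterov's accelerated gradient method for a smooth loss L is sketched below; in the functional-space view used here, the parameter iterate x_t is replaced by the ensemble function f_t and the gradient by a functional gradient. The exact momentum schedule used in the paper may differ.

```latex
% Standard Nesterov accelerated gradient iteration for a smooth convex loss L.
\begin{aligned}
y_t     &= x_t + \beta_t \,(x_t - x_{t-1})  && \text{(momentum / extrapolation step)} \\
x_{t+1} &= y_t - \eta \,\nabla L(y_t)       && \text{(gradient step at the extrapolated point)}
\end{aligned}
```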

Stagewise Training and Initialization Strategies

We consider stagewise training, in which models are trained in stages, each stage adding a new function to an ensemble so as to minimize a loss function. We analyze three initialization strategies for the newly added function: zero initialization, random initialization, and stacking initialization. Through a theoretical framework, we establish connections between these strategies and their implications for the convergence of the overall training process; a schematic comparison appears after the list below.

  1. Zero Initialization: Leads to functional gradient descent, recovering well-known results in the context of boosting and providing new insights for residual compositional models.
  2. Random Initialization: Results in stochastic functional gradient descent on a smoothed version of the loss function.
  3. Stacking Initialization: Remarkably, when applied to additive models, stacking initialization recovers Nesterov's accelerated functional gradient descent, yielding faster convergence than zero initialization.
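The following schematic summarizes how each initialization positions the start of a new stage for an additive ensemble f_{t+1} = f_t + g_{t+1}. It is an informal reconstruction consistent with the descriptions above, not the paper's formal statement.

```latex
% One stage of stagewise training of an additive ensemble f_{t+1} = f_t + g_{t+1},
% where g_{t+1}^{(0)} denotes the initialization of the newly added function.
\begin{aligned}
\text{zero init:}     &\quad g_{t+1}^{(0)} = 0
  &&\Rightarrow \text{stage starts at } f_t \text{ (functional gradient descent)} \\
\text{random init:}   &\quad g_{t+1}^{(0)} \sim \mathcal{D}
  &&\Rightarrow \text{stochastic functional gradient descent on a smoothed loss} \\
\text{stacking init:} &\quad g_{t+1}^{(0)} = g_t = f_t - f_{t-1}
  &&\Rightarrow \text{stage starts at } f_t + (f_t - f_{t-1}), \text{ Nesterov's look-ahead point}
\end{aligned}
```

In other words, copying the previous increment places the new stage at exactly the kind of extrapolated point that Nesterov's method evaluates its gradient at, which is the heart of the acceleration argument.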

Accelerated Convergence with Stacking in Deep Linear Networks

To demonstrate the benefits of stacking more concretely, we analyze deep linear residual networks under a certain loss function and initialization conditions. We prove that stacking, with appropriate modifications, provides accelerated training comparable to Nesterov's accelerated method. This result hinges on a novel analysis of Nesterov's method that accounts for errors in updates, which could be of independent interest.
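A generic way to write Nesterov updates with errors is sketched below; the additive error term e_t and the bound ε_t are illustrative placeholders, and the paper's precise error model and potential function are not reproduced here.

```latex
% Inexact-update variant of Nesterov's method: each gradient step is
% perturbed by an error term e_t (additive here, for illustration).
\begin{aligned}
y_t     &= x_t + \beta_t \,(x_t - x_{t-1}) \\
x_{t+1} &= y_t - \eta \bigl(\nabla L(y_t) + e_t\bigr),
  \qquad \lVert e_t \rVert \le \varepsilon_t
\end{aligned}
```

The role of the potential-function analysis is then to show that the accelerated rate degrades gracefully as long as the per-step errors ε_t remain controlled.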

Empirical Validation

We complement our theoretical contributions with empirical studies on synthetic and real-world data, validating the accelerated convergence phenomenon with stacking, particularly in comparison to other initialization strategies. Our experiments demonstrate the advantages of stacking in practical deep learning settings.

Conclusion and Outlook

This work provides a theoretical foundation for understanding the success of stacking in training deep neural models, particularly highlighting its connection to accelerated gradient methods. The insights gained open several avenues for future research, including exploring efficiently implementable initialization schemes that could further harness the power of acceleration principles in deep learning. Additionally, extending the theoretical results to non-linear and more general settings remains an exciting challenge.
