Stacking as Accelerated Gradient Descent (2403.04978v2)

Published 8 Mar 2024 in cs.LG and stat.ML

Abstract: Stacking, a heuristic technique for training deep residual networks by progressively increasing the number of layers and initializing new layers by copying parameters from older layers, has proven quite successful in improving the efficiency of training deep neural networks. In this paper, we propose a theoretical explanation for the efficacy of stacking: viz., stacking implements a form of Nesterov's accelerated gradient descent. The theory also covers simpler models such as the additive ensembles constructed in boosting methods, and provides an explanation for a similar widely-used practical heuristic for initializing the new classifier in each round of boosting. We also prove that for certain deep linear residual networks, stacking does provide accelerated training, via a new potential function analysis of Nesterov's accelerated gradient method which allows errors in updates. We conduct proof-of-concept experiments to validate our theory as well.


Summary

  • The paper shows that stacking initialization emulates Nesterov’s accelerated gradient descent, yielding faster convergence in deep models.
  • It develops a theoretical framework linking various initialization strategies to their distinct convergence properties.
  • Empirical results on synthetic and real data validate stacking’s superior performance in accelerating training of deep residual networks.

Understanding the Efficacy of Stacking for Training Deep Networks

Introduction to Stacking in Deep Learning

As deep learning models continue to grow, the efficiency of training is a paramount concern. One technique that has proven effective for training deep models, particularly deep transformers, is "stacking": a training strategy in which a deep network is built in stages, progressively adding new layers and initializing them by copying parameters from existing layers. Recent studies have highlighted the potential of stacking to significantly speed up the training of large transformer models. However, a comprehensive theoretical understanding of why stacking works so well has been lacking.
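To make the stagewise procedure concrete, here is a minimal PyTorch-style sketch of growing a residual network with stacking initialization. The model, the helper names (`ResidualMLP`, `grow_by_stacking`), and the training loop are illustrative assumptions, not code from the paper.

```python
import torch
import torch.nn as nn


class ResidualMLP(nn.Module):
    """Toy residual network: x -> x + block(x), applied once per block."""

    def __init__(self, dim):
        super().__init__()
        self.dim = dim
        self.blocks = nn.ModuleList()

    def add_block(self, init_from=None):
        block = nn.Sequential(
            nn.Linear(self.dim, self.dim),
            nn.ReLU(),
            nn.Linear(self.dim, self.dim),
        )
        if init_from is not None:
            # Stacking initialization: copy parameters from an existing block
            # instead of starting from zeros or a fresh random draw.
            block.load_state_dict(init_from.state_dict())
        self.blocks.append(block)

    def forward(self, x):
        for block in self.blocks:
            x = x + block(x)  # residual connection
        return x


def grow_by_stacking(model, train_one_stage, num_stages):
    """Stagewise training: deepen the model, copying the newest block each time."""
    for _ in range(num_stages):
        init_from = model.blocks[-1] if len(model.blocks) > 0 else None
        model.add_block(init_from=init_from)
        train_one_stage(model)  # e.g. a few steps of SGD on the whole model


if __name__ == "__main__":
    model = ResidualMLP(dim=16)
    x, y = torch.randn(32, 16), torch.randn(32, 16)

    def train_one_stage(m, steps=100, lr=1e-2):
        opt = torch.optim.SGD(m.parameters(), lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = ((m(x) - y) ** 2).mean()
            loss.backward()
            opt.step()

    grow_by_stacking(model, train_one_stage, num_stages=4)
```

The key step is that each new block starts from a copy of the most recently trained block rather than from zeros or a random initialization.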

The Accelerated Gradient Perspective on Stacking

Our work explores the theoretical underpinnings of stacking and proposes that its success can be attributed to the manner in which it emulates a form of Nesterov's accelerated gradient descent (AGD) in function space. This perspective not only sheds light on the theoretical foundations of stacking but also unifies it with a fundamental optimization method known for its fast convergence properties.
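For reference, one standard form of Nesterov's accelerated gradient method for a smooth loss L is sketched below; in the functional-space view used here, the parameter iterate x_t is replaced by the ensemble function f_t and the gradient by a functional gradient. The exact momentum schedule used in the paper may differ.

```latex
% Standard Nesterov accelerated gradient iteration for a smooth convex loss L.
\begin{aligned}
y_t     &= x_t + \beta_t \,(x_t - x_{t-1})  && \text{(momentum / extrapolation step)} \\
x_{t+1} &= y_t - \eta \,\nabla L(y_t)       && \text{(gradient step at the extrapolated point)}
\end{aligned}
```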

Stagewise Training and Initialization Strategies

We consider stagewise training, in which models are trained in stages, each stage adding a new function to an ensemble so as to minimize a loss function. We analyze three initialization strategies for the newly added function: zero initialization, random initialization, and stacking initialization. Through a theoretical framework, we establish connections between these strategies and their implications for the convergence of the overall training process; a schematic comparison appears after the list below.

  1. Zero Initialization: Leads to functional gradient descent, recovering well-known results in the context of boosting and providing new insights for residual compositional models.
  2. Random Initialization: Results in stochastic functional gradient descent on a smoothed version of the loss function.
  3. Stacking Initialization: Remarkably, when applied to additive models, stacking initialization recovers Nesterov's accelerated functional gradient descent, yielding faster convergence than zero initialization.
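The following schematic summarizes how each initialization positions the start of a new stage for an additive ensemble f_{t+1} = f_t + g_{t+1}. It is an informal reconstruction consistent with the descriptions above, not the paper's formal statement.

```latex
% One stage of stagewise training of an additive ensemble f_{t+1} = f_t + g_{t+1},
% where g_{t+1}^{(0)} denotes the initialization of the newly added function.
\begin{aligned}
\text{zero init:}     &\quad g_{t+1}^{(0)} = 0
  &&\Rightarrow \text{stage starts at } f_t \text{ (functional gradient descent)} \\
\text{random init:}   &\quad g_{t+1}^{(0)} \sim \mathcal{D}
  &&\Rightarrow \text{stochastic functional gradient descent on a smoothed loss} \\
\text{stacking init:} &\quad g_{t+1}^{(0)} = g_t = f_t - f_{t-1}
  &&\Rightarrow \text{stage starts at } f_t + (f_t - f_{t-1}), \text{ Nesterov's look-ahead point}
\end{aligned}
```

In other words, copying the previous increment places the new stage at exactly the kind of extrapolated point that Nesterov's method evaluates its gradient at, which is the heart of the acceleration argument.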

Accelerated Convergence with Stacking in Deep Linear Networks

To demonstrate the benefits of stacking more concretely, we analyze deep linear residual networks under a certain loss function and initialization conditions. We prove that stacking, with appropriate modifications, provides accelerated training comparable to Nesterov's accelerated method. This result hinges on a novel analysis of Nesterov's method that accounts for errors in updates, which could be of independent interest.
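A generic way to write Nesterov updates with errors is sketched below; the additive error term e_t and the bound ε_t are illustrative placeholders, and the paper's precise error model and potential function are not reproduced here.

```latex
% Inexact-update variant of Nesterov's method: each gradient step is
% perturbed by an error term e_t (additive here, for illustration).
\begin{aligned}
y_t     &= x_t + \beta_t \,(x_t - x_{t-1}) \\
x_{t+1} &= y_t - \eta \bigl(\nabla L(y_t) + e_t\bigr),
  \qquad \lVert e_t \rVert \le \varepsilon_t
\end{aligned}
```

The role of the potential-function analysis is then to show that the accelerated rate degrades gracefully as long as the per-step errors ε_t remain controlled.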

Empirical Validation

We complement our theoretical contributions with empirical studies on synthetic and real-world data, validating the accelerated convergence phenomenon with stacking, particularly in comparison to other initialization strategies. Our experiments demonstrate the advantages of stacking in practical deep learning settings.

Conclusion and Outlook

This work provides a theoretical foundation for understanding the success of stacking in training deep neural models, particularly highlighting its connection to accelerated gradient methods. The insights gained open several avenues for future research, including exploring efficiently implementable initialization schemes that could further harness the power of acceleration principles in deep learning. Additionally, extending the theoretical results to non-linear and more general settings remains an exciting challenge.
