Stacking as Accelerated Gradient Descent (2403.04978v2)
Abstract: Stacking, a heuristic technique for training deep residual networks by progressively increasing the number of layers and initializing new layers by copying parameters from older layers, has proven quite successful in improving the efficiency of training deep neural networks. In this paper, we propose a theoretical explanation for the efficacy of stacking: namely, that stacking implements a form of Nesterov's accelerated gradient descent. The theory also covers simpler models, such as the additive ensembles constructed in boosting methods, and provides an explanation for a similar widely used practical heuristic for initializing the new classifier in each round of boosting. We also prove that, for certain deep linear residual networks, stacking does provide accelerated training, via a new potential function analysis of Nesterov's accelerated gradient method that allows for errors in the updates. We further conduct proof-of-concept experiments to validate our theory.
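For intuition, the sketch below illustrates the stacking heuristic described in the abstract: a residual network is grown stage by stage, and each newly added block is initialized as a copy of the current top block before training resumes. This is a minimal sketch of the growth step only, not the paper's code; the toy `ResidualBlock` module and the `grow_by_stacking` helper are hypothetical names introduced here for illustration.

```python
# Minimal sketch (assumed toy setup, not the paper's code) of the stacking
# heuristic: grow a residual network stage by stage, initializing each new
# block as a copy of the current top block before training resumes.
import copy

import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """A toy residual block: x -> x + Linear(x)."""

    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.linear(x)


def grow_by_stacking(blocks: nn.ModuleList) -> nn.ModuleList:
    """Append a new block whose parameters are copied from the current top block."""
    blocks.append(copy.deepcopy(blocks[-1]))
    return blocks


# Usage: start with one block and grow to four by stacking.
dim = 16
blocks = nn.ModuleList([ResidualBlock(dim)])
for _ in range(3):
    blocks = grow_by_stacking(blocks)

x = torch.randn(8, dim)
for block in blocks:
    x = block(x)
print(x.shape)  # torch.Size([8, 16])
```

For reference, the standard Nesterov-style updates are $y_t = x_t + \beta_t (x_t - x_{t-1})$ followed by $x_{t+1} = y_t - \eta \nabla f(y_t)$; the abstract's claim is that stacking implements a form of this scheme, with the precise correspondence developed in the paper.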