Training Multi-Layer Over-Parametrized Neural Network in Subquadratic Time (2112.07628v2)
Abstract: We consider the problem of training a multi-layer over-parametrized neural network to minimize the empirical risk induced by a loss function. In the typical setting of over-parametrization, the network width $m$ is much larger than the data dimension $d$ and the number of training samples $n$ ($m=\mathrm{poly}(n,d)$), which induces a prohibitively large weight matrix $W\in \mathbb{R}^{m\times m}$ per layer. Naively, one has to pay $O(m^2)$ time to read the weight matrix and evaluate the neural network function in both the forward and backward computation. In this work, we show how to reduce the training cost per iteration. Specifically, we propose a framework that incurs $m^2$ cost only in the initialization phase and achieves \emph{a truly subquadratic cost per iteration} in terms of $m$, i.e., $m^{2-\Omega(1)}$ per iteration. Our result has implications beyond standard over-parametrization theory: it can be viewed as designing an efficient data structure on top of a pre-trained large model to further speed up fine-tuning, a core procedure in deploying large language models (LLMs).
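To make the cost model concrete, the sketch below contrasts the naive dense per-layer pass, which reads all $m^2$ entries of $W$, with a hypothetical sparsity-aware pass that touches only $k \ll m$ neurons per sample. The choice of $k$, the random "active set", and the NumPy setup are illustrative assumptions; the paper's actual preprocessing data structure and activation scheme differ, so this is a minimal sketch of the accounting, not the paper's algorithm.

```python
import numpy as np

# Illustrative cost accounting (not the paper's actual data structure):
# a dense m x m layer costs O(m^2) per forward pass, while a pass that
# touches only k active neurons costs O(k*m).

m, k = 4096, 64          # width m, assumed number of active neurons k << m
rng = np.random.default_rng(0)

W = rng.standard_normal((m, m)) / np.sqrt(m)   # one hidden layer's weights
x = rng.standard_normal(m)                     # input to the layer

# Naive forward pass: reads every entry of W.
naive_out = np.maximum(W @ x, 0.0)             # O(m^2) work

# Hypothetical sparse pass: suppose a structure built once at O(m^2)
# initialization cost reports which k rows of W can fire on this input.
active = rng.choice(m, size=k, replace=False)  # stand-in for that query
sparse_out = np.zeros(m)
sparse_out[active] = np.maximum(W[active] @ x, 0.0)  # O(k*m) work

print(f"naive ~ m^2 = {m*m:,} mults, sparse ~ k*m = {k*m:,} mults")
```

In this illustration, the one-time $m^2$ work stands in for whatever index makes the "which neurons matter" query cheap; the abstract's claim is that each subsequent iteration then costs $m^{2-\Omega(1)}$ rather than $m^2$.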