Linear Recursive Feature Machines provably recover low-rank matrices (2401.04553v1)
Abstract: A fundamental problem in machine learning is to understand how neural networks make accurate predictions, while seemingly bypassing the curse of dimensionality. A possible explanation is that common training algorithms for neural networks implicitly perform dimensionality reduction - a process called feature learning. Recent work posited that the effects of feature learning can be elicited from a classical statistical estimator called the average gradient outer product (AGOP). The authors proposed Recursive Feature Machines (RFMs) as an algorithm that explicitly performs feature learning by alternating between (1) reweighting the feature vectors by the AGOP and (2) learning the prediction function in the transformed space. In this work, we develop the first theoretical guarantees for how RFM performs dimensionality reduction by focusing on the class of overparametrized problems arising in sparse linear regression and low-rank matrix recovery. Specifically, we show that RFM restricted to linear models (lin-RFM) generalizes the well-studied Iteratively Reweighted Least Squares (IRLS) algorithm. Our results shed light on the connection between feature learning in neural networks and classical sparse recovery algorithms. In addition, we provide an implementation of lin-RFM that scales to matrices with millions of missing entries. Our implementation is faster than the standard IRLS algorithm as it is SVD-free. It also outperforms deep linear networks for sparse linear regression and low-rank matrix completion.
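To make the alternation in the abstract concrete, below is a minimal NumPy sketch for sparse linear regression, assuming a diagonal reweighting: step (1) fits a linear predictor on the reweighted features, and step (2) updates the weights from the AGOP of that predictor, which for a linear model is constant in the input and reduces to the outer product of its coefficient vector. The function name `lin_rfm_sketch`, the ridge term `eps`, and the exponent `alpha` are illustrative choices, not the paper's exact lin-RFM update; the sketch is only meant to show how AGOP reweighting connects to IRLS-style reweighting.

```python
import numpy as np

def lin_rfm_sketch(X, y, n_iters=20, alpha=1.0, eps=1e-8):
    """Illustrative sketch of the alternation described in the abstract,
    specialized to sparse linear regression with a diagonal reweighting.

    Step (1): fit a linear predictor on the reweighted features X @ diag(m).
    Step (2): update m from the gradient of the fitted predictor; for a
    linear model f(x) = w^T x the gradient is w at every input, so the
    diagonal of the AGOP is simply the entrywise squares w_i^2.

    This is a hedged sketch, not the paper's exact lin-RFM algorithm.
    """
    n, d = X.shape
    m = np.ones(d)                          # diagonal feature reweighting
    for _ in range(n_iters):
        Xt = X * m                          # (1) reweight the features column-wise
        # least squares in the transformed space (small ridge for stability)
        v = np.linalg.solve(Xt.T @ Xt + eps * np.eye(d), Xt.T @ y)
        w = m * v                           # predictor in the original coordinates
        # (2) diagonal AGOP of a linear model is w**2; the exponent alpha
        # controls how aggressively small coordinates are suppressed
        m = (w ** 2 + eps) ** (alpha / 2)
    return w
```

Different choices of `alpha` change how strongly the iteration shrinks small coordinates, which is the sense in which this kind of reweighting interpolates between IRLS-style sparse recovery schemes; the precise correspondence between lin-RFM and IRLS is the subject of the paper.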
Authors: Adityanarayanan Radhakrishnan, Mikhail Belkin, Dmitriy Drusvyatskiy