Bilinear Sequence Regression: A Model for Learning from Long Sequences of High-dimensional Tokens (2410.18858v2)
Abstract: Current progress in artificial intelligence is centered around so-called large language models (LLMs), which consist of neural networks processing long sequences of high-dimensional vectors called tokens. Statistical physics provides powerful tools for studying learning with neural networks and has played a recognized role in the development of modern machine learning. The statistical physics approach relies on simplified, analytically tractable models of data. However, simple tractable models for long sequences of high-dimensional tokens remain largely underexplored. Inspired by the crucial role that models such as the single-layer teacher-student perceptron (a.k.a. generalized linear regression) played in the theory of fully connected neural networks, in this paper we introduce and study bilinear sequence regression (BSR) as one of the most basic models for learning from sequences of tokens. We note that modern architectures naturally subsume the BSR model thanks to skip connections. Building on recent methodological progress, we compute the Bayes-optimal generalization error for the model in the limit of long sequences of high-dimensional tokens and provide a message-passing algorithm that matches this performance. We quantify the improvement that optimal learning brings over vectorizing the sequence of tokens and learning via simple linear regression. We also unveil surprising properties of gradient descent algorithms in the BSR model.
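The setup described in the abstract lends itself to a compact numerical illustration. Below is a minimal sketch, in Python/NumPy, of a teacher-student version of bilinear sequence regression as we read it from the abstract: each sample is a sequence of L tokens of dimension d (a matrix X of shape L×d), the teacher label is a bilinear readout Tr(U*ᵀ X V*) with hidden width r, and a vectorized ridge regression serves as the "simple linear regression" baseline mentioned above. The Gaussian data, the 1/√(dLr) normalization, the hidden width r, and the use of plain gradient descent for the student are assumptions made for this sketch; the paper's Bayes-optimal analysis and message-passing algorithm are not reproduced here.

```python
# Minimal sketch of a bilinear sequence regression (BSR) teacher-student setup.
# Assumptions (not taken from the paper): Gaussian tokens and teacher weights,
# a 1/sqrt(d*L*r) normalization, and plain full-batch gradient descent.
import numpy as np

rng = np.random.default_rng(0)

L, d, r = 20, 30, 2         # sequence length, token dimension, hidden width
n_train, n_test = 400, 500  # n_train < L*d, so vectorized regression is underdetermined


def bsr_labels(X, U, V):
    """Bilinear readout y = Tr(U^T X V) / sqrt(d*L*r) for a batch of sequences X."""
    return np.einsum('nld,lr,dr->n', X, U, V) / np.sqrt(d * L * r)


# Teacher weights and data: each sample is a sequence of L tokens of dimension d.
U_star = rng.standard_normal((L, r))
V_star = rng.standard_normal((d, r))
X_train = rng.standard_normal((n_train, L, d))
X_test = rng.standard_normal((n_test, L, d))
y_train = bsr_labels(X_train, U_star, V_star)
y_test = bsr_labels(X_test, U_star, V_star)

# Baseline: vectorize each sequence and fit ridge regression (the "simple linear
# regression" comparison point; a small ridge term keeps the solve well-posed).
Phi_train = X_train.reshape(n_train, L * d)
Phi_test = X_test.reshape(n_test, L * d)
w = np.linalg.solve(Phi_train.T @ Phi_train + 1e-3 * np.eye(L * d),
                    Phi_train.T @ y_train)
mse_linear = np.mean((Phi_test @ w - y_test) ** 2)

# Student: full-batch gradient descent on the bilinear weights (U, V) from a random start.
U = rng.standard_normal((L, r))
V = rng.standard_normal((d, r))
lr = 0.2
for step in range(4000):
    resid = bsr_labels(X_train, U, V) - y_train  # shape (n_train,)
    # Gradients of the half mean-squared-error loss with respect to U and V.
    grad_U = np.einsum('n,nld,dr->lr', resid, X_train, V) / (n_train * np.sqrt(d * L * r))
    grad_V = np.einsum('n,nld,lr->dr', resid, X_train, U) / (n_train * np.sqrt(d * L * r))
    U -= lr * grad_U
    V -= lr * grad_V

mse_bilinear = np.mean((bsr_labels(X_test, U, V) - y_test) ** 2)
print(f"test MSE, linear on vectorized tokens: {mse_linear:.4f}")
print(f"test MSE, bilinear student (GD):       {mse_bilinear:.4f}")
```

With fewer samples than the L·d parameters of the vectorized linear model, the ridge baseline cannot pin down the teacher, while the bilinear student has only (L+d)·r parameters and can in principle do much better; the specific hyperparameters above are illustrative choices, not values from the paper.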