MLPs Learn In-Context on Regression and Classification Tasks (2405.15618v2)
Abstract: In-context learning (ICL), the remarkable ability to solve a task from input exemplars alone, is often assumed to be a unique hallmark of Transformer models. By examining commonly employed synthetic ICL tasks, we demonstrate that multi-layer perceptrons (MLPs) can also learn in-context. Moreover, MLPs, and the closely related MLP-Mixer models, learn in-context competitively with Transformers given the same compute budget in this setting. We further show that MLPs outperform Transformers on a series of classical tasks from psychology designed to test relational reasoning, which are closely related to in-context classification. These results underscore a need for studying in-context learning beyond attention-based architectures, while also challenging strong prior arguments about MLPs' limited ability to solve relational tasks. Altogether, our results highlight the unexpected competence of MLPs and support the growing interest in all-MLP alternatives to task-specific architectures.
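To make the setup concrete, below is a minimal sketch of the standard synthetic in-context regression task (a Garg-et-al.-style linear-regression stream) and an MLP that reads the context as a single flattened vector. It is written in JAX/Flax, the stack the paper builds on, but it is an illustrative assumption, not the authors' code: the dimensions `D`, `N`, the `hidden`/`depth` sizes, and the exact flattening scheme are hypothetical placeholders.

```python
# Minimal sketch (not the authors' exact code) of a synthetic in-context
# regression task and an MLP baseline that consumes the flattened context.
import jax
import jax.numpy as jnp
import flax.linen as nn

D, N = 8, 16  # input dimension, number of in-context exemplars (illustrative)

def sample_task(key):
    """Draw a random linear task w, its exemplars (x_i, y_i), and a query."""
    kw, kx, kq = jax.random.split(key, 3)
    w = jax.random.normal(kw, (D,))          # latent task vector
    xs = jax.random.normal(kx, (N, D))       # context inputs
    ys = xs @ w                              # context targets
    xq = jax.random.normal(kq, (D,))         # query input
    # Flatten (x_1, y_1, ..., x_N, y_N, x_q) into one vector for the MLP.
    pairs = jnp.concatenate([xs, ys[:, None]], axis=1)   # (N, D + 1)
    ctx = jnp.concatenate([pairs.ravel(), xq])
    return ctx, xq @ w                       # (flattened input, query target)

class MLP(nn.Module):
    hidden: int = 256  # width and depth are placeholder hyperparameters
    depth: int = 4

    @nn.compact
    def __call__(self, x):
        for _ in range(self.depth):
            x = nn.relu(nn.Dense(self.hidden)(x))
        return nn.Dense(1)(x)[..., 0]        # scalar prediction for y_q

key = jax.random.PRNGKey(0)
model = MLP()
ctx, y_q = sample_task(key)
params = model.init(key, ctx)

def loss(params, key):
    ctx, y_q = sample_task(key)              # a fresh task every step
    return (model.apply(params, ctx) - y_q) ** 2

print(loss(params, key))                     # squared error on one task
grads = jax.grad(loss)(params, key)          # one SGD-ready gradient
```

Because every training example is a freshly sampled task, the network can only drive this loss down by learning to infer `w` from the exemplars in its input, i.e., by learning in-context; the sketch makes clear that nothing in the objective requires attention.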
- William L. Tong
- Cengiz Pehlevan