
MLPs Learn In-Context on Regression and Classification Tasks (2405.15618v2)

Published 24 May 2024 in cs.LG and cs.NE

Abstract: In-context learning (ICL), the remarkable ability to solve a task from only input exemplars, is often assumed to be a unique hallmark of Transformer models. By examining commonly employed synthetic ICL tasks, we demonstrate that multi-layer perceptrons (MLPs) can also learn in-context. Moreover, MLPs, and the closely related MLP-Mixer models, learn in-context competitively with Transformers given the same compute budget in this setting. We further show that MLPs outperform Transformers on a series of classical tasks from psychology designed to test relational reasoning, which are closely related to in-context classification. These results underscore a need for studying in-context learning beyond attention-based architectures, while also challenging strong prior arguments about MLPs' limited ability to solve relational tasks. Altogether, our results highlight the unexpected competence of MLPs, and support the growing interest in all-MLP alternatives to task-specific architectures.

Authors (2)
  1. William L. Tong
  2. Cengiz Pehlevan

Summary

In-context Learning Beyond Transformers: An Evaluation of Multi-Layer Perceptrons

The paper investigates the in-context learning (ICL) capabilities of multi-layer perceptrons (MLPs), an ability traditionally considered a hallmark of Transformer models. The findings challenge the common belief that ICL is exclusive to attention-based architectures: MLPs, as well as the closely related MLP-Mixer models, learn in-context competitively with Transformers given the same compute budget. Notably, MLPs even outperform Transformers on a set of tasks designed to test relational reasoning.

Key Contributions

  1. Demonstration of In-context Learning in MLPs: The authors successfully show that MLPs can perform in-context learning similarly to Transformers, suggesting that the ability is not unique to attention-based models. This finding aligns with the universal approximation capability of MLPs, now extended to in-context scenarios.
  2. Superior Relational Reasoning: MLPs outperform Transformers on relational reasoning tasks, challenging the narrative that more sophisticated architectures with stronger inductive biases are always better suited for complex cognitive tasks.
  3. Less Inductive Bias is Better: The paper underscores that models with weaker inductive biases, such as MLPs, can outperform those with stronger biases as data and compute resources grow. This observation supports the broader "bitter lesson" heuristic, which posits that general methods tend to win out as compute increases.

Experiments and Results

The authors conduct a series of controlled experiments comparing MLPs, MLP-Mixers, and Transformers on synthetic tasks commonly used as ICL benchmarks.

In-context Regression and Classification

  1. ICL Regression: MLPs and MLP-Mixer models achieve near-optimal mean squared error (MSE), comparable to Transformers, on a series of ICL regression tasks. Although MLP performance degrades as the number of context points grows, the MLP-Mixer remains robust, highlighting the potential of architectures derived from MLPs. A sketch of how such a regression example can be constructed appears after this list.
  2. ICL Classification: In classification tasks, both MLPs and Transformers transition from in-weight learning (IWL) to ICL as data diversity increases. MLPs perform competitively with Transformers and handle varying numbers of context exemplars efficiently; a second sketch after this list illustrates the classification setup.
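To make the regression setting concrete, the sketch below shows one way a synthetic in-context linear-regression example could be constructed and flattened into a single fixed-length vector that a plain MLP can consume. The dimensions, context length, noise level, and flattening scheme are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def make_icl_regression_example(d=8, n_context=16, noise_std=0.1, rng=None):
    """Sample one synthetic in-context linear-regression example (sketch).

    A latent weight vector w defines the task; the model sees n_context
    (x, y) exemplars plus a query x and must predict the query's y.
    Dimension, context length, and noise level are illustrative choices.
    """
    rng = rng or np.random.default_rng()
    w = rng.normal(size=d)                      # latent task vector
    xs = rng.normal(size=(n_context + 1, d))    # context points + query
    ys = xs @ w + noise_std * rng.normal(size=n_context + 1)

    # Flatten the context pairs and append the query point, so a plain MLP
    # (with no attention or token structure) sees the whole context at once.
    context = np.concatenate([xs[:-1].ravel(), ys[:-1]])
    mlp_input = np.concatenate([context, xs[-1]])
    target = ys[-1]                             # regression target for the query
    return mlp_input, target

x, y = make_icl_regression_example()
print(x.shape)   # (d * n_context + n_context + d,) = (152,)
```

Because the context arrives as one flat vector, the MLP has no architectural notion of separate exemplars; any in-context behavior has to emerge from training over many such randomly drawn tasks.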
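In the same spirit, here is a minimal sketch of the classification setting, assuming context exemplars are noisy samples of class centroids and the query belongs to one of the classes shown in context. The size of the centroid pool stands in for the data diversity that drives the IWL-to-ICL transition; the two-classes-per-context design, noise level, and encoding are assumptions for illustration.

```python
import numpy as np

def make_icl_classification_example(centroids, n_context=8, noise_std=0.2, rng=None):
    """Sample one in-context classification example (sketch).

    `centroids` is a (num_classes, d) pool of class centers; a larger pool
    (greater data diversity) is what pushes models from in-weights learning
    toward in-context learning. Context exemplars are noisy copies of two
    sampled classes, and the query belongs to one of them.
    """
    rng = rng or np.random.default_rng()
    num_classes, d = centroids.shape
    pair = rng.choice(num_classes, size=2, replace=False)     # classes in this context
    labels = rng.integers(0, 2, size=n_context)               # in-context labels (0/1)
    xs = centroids[pair[labels]] + noise_std * rng.normal(size=(n_context, d))

    query_label = rng.integers(0, 2)
    query = centroids[pair[query_label]] + noise_std * rng.normal(size=d)

    # Flatten exemplars, their in-context labels, and the query for an MLP.
    mlp_input = np.concatenate([xs.ravel(), labels.astype(float), query])
    return mlp_input, query_label

centroids = np.random.default_rng(0).normal(size=(512, 8))    # 512 classes = high diversity
x, y = make_icl_classification_example(centroids)
```

Shrinking the centroid pool to a handful of classes lets a model solve the task by memorizing class identities in its weights (the in-weights regime); the IWL-to-ICL transition described above corresponds to varying this kind of diversity knob.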

Relational Tasks

The paper also examines relational reasoning tasks, a family of classification problems from psychology that are closely related to in-context classification and are used to probe higher-order cognitive processing. On these tasks, MLPs not only match but often outperform Transformers.

  1. Match-to-Sample: MLPs achieve lower loss than Transformers at matched compute, and demonstrate robust performance under out-of-distribution conditions.
  2. Sphere and Line Oddball Tasks: On these relational tasks, MLPs excel, generalizing better than Transformers in out-of-distribution tests. Specific architectural modifications, such as relationally bottlenecked MLPs, further improve performance, but only when the built-in relations align well with the task structure. A sketch of an oddball-style trial follows this list.
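To convey the flavor of these relational trials, here is a hypothetical line-oddball-style generator: all points but one lie on a common line, and the model must identify the displaced point. The construction and parameters below are assumptions for illustration and may differ from the paper's exact tasks.

```python
import numpy as np

def make_line_oddball_example(n_points=6, d=2, offset=2.0, rng=None):
    """Sample one line-oddball-style relational trial (illustrative sketch).

    All points but one lie on a random line; the remaining "oddball" point is
    displaced off the line, and the target is its index. Solving the task
    requires comparing points to one another rather than memorizing absolute
    positions, which is what makes it relational.
    """
    rng = rng or np.random.default_rng()
    origin = rng.normal(size=d)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)

    ts = rng.uniform(-1.0, 1.0, size=n_points)
    points = origin + ts[:, None] * direction       # points on the line

    # Displace one point perpendicular to the line to create the oddball.
    oddball = rng.integers(n_points)
    normal = rng.normal(size=d)
    normal -= (normal @ direction) * direction      # remove the parallel component
    normal /= np.linalg.norm(normal)
    points[oddball] += offset * normal

    mlp_input = points.ravel()                      # flattened input for an MLP
    return mlp_input, oddball                       # classify the oddball's index

x, label = make_line_oddball_example()
```

Out-of-distribution variants can then be built by, for example, enlarging the offset or shifting the line beyond the training range, which is the kind of generalization test on which the summary notes MLPs fare well.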

Discussion and Implications

The findings provide compelling evidence that ICL and relational reasoning can be performed efficiently by MLP architectures, challenging existing assumptions about the necessity of attention mechanisms for such tasks. The demonstrated capabilities of MLPs suggest practical advantages and encourage further exploration of their utility relative to more strongly inductively biased models such as Transformers.

The paper aligns with the heuristic that "less inductive bias is better," especially as compute and data continue to grow. Future research should examine MLPs' performance on more complex datasets and under data-limited conditions to understand the scalability and limitations of these findings.

Conclusion

This paper contributes significantly to the understanding of in-context learning and relational reasoning by simple neural networks. The results promote a broader perspective for exploring alternative architectures to Transformers for ICL tasks. By illustrating that MLPs can indeed learn in-context and perform sophisticated relational reasoning, the paper opens new avenues for further research into efficient and generalizable AI models.