
MLPs Learn In-Context on Regression and Classification Tasks (2405.15618v2)

Published 24 May 2024 in cs.LG and cs.NE

Abstract: In-context learning (ICL), the remarkable ability to solve a task from only input exemplars, is often assumed to be a unique hallmark of Transformer models. By examining commonly employed synthetic ICL tasks, we demonstrate that multi-layer perceptrons (MLPs) can also learn in-context. Moreover, MLPs, and the closely related MLP-Mixer models, learn in-context competitively with Transformers given the same compute budget in this setting. We further show that MLPs outperform Transformers on a series of classical tasks from psychology designed to test relational reasoning, which are closely related to in-context classification. These results underscore a need for studying in-context learning beyond attention-based architectures, while also challenging strong prior arguments about MLPs' limited ability to solve relational tasks. Altogether, our results highlight the unexpected competence of MLPs, and support the growing interest in all-MLP alternatives to task-specific architectures.

Authors (2)
  1. William L. Tong
  2. Cengiz Pehlevan

Summary

In-context Learning Beyond Transformers: An Evaluation of Multi-Layer Perceptrons

The paper investigates the in-context learning (ICL) capabilities of multi-layer perceptrons (MLPs), an ability traditionally considered a hallmark of Transformer models. The findings challenge the common belief that ICL is exclusive to attention-based architectures: MLPs, as well as the closely related MLP-Mixer models, learn in-context competitively with Transformers given the same compute budget. Notably, MLPs even outperform Transformers on a set of tasks designed to test relational reasoning.

Key Contributions

  1. Demonstration of In-context Learning in MLPs: The authors successfully show that MLPs can perform in-context learning similarly to Transformers, suggesting that the ability is not unique to attention-based models. This finding aligns with the universal approximation capability of MLPs, now extended to in-context scenarios.
  2. Superior Relational Reasoning: MLPs outperform Transformers on relational reasoning tasks, challenging the narrative that more sophisticated architectures with stronger inductive biases are always better suited for complex cognitive tasks.
  3. Less Inductive Bias is Better: The paper underscores that models with weaker inductive biases, such as MLPs, can outperform those with stronger biases as data and compute resources grow. This observation supports the broader "bitter lesson" heuristic, which posits that general methods tend to win out as compute increases.

Experiments and Results

The authors conduct a series of controlled experiments comparing MLPs, MLP-Mixers, and Transformers on synthetic tasks commonly used as ICL benchmarks.

In-context Regression and Classification

  1. ICL Regression: MLPs and MLP-Mixer models achieve near-optimal mean squared error (MSE), comparable to Transformers, on a series of ICL regression tasks. Although MLP performance degrades as the number of context points grows, the MLP-Mixer remains robust, highlighting the potential of architectures derived from MLPs. A sketch of how such a regression example can be constructed appears after this list.
  2. ICL Classification: In classification tasks, both MLPs and Transformers transition from in-weight learning (IWL) to ICL as data diversity increases. MLPs perform competitively with Transformers and handle varying numbers of context exemplars efficiently; a second sketch after this list illustrates the classification setup.
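To make the regression setting concrete, the sketch below shows one way a synthetic in-context linear-regression example could be constructed and flattened into a single fixed-length vector that a plain MLP can consume. The dimensions, context length, noise level, and flattening scheme are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def make_icl_regression_example(d=8, n_context=16, noise_std=0.1, rng=None):
    """Sample one synthetic in-context linear-regression example (sketch).

    A latent weight vector w defines the task; the model sees n_context
    (x, y) exemplars plus a query x and must predict the query's y.
    Dimension, context length, and noise level are illustrative choices.
    """
    rng = rng or np.random.default_rng()
    w = rng.normal(size=d)                      # latent task vector
    xs = rng.normal(size=(n_context + 1, d))    # context points + query
    ys = xs @ w + noise_std * rng.normal(size=n_context + 1)

    # Flatten the context pairs and append the query point, so a plain MLP
    # (with no attention or token structure) sees the whole context at once.
    context = np.concatenate([xs[:-1].ravel(), ys[:-1]])
    mlp_input = np.concatenate([context, xs[-1]])
    target = ys[-1]                             # regression target for the query
    return mlp_input, target

x, y = make_icl_regression_example()
print(x.shape)   # (d * n_context + n_context + d,) = (152,)
```

Because the context arrives as one flat vector, the MLP has no architectural notion of separate exemplars; any in-context behavior has to emerge from training over many such randomly drawn tasks.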
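In the same spirit, here is a minimal sketch of the classification setting, assuming context exemplars are noisy samples of class centroids and the query belongs to one of the classes shown in context. The size of the centroid pool stands in for the data diversity that drives the IWL-to-ICL transition; the two-classes-per-context design, noise level, and encoding are assumptions for illustration.

```python
import numpy as np

def make_icl_classification_example(centroids, n_context=8, noise_std=0.2, rng=None):
    """Sample one in-context classification example (sketch).

    `centroids` is a (num_classes, d) pool of class centers; a larger pool
    (greater data diversity) is what pushes models from in-weights learning
    toward in-context learning. Context exemplars are noisy copies of two
    sampled classes, and the query belongs to one of them.
    """
    rng = rng or np.random.default_rng()
    num_classes, d = centroids.shape
    pair = rng.choice(num_classes, size=2, replace=False)     # classes in this context
    labels = rng.integers(0, 2, size=n_context)               # in-context labels (0/1)
    xs = centroids[pair[labels]] + noise_std * rng.normal(size=(n_context, d))

    query_label = rng.integers(0, 2)
    query = centroids[pair[query_label]] + noise_std * rng.normal(size=d)

    # Flatten exemplars, their in-context labels, and the query for an MLP.
    mlp_input = np.concatenate([xs.ravel(), labels.astype(float), query])
    return mlp_input, query_label

centroids = np.random.default_rng(0).normal(size=(512, 8))    # 512 classes = high diversity
x, y = make_icl_classification_example(centroids)
```

Shrinking the centroid pool to a handful of classes lets a model solve the task by memorizing class identities in its weights (the in-weights regime); the IWL-to-ICL transition described above corresponds to varying this kind of diversity knob.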

Relational Tasks

The paper also examines relational reasoning tasks, a family of classification problems from psychology that are closely related to in-context classification and are used to probe higher-order cognitive processing. On these tasks, MLPs not only match but often outperform Transformers.

  1. Match-to-Sample: MLPs achieve lower loss than Transformers at matched compute, and demonstrate robust performance under out-of-distribution conditions.
  2. Sphere and Line Oddball Tasks: On these relational tasks, MLPs excel, generalizing better than Transformers in out-of-distribution tests. Specific architectural modifications, such as relationally bottlenecked MLPs, further improve performance, but only when the built-in relations align well with the task structure. A sketch of an oddball-style trial follows this list.
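To convey the flavor of these relational trials, here is a hypothetical line-oddball-style generator: all points but one lie on a common line, and the model must identify the displaced point. The construction and parameters below are assumptions for illustration and may differ from the paper's exact tasks.

```python
import numpy as np

def make_line_oddball_example(n_points=6, d=2, offset=2.0, rng=None):
    """Sample one line-oddball-style relational trial (illustrative sketch).

    All points but one lie on a random line; the remaining "oddball" point is
    displaced off the line, and the target is its index. Solving the task
    requires comparing points to one another rather than memorizing absolute
    positions, which is what makes it relational.
    """
    rng = rng or np.random.default_rng()
    origin = rng.normal(size=d)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)

    ts = rng.uniform(-1.0, 1.0, size=n_points)
    points = origin + ts[:, None] * direction       # points on the line

    # Displace one point perpendicular to the line to create the oddball.
    oddball = rng.integers(n_points)
    normal = rng.normal(size=d)
    normal -= (normal @ direction) * direction      # remove the parallel component
    normal /= np.linalg.norm(normal)
    points[oddball] += offset * normal

    mlp_input = points.ravel()                      # flattened input for an MLP
    return mlp_input, oddball                       # classify the oddball's index

x, label = make_line_oddball_example()
```

Out-of-distribution variants can then be built by, for example, enlarging the offset or shifting the line beyond the training range, which is the kind of generalization test on which the summary notes MLPs fare well.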

Discussion and Implications

The findings provide compelling evidence that ICL and relational reasoning can be performed efficiently by MLP architectures, challenging existing assumptions about the necessity of attention mechanisms for such tasks. The demonstrated capabilities of MLPs suggest practical advantages and encourage further exploration of their utility relative to more strongly inductively biased models such as Transformers.

The paper aligns with the heuristic that "less inductive bias is better," especially as compute and data continue to grow. Future research should examine MLPs' performance on more complex datasets and under data-limited conditions to understand the scalability and limitations of these findings.

Conclusion

This paper contributes significantly to the understanding of in-context learning and relational reasoning by simple neural networks. The results promote a broader perspective for exploring alternative architectures to Transformers for ICL tasks. By illustrating that MLPs can indeed learn in-context and perform sophisticated relational reasoning, the paper opens new avenues for further research into efficient and generalizable AI models.