Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks (2402.04248v2)

Published 6 Feb 2024 in cs.LG

Abstract: State-space models (SSMs), such as Mamba (Gu & Dao, 2023), have been proposed as alternatives to Transformer networks in language modeling, by incorporating gating, convolutions, and input-dependent token selection to mitigate the quadratic cost of multi-head attention. Although SSMs exhibit competitive performance, their in-context learning (ICL) capabilities, a remarkable emergent property of modern LLMs that enables task execution without parameter optimization, remain underexplored compared to Transformers. In this study, we evaluate the ICL performance of SSMs, focusing on Mamba, against Transformer models across various tasks. Our results show that SSMs perform comparably to Transformers in standard regression ICL tasks, while outperforming them in tasks like sparse parity learning. However, SSMs fall short in tasks involving non-standard retrieval functionality. To address these limitations, we introduce a hybrid model, MambaFormer, that combines Mamba with attention blocks, surpassing individual models in tasks where they struggle independently. Our findings suggest that hybrid architectures offer promising avenues for enhancing ICL in LLMs.

Overview

State-space models (SSMs) like Mamba have emerged as potential alternatives to Transformer networks for tasks such as language modeling. The paper examines the capabilities of SSMs, particularly Mamba, on in-context learning (ICL) tasks compared to Transformers. It also explores a hybrid model named MambaFormer, which combines both architectures to capitalize on the strengths of each.

ICL Performance of SSMs

LLMs, typified by Transformers, are known for their ICL capabilities: they can execute tasks from a handful of in-context examples without any parameter updates. Despite their efficiency, SSMs have been less studied in this regard. This paper assesses the ICL potential of SSMs, particularly the Mamba model, across a spectrum of tasks. The evaluation shows that SSMs display competitive ICL performance, matching Transformers on most tasks. Notably, Mamba excels at sparse parity learning but shows limitations in tasks like vector-valued multi-query associative recall (MQAR), where retrieval is key.
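To make the evaluation setup concrete, the sketch below constructs one in-context regression prompt in the style commonly used for such studies: (x, y) pairs are interleaved into a single token sequence, and the model must predict the label of the final query from context alone. The dimensions, noise level, and helper name are illustrative assumptions, not the paper's exact configuration.

```python
import torch

def make_linear_regression_prompt(n_points=20, dim=8, noise_std=0.0):
    """Build one in-context linear-regression example (illustrative setup).

    The model sees alternating (x_i, y_i) tokens and must predict the label
    of the final query point purely from the in-context examples.
    """
    w = torch.randn(dim)                        # task-specific weight vector
    xs = torch.randn(n_points, dim)             # context inputs + final query
    ys = xs @ w + noise_std * torch.randn(n_points)

    # Embed each scalar label as a dim-wide token by zero-padding, then
    # interleave into the sequence x_1, y_1, x_2, y_2, ..., x_n, y_n.
    y_tokens = torch.zeros(n_points, dim)
    y_tokens[:, 0] = ys
    prompt = torch.stack([xs, y_tokens], dim=1).reshape(2 * n_points, dim)

    # Drop the final y token: that is what the model has to predict.
    return prompt[:-1], ys[-1]

prompt, target = make_linear_regression_prompt()
print(prompt.shape)  # torch.Size([39, 8])
```

The sparse parity and MQAR tasks follow the same prompt-as-sequence pattern, differing only in the label function and the retrieval structure of the context.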

Hybrid Model: MambaFormer

To address SSMs' shortcomings, the research introduces a hybrid model, MambaFormer, which merges Mamba with the multi-head attention layers of Transformers. The hybrid surpasses both standalone architectures on the tasks where each individually fails: Mamba handles sparse parity learning, where Transformers falter, and Transformers handle retrieval tasks such as MQAR, where Mamba struggles, while MambaFormer performs well across all evaluated tasks, including retrieval.
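A minimal sketch of the hybrid idea is shown below: each layer applies a Mamba (SSM) sub-block followed by a causal multi-head attention sub-block, both with residual connections. It assumes the open-source `mamba_ssm` package for the Mamba layer; the layer count, widths, block ordering, and read-out head are illustrative assumptions rather than the paper's exact MambaFormer configuration.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumed dependency (requires its GPU kernels)

class HybridBlock(nn.Module):
    """One hybrid layer: Mamba mixing followed by causal self-attention."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.mamba = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Residual Mamba (SSM) sub-block.
        x = x + self.mamba(self.norm1(x))
        # Residual causal self-attention sub-block.
        h = self.norm2(x)
        causal = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool,
                                       device=x.device), diagonal=1)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        return x + attn_out

class HybridModel(nn.Module):
    """Stack of hybrid blocks with a linear read-in/read-out for ICL tasks."""
    def __init__(self, in_dim=8, d_model=256, n_layers=4):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model)
        self.blocks = nn.ModuleList([HybridBlock(d_model) for _ in range(n_layers)])
        self.head = nn.Linear(d_model, 1)

    def forward(self, prompt):                # prompt: (batch, seq, in_dim)
        x = self.embed(prompt)
        for blk in self.blocks:
            x = blk(x)
        return self.head(x[:, -1])            # predict label of the final query
```

The intuition, in the paper's framing, is that the attention sub-blocks supply the retrieval ability that pure SSMs lack, while the Mamba sub-blocks retain the strengths that help on tasks such as sparse parity.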

Strong Numerical Results & Key Insights

The paper presents strong numerical results, most notably in Table 1, which summarizes performance across models and tasks using a simple labeling system (✓ for success, ✗ for failure, and ▲ for a performance gap). MambaFormer earns a ✓ on every task, signaling all-around proficiency. On harder ICL tasks such as decision tree learning and sparse parity, the hybrid model leverages both of its components, showing clear gains over the individual architectures.
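To make the difficulty of these tasks concrete, the sketch below generates one in-context sparse parity prompt: the label of each input is the XOR of a hidden subset of k coordinates, and the model must infer that subset from the context examples. The values of n_bits and k and the ±1 token encoding are illustrative assumptions, not the paper's exact settings.

```python
import torch

def make_sparse_parity_prompt(n_points=64, n_bits=20, k=3):
    """In-context sparse parity (illustrative setup).

    Each prompt samples a fresh hidden subset of k relevant bits; the label
    of every input is the parity (XOR) of those bits, and the model must
    predict the parity of the final query from the context alone.
    """
    subset = torch.randperm(n_bits)[:k]                  # hidden relevant bits
    xs = torch.randint(0, 2, (n_points, n_bits)).float()
    ys = xs[:, subset].sum(dim=1).remainder(2)           # parity of chosen bits

    # Map bits and labels to +/-1 tokens and interleave as x_1, y_1, ..., x_n.
    x_tok = 2 * xs - 1
    y_tok = torch.zeros(n_points, n_bits)
    y_tok[:, 0] = 2 * ys - 1
    prompt = torch.stack([x_tok, y_tok], dim=1).reshape(2 * n_points, n_bits)
    return prompt[:-1], ys[-1]
```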

Conclusion and Future Directions

Concluding that SSMs and Transformers each bring distinct advantages to ICL tasks, the paper proposes that hybrid architectures like MambaFormer be explored further to enhance the ICL capabilities of LLMs. The researchers acknowledge the limitations of focusing on non-language ICL tasks and smaller model scales, but see no fundamental hurdle to Mamba's ICL performance. Future research may compare SSM and Transformer architectures on more general ICL tasks in language settings at larger parameter scales, potentially offering new insights into the architecture of LLMs.

Authors (8)
  1. Jongho Park (92 papers)
  2. Jaeseung Park (2 papers)
  3. Zheyang Xiong (6 papers)
  4. Nayoung Lee (6 papers)
  5. Jaewoong Cho (26 papers)
  6. Samet Oymak (94 papers)
  7. Kangwook Lee (70 papers)
  8. Dimitris Papailiopoulos (59 papers)
Citations (50)