Overview
State-space models (SSMs) such as Mamba have emerged as potential alternatives to Transformer networks for tasks such as language modeling. The paper examines the capabilities of SSMs, particularly Mamba, on in-context learning (ICL) tasks compared to Transformers. It also introduces a hybrid model, MambaFormer, that combines both architectures in order to capitalize on the strengths of each.
ICL Performance of SSMs
Large language models, typified by Transformers, are known for their ICL capabilities: they can execute new tasks from a handful of in-prompt examples with no parameter updates. Despite their efficiency advantages, SSMs have received far less study in this regard. The paper therefore evaluates the ICL potential of SSMs, particularly the Mamba model, across a spectrum of tasks. SSMs display competitive ICL performance, matching Transformers on most tasks. Notably, Mamba excels at sparse parity learning but falls short on retrieval-heavy tasks such as vector-valued multi-query associative recall (MQAR).
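To make the sparse parity setup concrete, the following is a minimal sketch of how an in-context prompt for the task could be generated. The dimensionality, subset size, prompt length, and the function name `sparse_parity_prompt` are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: generating one in-context sparse-parity episode.
# The label of each input is the parity (product) of a hidden subset of
# k coordinates; the in-context learner must infer that subset from the
# labeled examples in the prompt and predict the label of the query.
import torch


def sparse_parity_prompt(n_examples=40, dim=20, k=3, generator=None):
    """Return (xs, ys): n_examples inputs in {-1, +1}^dim and their parity labels."""
    subset = torch.randperm(dim, generator=generator)[:k]        # hidden relevant coordinates
    xs = torch.randint(0, 2, (n_examples, dim), generator=generator) * 2 - 1  # random +/-1 inputs
    ys = xs[:, subset].prod(dim=1)                               # parity over the hidden subset
    return xs.float(), ys.float()


# The model sees (x_1, y_1), ..., (x_{n-1}, y_{n-1}), x_n in a single prompt
# and must predict y_n without any gradient update.
xs, ys = sparse_parity_prompt()
```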
Hybrid Model: MambaFormer
To address SSMs' shortcomings, the paper introduces a hybrid model, MambaFormer, which combines Mamba blocks with the multi-head attention layers of Transformers. MambaFormer succeeds on tasks where the standalone models individually fail: for instance, Mamba handles sparse parity learning, where Transformers falter, yet the hybrid performs well across all evaluated tasks, including retrieval. A minimal architectural sketch follows below.
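The sketch below shows one plausible way to interleave the two components in PyTorch. The exact layer arrangement (an initial Mamba block in place of positional encodings, followed by blocks pairing multi-head attention with Mamba) is a reading of the paper's high-level description, not a verbatim reimplementation; the `mamba-ssm` package is assumed to provide the `Mamba` module, and `HybridBlock`, `MambaFormerSketch`, and all hyperparameters are illustrative choices.

```python
# A minimal MambaFormer-style hybrid stack (sketch, not the paper's exact model).
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumed dependency providing the Mamba block


class HybridBlock(nn.Module):
    """One hybrid layer: causal multi-head self-attention followed by a Mamba
    block, each with a pre-norm residual connection."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mamba_norm = nn.LayerNorm(d_model)
        self.mamba = Mamba(d_model=d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.attn_norm(x)
        # Causal mask so each position only attends to earlier in-context examples.
        L = x.size(1)
        mask = torch.triu(torch.ones(L, L, dtype=torch.bool, device=x.device), 1)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        x = x + self.mamba(self.mamba_norm(x))
        return x


class MambaFormerSketch(nn.Module):
    """Embedding -> initial Mamba block (standing in for positional encoding)
    -> stack of hybrid blocks -> linear read-out."""

    def __init__(self, d_in: int, d_model: int = 256, n_heads: int = 8, n_layers: int = 4):
        super().__init__()
        self.embed = nn.Linear(d_in, d_model)
        self.input_mamba = Mamba(d_model=d_model)
        self.blocks = nn.ModuleList([HybridBlock(d_model, n_heads) for _ in range(n_layers)])
        self.readout = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.input_mamba(self.embed(x))
        for block in self.blocks:
            x = block(x)
        return self.readout(x)
```

The intuition behind the interleaving is that the Mamba blocks supply efficient sequence mixing where Transformers struggle (e.g., sparse parity), while the attention layers supply the precise token-level retrieval that Mamba lacks on tasks such as MQAR.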
Strong Numerical Results & Key Insights
The paper presents strong numerical results, notably in Table 1, which summarizes performance across models and tasks using a simple labeling scheme (✓ for success, ✗ for failure, and ▲ for a performance gap). MambaFormer earns a ✓ on every task, signaling its all-around proficiency. On harder ICL tasks such as decision tree learning and sparse parity, the hybrid model leverages both of its components and shows clear gains over either architecture alone.
Conclusion and Future Directions
Concluding that SSMs and Transformers each offer distinct advantages for ICL, the paper proposes that hybrid architectures such as MambaFormer be explored further to strengthen the ICL capabilities of language models. The authors acknowledge the limitation of focusing on non-language ICL tasks and smaller model scales, but they see no fundamental obstacle to Mamba's ICL performance. Future work could compare SSM and Transformer architectures on more general ICL tasks in language settings and at larger parameter scales, potentially yielding new insights into the architecture of LLMs.