Distil-xLSTM: Learning Attention Mechanisms through Recurrent Structures (2503.18565v1)
Abstract: The current era of NLP is dominated by Transformer models. However, novel architectures relying on recurrent mechanisms, such as xLSTM and Mamba, have been proposed as alternatives to attention-based models. Although their computation differs from the attention mechanism, these recurrent models yield good results and sometimes even outperform state-of-the-art attention-based models. In this work, we propose Distil-xLSTM, an xLSTM-based Small Language Model (SLM) trained by distilling knowledge from a Large Language Model (LLM), which shows promising results while being compute and scale efficient. Our Distil-xLSTM focuses on approximating the attention parametrization of a transformer-based model with its recurrent sequence-mixing components and shows good results with minimal training.
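The abstract describes training the xLSTM student by distilling knowledge from a transformer-based teacher. As context, the sketch below shows a standard logit-level knowledge distillation objective (soft-target KL term plus hard-label cross-entropy); it is a minimal illustration of the general technique, not the authors' exact loss, and the temperature and alpha values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Generic knowledge-distillation loss (illustrative, not the paper's recipe).

    student_logits / teacher_logits: (batch, seq_len, vocab_size)
    labels: (batch, seq_len) ground-truth token ids
    temperature, alpha: assumed hyperparameters for illustration only.
    """
    # Soften both distributions with the temperature before comparing them.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)

    # KL divergence between teacher and student token distributions,
    # scaled by T^2 as in standard knowledge distillation.
    kd_loss = F.kl_div(soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2

    # Ordinary next-token cross-entropy against the hard labels.
    ce_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )

    # Weighted combination of soft-target and hard-label terms.
    return alpha * kd_loss + (1.0 - alpha) * ce_loss
```

In this setup the teacher's logits come from the frozen transformer LLM and the student's from the xLSTM model; the weighting between the distillation and cross-entropy terms is a design choice that the paper itself specifies.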