- The paper introduces AxLSTM, an xLSTM-based model built on mLSTM blocks that reduces parameter count by up to 45% while improving relative performance by up to 20% compared to transformer-based SSAST baselines.
- It employs masked spectrogram patches and a matrix memory cell design to efficiently learn self-supervised audio features.
- Extensive experiments across ten downstream tasks demonstrate that AxLSTM generalizes well, with the AxLSTM-Base configuration improving aggregate performance by roughly 30% relative to its SSAST counterpart.
Overview of "Audio xLSTMs: Learning Self-Supervised Audio Representations with xLSTMs"
The paper "Audio xLSTMs: Learning Self-Supervised Audio Representations with xLSTMs", by Sarthak Yadav, Sergios Theodoridis, and Zheng-Hua Tan of Aalborg University, investigates whether extended long short-term memory (xLSTM) models can learn strong self-supervised audio representations. It surveys a landscape currently dominated by transformers and proposes xLSTM as a competitive alternative.
Introduction
Transformers have become the dominant models across machine learning tasks thanks to their self-attention mechanism and versatility. However, the quadratic complexity of attention has pushed researchers to explore more computationally efficient architectures. xLSTM represents a novel direction in this line of inquiry, aiming to mitigate intrinsic limitations of traditional LSTMs, such as the inability to revise storage decisions and the lack of parallelizability caused by memory mixing.
The paper introduces Audio xLSTM (AxLSTM), which learns audio representations from masked spectrogram patches in a self-supervised framework. Pretrained on the large-scale AudioSet dataset, AxLSTM outperforms self-supervised audio spectrogram transformer (SSAST) baselines, delivering up to a 20% relative performance improvement with up to 45% fewer parameters.
Method
The AxLSTM model builds on the xLSTM framework, specifically its mLSTM block. The mLSTM addresses traditional LSTM limitations through a matrix memory cell and exponential gating, which allows storage decisions to be revised. The recurrence scales linearly with sequence length and requires no key-value cache, which is central to its efficiency advantage over transformers.
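To make this concrete, below is a minimal NumPy sketch of the mLSTM recurrence described in the xLSTM paper. Names, shapes, and the sigmoid forget gate are illustrative choices, and the log-space stabilization used in practice is omitted; this is not the authors' implementation.

```python
# Minimal NumPy sketch of a single mLSTM recurrence step (illustrative only).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlstm_step(x_t, C_prev, n_prev, params, d):
    """One step over a d x d matrix memory C with scalar input/forget gates."""
    Wq, Wk, Wv, w_i, w_f, Wo = params
    q_t = Wq @ x_t                           # query
    k_t = (Wk @ x_t) / np.sqrt(d)            # scaled key
    v_t = Wv @ x_t                           # value
    i_t = np.exp(w_i @ x_t)                  # exponential input gate (scalar)
    f_t = sigmoid(w_f @ x_t)                 # forget gate (sigmoid variant, scalar)
    o_t = sigmoid(Wo @ x_t)                  # output gate (element-wise)

    C_t = f_t * C_prev + i_t * np.outer(v_t, k_t)   # matrix memory update
    n_t = f_t * n_prev + i_t * k_t                  # normalizer state
    h_t = o_t * (C_t @ q_t) / max(abs(n_t @ q_t), 1.0)
    return h_t, C_t, n_t

# Scan a toy sequence of patch embeddings.
d, rng = 8, np.random.default_rng(0)
params = ([rng.normal(size=(d, d)) * 0.1 for _ in range(3)]
          + [rng.normal(size=d) * 0.1 for _ in range(2)]
          + [rng.normal(size=(d, d)) * 0.1])
C, n = np.zeros((d, d)), np.zeros(d)
for x_t in rng.normal(size=(16, d)):
    h, C, n = mlstm_step(x_t, C, n, params, d)
```

Because the state is a fixed-size matrix rather than a growing key-value cache, memory cost stays constant with sequence length, which is the efficiency argument made above.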
For AxLSTM, the input audio is converted into spectrogram patches, a subset of which is masked; reconstructing the masked content provides the self-supervised objective. The self-attention blocks of a transformer encoder are replaced with mLSTM blocks, which enable efficient memory use and sequence processing. The model is pretrained with masked spectrogram-patch modeling on the AudioSet dataset.
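As a sketch of the masked-patch input pipeline, the snippet below splits a log-mel spectrogram into patches and masks a random subset. The patch size, mask ratio, and zero-out masking are illustrative assumptions rather than the paper's exact recipe.

```python
# Hedged sketch of spectrogram patchification and random masking for
# self-supervised pretraining; values here are illustrative placeholders.
import numpy as np

def patchify(spec, patch_h=16, patch_w=16):
    """Split a (freq, time) log-mel spectrogram into flattened patches."""
    F, T = spec.shape
    F, T = F - F % patch_h, T - T % patch_w            # drop ragged edges
    return (spec[:F, :T]
            .reshape(F // patch_h, patch_h, T // patch_w, patch_w)
            .transpose(0, 2, 1, 3)
            .reshape(-1, patch_h * patch_w))           # (num_patches, patch_dim)

def random_mask(patches, mask_ratio=0.5, seed=0):
    """Zero out a random subset of patches; the model learns to reconstruct them."""
    rng = np.random.default_rng(seed)
    n_masked = int(patches.shape[0] * mask_ratio)
    masked_idx = rng.permutation(patches.shape[0])[:n_masked]
    corrupted = patches.copy()
    corrupted[masked_idx] = 0.0
    return corrupted, masked_idx

spec = np.random.randn(80, 992)          # fake 80-band log-mel spectrogram
corrupted, masked_idx = random_mask(patchify(spec))
```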
Experiments
The paper evaluates AxLSTM models against a range of popular audio representation models across ten diverse downstream tasks. The experiments reveal several noteworthy observations:
- AxLSTM models consistently outperformed SSAST baselines in terms of both accuracy and parameter efficiency.
- The different AxLSTM configurations (Tiny, Small, Base) all showed substantial improvements in aggregate performance while using fewer parameters than their SSAST counterparts.
- AxLSTMs demonstrated better generalization across an array of tasks, further validating the applicability of xLSTMs in audio representation learning.
Table comparisons in the paper highlight the performance gains, illustrating that the AxLSTM-Base configuration improves aggregate performance by approximately 30% relative to its transformer-based SSAST counterpart.
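The downstream protocol itself is not detailed here; as a general illustration of how frozen self-supervised features are typically benchmarked, the sketch below trains a simple linear probe on pooled embeddings. The probe choice and the placeholder data are assumptions and may differ from the paper's exact evaluation setup.

```python
# General sketch of frozen-feature evaluation with a linear probe; the exact
# probe architecture and tasks used in the paper may differ.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def linear_probe_accuracy(train_x, train_y, test_x, test_y):
    """Fit a linear classifier on frozen embeddings and report test accuracy."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(train_x, train_y)
    return accuracy_score(test_y, probe.predict(test_x))

# Placeholder arrays standing in for pooled per-clip AxLSTM embeddings.
rng = np.random.default_rng(0)
train_x, test_x = rng.normal(size=(500, 768)), rng.normal(size=(100, 768))
train_y, test_y = rng.integers(0, 10, 500), rng.integers(0, 10, 100)
print(linear_probe_accuracy(train_x, train_y, test_x, test_y))
```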
Ablations
The paper conducts a series of ablation studies to assess the impact of different architectural choices, gathered in the hypothetical configuration sketch after this list:
- Expansion Factor: The results indicate that while an expansion factor of three improves performance, an expansion factor of four tends to overfit, leading to performance degradation.
- Patch Size: Smaller patch sizes yield better performance, suggesting that finer granularity in input representation benefits the model.
- Design Choices: Dropping exponential gating for the input and forget gates yields a marginal improvement, and removing the sequence flipping applied in even-numbered blocks improves results significantly, underscoring that design choices do not transfer unchanged between modalities (audio versus vision).
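Gathering these choices in one place, a hypothetical configuration object might look as follows; field names and defaults are assumptions for illustration, not the authors' code.

```python
# Hypothetical configuration gathering the ablated hyperparameters; names and
# defaults are illustrative assumptions, not the paper's exact settings.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class AxLSTMConfig:
    embed_dim: int = 768                    # model width (assumed, Base-like)
    num_blocks: int = 12                    # stacked mLSTM blocks (assumed)
    expansion_factor: int = 3               # block up-projection (ablated: 3 vs. 4)
    patch_size: Tuple[int, int] = (16, 16)  # spectrogram patch size (ablated)
    exponential_gating: bool = True         # ablated: dropping it gave a marginal gain
    flip_even_blocks: bool = False          # ablated: flipping hurt on audio
```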
Conclusion
The AxLSTM model presents a promising approach for learning self-supervised audio representations, leveraging the architectural advances in xLSTMs. The experimental results underscore the model's efficacy, showcasing substantial performance improvements over existing transformer-based models while maintaining lower parameter counts. These findings open avenues for further exploration of recurrent architectures in self-supervised learning paradigms.
Implications and Future Work
The research carries both theoretical and practical implications. Theoretically, it positions xLSTMs as a viable contender to transformers, especially in tasks demanding long-sequence modeling and efficient memory usage. Practically, it could lead to more resource-efficient models deployable on devices with limited computational power.
Future developments could explore hybrid models that incorporate the strengths of both transformers and xLSTMs. Moreover, extending the AxLSTM framework to other domains, such as video or multimodal tasks, could further substantiate the versatility of xLSTMs. Enhanced normalization techniques and optimizations for very large datasets are also prospective areas for significant research contributions.