
Audio xLSTMs: Learning Self-Supervised Audio Representations with xLSTMs (2408.16568v3)

Published 29 Aug 2024 in cs.SD and eess.AS

Abstract: While the transformer has emerged as the eminent neural architecture, several independent lines of research have emerged to address its limitations. Recurrent neural approaches have observed a lot of renewed interest, including the extended long short-term memory (xLSTM) architecture, which reinvigorates the original LSTM. However, while xLSTMs have shown competitive performance compared to the transformer, their viability for learning self-supervised general-purpose audio representations has not been evaluated. This work proposes Audio xLSTM (AxLSTM), an approach for learning audio representations from masked spectrogram patches in a self-supervised setting. Pretrained on the AudioSet dataset, the proposed AxLSTM models outperform comparable self-supervised audio spectrogram transformer (SSAST) baselines by up to 25% in relative performance across a set of ten diverse downstream tasks while having up to 45% fewer parameters.

Summary

  • The paper introduces AxLSTM, an mLSTM-based model that uses up to 45% fewer parameters while improving relative performance by up to 25% compared to transformer-based baselines.
  • It employs masked spectrogram patches and a matrix memory cell design to efficiently learn self-supervised audio features.
  • Extensive experiments across ten downstream tasks demonstrate that AxLSTM generalizes well, achieving up to a 25% relative performance improvement over SSAST.

Overview of "Audio xLSTMs: Learning Self-Supervised Audio Representations with xLSTMs"

The paper "Audio xLSTMs: Learning Self-supervised audio representations with xLSTMs" explores the viability of extended long short-term memory (xLSTM) models in learning self-supervised audio representations. Developed by Sarthak Yadav, Sergios Theodoridis, and Zheng-Hua Tan from Aalborg University, the paper explores the current landscape dominated by transformers and proposes xLSTM as a competitive alternative.

Introduction

Transformers have become the dominant models in various machine learning tasks due to their self-attention mechanism and versatility. However, the quadratic complexity of the attention operation has led researchers to explore models with more efficient computation. xLSTM represents a novel direction in this line of inquiry, aiming to mitigate intrinsic limitations of the traditional LSTM, such as the inability to revise storage decisions and the lack of parallelizability caused by memory mixing.

The paper introduces Audio xLSTM (AxLSTM), which learns audio representations from masked spectrogram patches in a self-supervised framework. Pretrained on the large-scale AudioSet dataset, AxLSTM outperforms self-supervised audio spectrogram transformer (SSAST) baselines, using up to 45% fewer parameters while improving relative performance by up to 25%.

Method

The AxLSTM model builds on the xLSTM framework, focusing in particular on the mLSTM block. The mLSTM addresses traditional LSTM limitations through a matrix memory cell and exponential gating, which allows storage decisions to be revised. The recurrence scales linearly with sequence length and avoids storing a growing key-value cache, giving it an efficiency advantage over transformers.
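
To make the recurrence concrete, here is a minimal NumPy sketch of the mLSTM update as formulated in the xLSTM paper; the function and variable names are illustrative and not taken from the authors' code. Note that the state, a d-by-d matrix plus a d-dimensional normalizer, stays fixed in size regardless of sequence length.

```python
import numpy as np

def mlstm_step(C, n, q, k, v, i_gate, f_gate, o_gate):
    """One step of the mLSTM recurrence with a matrix memory cell.
    C: (d, d) memory, n: (d,) normalizer, q/k/v: (d,) projections,
    i_gate/f_gate: scalar input/forget gates, o_gate: (d,) output gate."""
    d = k.shape[0]
    k = k / np.sqrt(d)                        # scaled key, as in attention
    C = f_gate * C + i_gate * np.outer(v, k)  # covariance-style memory update
    n = f_gate * n + i_gate * k               # normalizer update
    h_tilde = C @ q / max(abs(n @ q), 1.0)    # normalized retrieval
    return C, n, o_gate * h_tilde             # gated hidden state

# Toy roll-out: the recurrent state is O(d^2), independent of sequence length T.
rng = np.random.default_rng(0)
d, T = 8, 32
C, n = np.zeros((d, d)), np.zeros(d)
for _ in range(T):
    q, k, v = rng.normal(size=(3, d))
    i_g = np.exp(rng.normal() - 3.0)               # exponential input gate
    f_g = 1.0 / (1.0 + np.exp(-rng.normal()))      # sigmoid forget gate
    o_g = 1.0 / (1.0 + np.exp(-rng.normal(size=d)))
    C, n, h = mlstm_step(C, n, q, k, v, i_g, f_g, o_g)
```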

For AxLSTM, the input audio is converted into spectrogram patches, a random subset of which is masked for self-supervised learning. The self-attention blocks common in transformers are replaced with mLSTM blocks, which process the patch sequence with efficient memory utilization. The model is pretrained using masked modeling of spectrogram patches from the AudioSet dataset.
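
The sketch below illustrates the patch extraction and masking step in NumPy, assuming non-overlapping patches and a simple zero-out masking scheme; the patch size and mask ratio are illustrative placeholders rather than the paper's exact settings.

```python
import numpy as np

def patchify_and_mask(spec, patch_f=16, patch_t=16, mask_ratio=0.5, seed=0):
    """Split a (n_mels, n_frames) spectrogram into non-overlapping patches
    and zero out a random subset for masked-prediction pretraining."""
    n_mels, n_frames = spec.shape
    F, T = n_mels // patch_f, n_frames // patch_t
    patches = (spec[:F * patch_f, :T * patch_t]
               .reshape(F, patch_f, T, patch_t)
               .transpose(0, 2, 1, 3)          # (F, T, patch_f, patch_t)
               .reshape(F * T, patch_f * patch_t))
    rng = np.random.default_rng(seed)
    mask = rng.random(F * T) < mask_ratio      # True = patch is masked
    corrupted = patches.copy()
    corrupted[mask] = 0.0                      # masked patches are the prediction targets
    return corrupted, patches, mask

# Example: a 10-second clip with 128 mel bins and 100 frames per second.
spec = np.random.randn(128, 1000).astype(np.float32)
corrupted, targets, mask = patchify_and_mask(spec)
print(corrupted.shape, mask.mean())  # (496, 256), roughly the mask ratio
```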

Experiments

The paper evaluates AxLSTM models against a range of popular audio representation models across ten diverse downstream tasks. The experiments reveal several noteworthy observations:

  1. AxLSTM models consistently outperformed SSAST baselines in terms of both accuracy and parameter efficiency.
  2. Different AxLSTM configurations (Tiny, Small, Base) showed substantial improvements in aggregate performance while remaining more parameter-efficient than their SSAST counterparts.
  3. AxLSTMs demonstrated better generalization across an array of tasks, further validating the applicability of xLSTMs in audio representation learning.

Table comparisons in the paper highlight these gains, showing that the AxLSTM configurations improve aggregate performance by up to 25% relative to their transformer-based SSAST counterparts.

Ablations

The paper conducts a series of ablation studies to assess the impact of different architectural choices:

  1. Expansion Factor: An expansion factor of three improves performance, whereas a factor of four tends to overfit and degrades performance (see the parameter-count sketch after this list).
  2. Patch Size: Smaller patch sizes yield better performance, suggesting that finer granularity in input representation benefits the model.
  3. Design Choices: Removing exponential gating from the input and forget gates yields a marginal improvement, and removing the sequence flipping applied in even-numbered blocks significantly improves results, underscoring that design choices borrowed from the vision domain do not transfer directly to audio.
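
As a back-of-the-envelope illustration of the expansion-factor ablation, the sketch below assumes an xLSTM-style pre-up-projection block whose inner width equals the expansion factor times the model dimension; both the formula and the model width of 192 are illustrative assumptions, not the paper's exact parameter accounting.

```python
# Rough parameter count for the up/down projections of one mLSTM block,
# under the assumption that the block up-projects to expansion * d_model
# (illustrative only, not the paper's exact accounting).
def approx_block_params(d_model: int, expansion: int) -> int:
    d_inner = expansion * d_model
    up_proj = d_model * 2 * d_inner   # joint up-projection into the mLSTM and gating paths
    down_proj = d_inner * d_model     # projection back down to d_model
    return up_proj + down_proj

for e in (2, 3, 4):
    print(f"expansion={e}: ~{approx_block_params(192, e):,} params per block")
```

Under this assumption the block size grows roughly linearly with the expansion factor, which is consistent with the observation that a factor of four adds capacity the model tends to overfit.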

Conclusion

The AxLSTM model presents a promising approach for learning self-supervised audio representations, leveraging the architectural advances in xLSTMs. The experimental results underscore the model's efficacy, showcasing substantial performance improvements over existing transformer-based models while maintaining lower parameter counts. These findings open avenues for further exploration of recurrent architectures in self-supervised learning paradigms.

Implications and Future Work

The research suggests substantial theoretical and practical implications. Theoretically, it positions xLSTMs as a viable contender against transformers, especially in tasks demanding long sequence modeling and efficient memory usage. Practically, this development could lead to more resource-efficient models deployable on devices with limited computational power.

Future developments could explore hybrid models that combine the strengths of transformers and xLSTMs. Moreover, extending the AxLSTM framework to other domains, such as video or multimodal tasks, could further substantiate the versatility of xLSTMs. Improved normalization techniques and optimizations for very large datasets are also promising directions for future research.