xLSTM-SENet: xLSTM for Single-Channel Speech Enhancement (2501.06146v2)

Published 10 Jan 2025 in cs.SD, cs.AI, and eess.AS

Abstract: While attention-based architectures, such as Conformers, excel in speech enhancement, they face challenges such as scalability with respect to input sequence length. In contrast, the recently proposed Extended Long Short-Term Memory (xLSTM) architecture offers linear scalability. However, xLSTM-based models remain unexplored for speech enhancement. This paper introduces xLSTM-SENet, the first xLSTM-based single-channel speech enhancement system. A comparative analysis reveals that xLSTM, and notably even LSTM, can match or outperform state-of-the-art Mamba- and Conformer-based systems across various model sizes in speech enhancement on the VoiceBank+Demand dataset. Through ablation studies, we identify key architectural design choices such as exponential gating and bidirectionality contributing to its effectiveness. Our best xLSTM-based model, xLSTM-SENet2, outperforms state-of-the-art Mamba- and Conformer-based systems of similar complexity on the VoiceBank+Demand dataset.

Summary

The paper introduces xLSTM-SENet, a single-channel speech enhancement system built on the xLSTM architecture, to address the poor scaling of attention-based models such as Conformers with input sequence length.
Key findings show that xLSTM-SENet matches or outperforms state-of-the-art Mamba- and Conformer-based models on the VoiceBank+Demand dataset across metrics such as PESQ, CSIG, and COVL.
From a practical perspective, xLSTM-SENet offers a scalable option for real-world applications that require bandwidth efficiency and low computational overhead, such as hearing aids and ASR.

Overview of xLSTM-SENet: xLSTM for Single-Channel Speech Enhancement

The paper "xLSTM-SENet: xLSTM for Single-Channel Speech Enhancement" applies the Extended Long Short-Term Memory (xLSTM) architecture to single-channel speech enhancement. It targets a known limitation of prevalent attention-based architectures such as Conformers: despite their strong performance, they scale poorly with input sequence length.

The proposed framework, xLSTM-SENet, is built on xLSTM, an extension of the traditional LSTM that addresses known LSTM limitations through exponential gating, a matrix memory, and, in its mLSTM variant, the removal of memory mixing so that the recurrence can be parallelized. According to the authors, this is the first xLSTM-based system for speech enhancement, that is, for denoising speech corrupted by background noise and improving its intelligibility.
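To make exponential gating and matrix memory concrete, here is a minimal pure-Python sketch of one step of the mLSTM recurrence as described in the original xLSTM paper. It is not code from xLSTM-SENet. The function name `mlstm_step`, the single-head layout, and passing the gate pre-activations as plain scalars are assumptions made for illustration; a real layer would compute the query, key, value, and gates from learned projections.

```python
import math


def mlstm_step(C, n, m, q, k, v, i_pre, f_pre, o_gate):
    """One stabilized mLSTM step for a single head (illustrative sketch).

    C: d x d matrix memory, n: length-d normalizer, m: scalar stabilizer state.
    q, k, v: length-d query/key/value vectors.
    i_pre, f_pre: scalar input/forget gate pre-activations.
    o_gate: length-d output gate values, assumed already in (0, 1).
    """
    d = len(q)
    k = [x / math.sqrt(d) for x in k]  # scale keys by 1/sqrt(d)

    # Exponential input gate i = exp(i_pre); sigmoid forget gate, in log space.
    log_i = i_pre
    log_f = min(f_pre, 0.0) - math.log1p(math.exp(-abs(f_pre)))

    # Stabilizer: subtract a running maximum so the exponentials stay bounded.
    m_new = max(log_f + m, log_i)
    i_gate = math.exp(log_i - m_new)
    f_gate = math.exp(log_f + m - m_new)

    # Matrix memory update: C_t = f * C_{t-1} + i * (v k^T).
    C_new = [[f_gate * C[r][c] + i_gate * v[r] * k[c] for c in range(d)]
             for r in range(d)]
    # Normalizer update: n_t = f * n_{t-1} + i * k.
    n_new = [f_gate * n[c] + i_gate * k[c] for c in range(d)]

    # Read-out: h = o * (C q) / max(|n . q|, 1). The lower bound of 1 follows
    # the paper's equation; implementations may rescale it by the stabilizer.
    num = [sum(C_new[r][c] * q[c] for c in range(d)) for r in range(d)]
    den = max(abs(sum(n_new[c] * q[c] for c in range(d))), 1.0)
    h = [o_gate[r] * num[r] / den for r in range(d)]
    return C_new, n_new, m_new, h
```

Calling the step repeatedly while carrying `(C, n, m)` forward processes a sequence with a fixed-size state, so the work per frame does not grow with the number of frames already seen.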

Methodology and Key Findings

The paper integrates xLSTM into the MP-SENet architecture by replacing its Conformer components with xLSTM blocks. The encoder-decoder structure is retained, along with a dual-path mechanism that enhances both the magnitude and the phase spectra of the speech signal. The system is evaluated on the VoiceBank+Demand dataset, a widely used benchmark in the speech enhancement community.
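The block structure described above can be sketched in a few lines of pure Python. This is only an illustration of the idea of running a recurrent layer in both directions along the time axis and then along the frequency axis; it is not the authors' implementation. The names `toy_recurrent_layer`, `bidirectional`, and `dual_path_block` are hypothetical, the exponential moving average is a stand-in for a real xLSTM layer, and averaging the two directions is an assumed simplification of whatever learned merge the model uses. Residual connections and normalization are omitted.

```python
from typing import Callable, List

Sequence = List[List[float]]  # [steps][channels]


def toy_recurrent_layer(seq: Sequence, decay: float = 0.5) -> Sequence:
    """Stand-in for an xLSTM layer: a causal exponential moving average."""
    state = [0.0] * len(seq[0])
    out = []
    for x in seq:
        state = [decay * s + (1.0 - decay) * xi for s, xi in zip(state, x)]
        out.append(state)
    return out


def bidirectional(layer: Callable[[Sequence], Sequence], seq: Sequence) -> Sequence:
    """Run `layer` forward and on the reversed sequence, then merge.

    Merging by averaging is an assumption made to keep the channel count fixed.
    """
    fwd = layer(seq)
    bwd = layer(seq[::-1])[::-1]  # flip, process, flip back
    return [[(a + b) / 2.0 for a, b in zip(f, g)] for f, g in zip(fwd, bwd)]


def dual_path_block(feats, layer=toy_recurrent_layer):
    """Apply the bidirectional layer along time, then along frequency.

    feats[t][f] is a channel vector for frame t and frequency bin f.
    """
    n_frames, n_bins = len(feats), len(feats[0])

    # Time path: one length-n_frames sequence per frequency bin.
    time_out = [[None] * n_bins for _ in range(n_frames)]
    for f in range(n_bins):
        seq = bidirectional(layer, [feats[t][f] for t in range(n_frames)])
        for t in range(n_frames):
            time_out[t][f] = seq[t]

    # Frequency path: one length-n_bins sequence per frame.
    return [bidirectional(layer, time_out[t]) for t in range(n_frames)]
```

Because the backward pass sees future frames, a block like this is non-causal, which is one reason latency matters when such models are considered for real-time use.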

The reported results show that xLSTM-SENet matches or outperforms state-of-the-art Mamba- and Conformer-based models on the VoiceBank+Demand dataset across metrics such as PESQ, CSIG, and COVL. Ablation studies attribute this performance to architectural design choices such as exponential gating and bidirectionality.

A comparison with the traditional LSTM shows that, although xLSTM adds features such as matrix memory, a carefully designed LSTM-based block can reach similar performance. This raises the question of which xLSTM additions are actually needed for a given application.

Practical Implications

From a practical perspective, xLSTM-SENet is a scalable and effective model for real-world applications that need bandwidth efficiency and low computational overhead, such as hearing aids and automatic speech recognition systems operating in noisy environments. The linear scalability and reduced memory requirements of xLSTM make it a candidate for deploying speech enhancement in resource-constrained settings.
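The scaling argument can be illustrated with a rough operation count. The two functions below are hypothetical helpers, not measurements from the paper: they count only the dominant multiply-accumulate terms, assuming a self-attention layer that forms a full score matrix and a recurrent layer that updates a dim-by-dim state once per step, and they ignore projections, heads, and constant factors.

```python
def attention_cost(seq_len: int, dim: int) -> int:
    """Rough multiply-accumulates for the score (Q K^T) and mixing (A V) steps."""
    return 2 * seq_len * seq_len * dim


def recurrent_cost(seq_len: int, dim: int) -> int:
    """Rough multiply-accumulates for one dim x dim state update per step."""
    return seq_len * dim * dim


if __name__ == "__main__":
    dim = 64  # assumed feature width, chosen only for illustration
    for seq_len in (250, 1000, 4000):
        print(seq_len, attention_cost(seq_len, dim), recurrent_cost(seq_len, dim))
```

Under these assumptions, doubling the sequence length quadruples the attention term but only doubles the recurrent term, which is what linear versus quadratic scaling means in practice for long recordings.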

Implications for Future Research

The introduction of xLSTM into speech enhancement opens several avenues for further inquiry. Potential research directions include extending xLSTM to multi-channel setups, exploring hybrid architectures that combine it with elements of attention-based models, and addressing latency for real-time applications. As the field moves towards increasingly data-driven models, efficient training methods for xLSTM architectures also remain an open problem.

The paper also presents xLSTM-SENet2, the authors' best xLSTM-based model, which outperforms state-of-the-art Mamba- and Conformer-based systems of similar complexity on the VoiceBank+Demand dataset. This supports xLSTM as a viable alternative to attention-based blocks in speech enhancement systems.
