TasNet: time-domain audio separation network for real-time, single-channel speech separation (1711.00541v2)

Published 1 Nov 2017 in cs.SD, cs.LG, cs.MM, cs.NE, and eess.AS

Abstract: Robust speech processing in multi-talker environments requires effective speech separation. Recent deep learning systems have made significant progress toward solving this problem, yet it remains challenging particularly in real-time, short latency applications. Most methods attempt to construct a mask for each source in time-frequency representation of the mixture signal which is not necessarily an optimal representation for speech separation. In addition, time-frequency decomposition results in inherent problems such as phase/magnitude decoupling and long time window which is required to achieve sufficient frequency resolution. We propose Time-domain Audio Separation Network (TasNet) to overcome these limitations. We directly model the signal in the time-domain using an encoder-decoder framework and perform the source separation on nonnegative encoder outputs. This method removes the frequency decomposition step and reduces the separation problem to estimation of source masks on encoder outputs which is then synthesized by the decoder. Our system outperforms the current state-of-the-art causal and noncausal speech separation algorithms, reduces the computational cost of speech separation, and significantly reduces the minimum required latency of the output. This makes TasNet suitable for applications where low-power, real-time implementation is desirable such as in hearable and telecommunication devices.

Authors (2)
  1. Yi Luo (153 papers)
  2. Nima Mesgarani (45 papers)
Citations (604)

Summary

  • The paper introduces a novel time-domain method that bypasses traditional STFT, effectively eliminating phase modification issues.
  • It employs an encoder-decoder architecture with nonnegative weight estimation to enhance SI-SNR and SDR in speech separation.
  • Real-time performance is achieved with a latency as low as 5.23 ms, making TasNet ideal for latency-sensitive applications.

Time-domain Audio Separation Network (TasNet) for Real-time Speech Separation

The paper "TasNet: time-domain audio separation network for real-time, single-channel speech separation" by Yi Luo and Nima Mesgarani presents an innovative approach to speech separation in multi-talker environments using TasNet, a time-domain audio separation network. Traditional methods relying on time-frequency (T-F) representations, such as Short-Time Fourier Transform (STFT), encounter limitations, particularly in dealing with phase information and latency requirements. TasNet addresses these issues by directly processing the raw waveform, eliminating the need for frequency decomposition.

Key Contributions

The authors propose an encoder-decoder framework to separate speech directly in the time domain. The method involves:

  1. Time-domain Modeling: Unlike conventional techniques, TasNet models the audio signal without converting it into frequency components, thereby avoiding phase modification issues.
  2. Encoder-Decoder Architecture: It utilizes an encoder to convert the audio waveform into a weighted representation using a set of learned basis signals and a decoder to reconstruct the separated sources from these weights.
  3. Nonnegative Weights: Source separation is achieved by estimating nonnegative weights that act as masks for the audio mixture, reminiscent of T-F mask generation in STFT-based systems.
  4. Low Latency Implementation: By reducing the segment length to as little as 5 ms, TasNet achieves real-time performance, making it suitable for applications like telecommunications and hearables.
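The encoder-mask-decoder pipeline above can be sketched numerically. The snippet below is a minimal illustration, not the paper's implementation: the dimensions, the ReLU encoder (the paper uses a gated variant), and the random stand-in for the LSTM separator are all assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not the paper's exact configuration.
L = 40    # segment length in samples (~5 ms at 8 kHz)
N = 256   # number of learned basis signals
C = 2     # number of speakers

# Encoder: nonnegative mixture weights (paper uses a gated 1-D conv; ReLU here).
U = rng.standard_normal((N, L))          # learned analysis basis (random stand-in)
x = rng.standard_normal(L)               # one mixture segment
w = np.maximum(x @ U.T, 0.0)             # nonnegative encoder output, shape (N,)

# Separation: one mask per source over the encoder weights. A softmax across
# sources (random logits stand in for the LSTM separator) keeps masks in [0, 1].
logits = rng.standard_normal((C, N))
masks = np.exp(logits) / np.exp(logits).sum(axis=0)
d = masks * w                            # per-source weights, shape (C, N)

# Decoder: each source segment is a weighted sum of learned synthesis bases.
V = rng.standard_normal((N, L))          # learned synthesis basis (random stand-in)
s_hat = d @ V                            # shape (C, L): one waveform segment per source

# Because the masks sum to one, the per-source weights sum back to the mixture's.
assert np.allclose(d.sum(axis=0), w)
```

The key point the sketch captures is that separation happens entirely on the nonnegative encoder outputs, so no STFT phase ever needs to be estimated or modified.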

Strong Numerical Results

Experiments conducted on the WSJ0-2mix dataset demonstrate TasNet’s superiority. The authors report:

  • A marked improvement in SI-SNR and SDR for both causal and noncausal configurations, outperforming state-of-the-art STFT-based methods.
  • Causal TasNet achieves a total latency of 5.23 ms, significantly lower than the >32 ms latency seen in STFT approaches.

Implications and Future Directions

TasNet carries significant implications for low-latency, high-performance speech separation. By eliminating the T-F transformation step, it opens new possibilities for real-time applications in resource-constrained environments.

Future research may explore the theoretical underpinnings of TasNet's ability to learn time-domain bases. Additionally, extending the framework to handle varying channel conditions and integrating the system with other signal enhancement technologies could further broaden its applicability.

TasNet represents a pivotal step in audio processing, encouraging a shift towards more direct waveform-based methodologies in speech separation and beyond.