Overview of Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation
The paper presents Conv-TasNet, a fully convolutional time-domain audio separation network for single-channel, speaker-independent speech separation. This work addresses fundamental limitations of conventional time-frequency (T-F) domain methods by bypassing T-F representations entirely.
Key Contributions
The paper outlines several pivotal contributions made by Conv-TasNet:
- Time-Domain Separation: Traditional methods rely on T-F transformations such as the STFT, which decouple magnitude and phase; most masking approaches modify only the magnitude and reuse the mixture phase for reconstruction, which bounds the achievable accuracy. Conv-TasNet operates directly on the waveform, avoiding these transformations and their drawbacks.
- Fully-Convolutional Network: Conv-TasNet introduces a fully-convolutional network model employing Temporal Convolutional Networks (TCNs) with dilated convolutions, enabling the capture of long-term dependencies in the speech signal while maintaining a manageable model size.
- Model Efficiency and Latency: The proposed model has a much smaller size and a shorter minimum latency than STFT-based systems because it operates on very short waveform segments, making it suitable for low-resource and real-time applications.
- Surpassing Ideal Masks: Conv-TasNet surpasses even ideal T-F magnitude masks, such as the IBM, IRM, and WFM, in both objective distortion measures (SI-SNRi, SDRi) and subjective quality assessments.
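For context, the ideal masks used as oracle baselines are simple functions of the clean sources' STFT magnitudes. The sketch below computes them with NumPy; the variable names (`mag_s1`, `mag_s2` for the two speakers' magnitude spectrograms) are illustrative, and the formulas follow the standard binary, ratio, and Wiener-like definitions, which may differ in minor details from the paper's exact formulation.

```python
import numpy as np

def ideal_masks(mag_s1, mag_s2, eps=1e-8):
    """Oracle T-F masks for speaker 1, given magnitude STFTs of both clean sources.

    mag_s1, mag_s2: arrays of shape (freq_bins, frames), illustrative names.
    Returns the ideal binary mask (IBM), ideal ratio mask (IRM),
    and Wiener-filter-like mask (WFM) under the standard definitions.
    """
    ibm = (mag_s1 > mag_s2).astype(np.float32)           # winner-take-all per T-F bin
    irm = mag_s1 / (mag_s1 + mag_s2 + eps)               # magnitude ratio
    wfm = mag_s1**2 / (mag_s1**2 + mag_s2**2 + eps)      # power (Wiener-like) ratio
    return ibm, irm, wfm
```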
Detailed Analysis
Encoder-Decoder Architecture
Conv-TasNet employs an encoder that maps short, overlapping segments of the raw waveform to a high-dimensional representation optimized for separation. The separation module then applies masks estimated by the TCN to this representation to isolate each speaker. Finally, a decoder reconstructs the individual waveforms from the masked features. This three-stage pipeline leverages an overcomplete, learned representation of the input for superior separation performance.
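A minimal PyTorch-style sketch of this encoder/mask/decoder pipeline is given below. The layer sizes (512 filters over 16-sample windows with 50% overlap) follow the paper's best-performing configuration, but the separator is left as a placeholder for the TCN described in the next subsection, so this is an architectural outline rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class TasNetSkeleton(nn.Module):
    """Structural sketch of Conv-TasNet's encoder -> mask -> decoder pipeline.

    N = number of encoder basis filters, L = filter length in samples,
    C = number of speakers. The TCN mask estimator is passed in as `separator`.
    """
    def __init__(self, separator: nn.Module, N=512, L=16, C=2):
        super().__init__()
        self.C = C
        # Encoder: 1-D conv over the raw waveform; stride L//2 gives 50% overlap.
        self.encoder = nn.Conv1d(1, N, kernel_size=L, stride=L // 2, bias=False)
        # Separator: estimates C masks over the encoder output (e.g., a TCN).
        self.separator = separator
        # Decoder: transposed conv maps masked features back to waveforms.
        self.decoder = nn.ConvTranspose1d(N, 1, kernel_size=L, stride=L // 2, bias=False)

    def forward(self, mixture):                      # mixture: (batch, 1, samples)
        w = torch.relu(self.encoder(mixture))        # (batch, N, frames)
        masks = self.separator(w)                    # (batch, C, N, frames), values in [0, 1]
        sources = masks * w.unsqueeze(1)             # apply one mask per speaker
        return torch.stack(
            [self.decoder(sources[:, c]) for c in range(self.C)], dim=1
        )                                            # (batch, C, 1, ~samples)
```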
Temporal Convolutional Networks
The core of Conv-TasNet’s separation module is a TCN built from stacked 1-D dilated convolutional blocks. This design replaces the deep LSTM networks used in the original TasNet and addresses their main shortcomings:
- Temporal Dependency: TCNs capture extensive temporal contexts via dilation.
- Scalability and Generalizability: Convolutional structures enhance parallel processing capabilities and reduce model complexity.
- Efficient Computation: Depthwise separable convolutions further reduce the number of parameters and the computational cost, which is crucial for deployment in wearable and real-time systems (a simplified block is sketched below).
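The following is a minimal sketch of one dilated, depthwise-separable convolutional block in this spirit (PyTorch). It is simplified: the paper's block also includes normalization layers and separate skip-connection outputs, which are omitted here; the channel sizes (128 bottleneck, 512 hidden, kernel size 3) mirror the paper's reported configuration.

```python
import torch.nn as nn

class DilatedDSConvBlock(nn.Module):
    """Simplified 1-D dilated depthwise-separable conv block, in the spirit of
    Conv-TasNet's TCN layers (normalization and skip paths omitted for brevity)."""
    def __init__(self, channels=128, hidden=512, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2        # "same" padding for odd kernels
        self.net = nn.Sequential(
            nn.Conv1d(channels, hidden, 1),            # 1x1 conv expands/mixes channels
            nn.PReLU(),
            nn.Conv1d(hidden, hidden, kernel_size,     # depthwise: one filter per channel
                      dilation=dilation, padding=pad, groups=hidden),
            nn.PReLU(),
            nn.Conv1d(hidden, channels, 1),            # pointwise: project back
        )

    def forward(self, x):                              # x: (batch, channels, frames)
        return x + self.net(x)                         # residual connection

# Stacking blocks with dilations 1, 2, 4, ... grows the receptive field exponentially,
# which is how the TCN captures long-range temporal context.
```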
Experimental Validation
Performance Metrics
Extensive experiments on standard datasets (WSJ0-2mix and WSJ0-3mix) were conducted, with assessments based on both SI-SNRi and SDRi metrics. The results are compelling:
- Conv-TasNet achieved SI-SNRi of 15.3 dB and SDRi of 15.6 dB on WSJ0-2mix, surpassing existing methods and ideal T-F mask baselines.
- For three-speaker separation (WSJ0-3mix), it maintained significant performance margins over competing STFT-based methods.
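SI-SNR (whose improvement over the unprocessed mixture gives SI-SNRi) is computed by projecting the estimate onto the mean-removed target and comparing the target component to the residual. A minimal NumPy sketch of the standard definition follows.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between an estimated and a reference waveform."""
    estimate = estimate - np.mean(estimate)
    target = target - np.mean(target)
    # Project the estimate onto the target to obtain the scaled target component.
    s_target = (np.dot(estimate, target) / (np.dot(target, target) + eps)) * target
    e_noise = estimate - s_target
    return 10 * np.log10(np.sum(s_target**2) / (np.sum(e_noise**2) + eps))

# SI-SNRi = si_snr(separated, clean) - si_snr(mixture, clean)
```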
Subjective and Objective Quality
Beyond distortion metrics, subjective listening tests gave Conv-TasNet higher mean opinion scores (MOS) than the IRM. Notably, PESQ rated the IRM higher, suggesting that this objective quality proxy underestimates the perceptual quality of the time-domain outputs.
Implications and Future Work
Practical Implications
The advances introduced by Conv-TasNet make it a highly promising solution for real-world speech processing applications:
- Offline and Real-Time Processing: Both non-causal (offline) and causal (low-latency) configurations are supported, and the small model size makes deployment on embedded and wearable devices feasible, benefiting hearing aids and telecommunication systems.
- Versatility: Scalability to multiple speakers and consistent performance regardless of where separation starts within the input make it a practical tool for diverse acoustic environments.
Theoretical Contributions
The success of the fully-convolutional, time-domain approach invites further exploration in:
- Optimized Representations: Many of the basis functions learned by Conv-TasNet resemble properties of auditory filterbanks, such as an emphasis on lower frequencies, hinting at possible bio-mimetic designs.
- Sparse Coding and Overcompleteness: The emphasis on overcomplete representations parallels sparse coding strategies, opening avenues for deep integration with these paradigms.
Future Directions
Future research may focus on:
- Multichannel Input: Extending Conv-TasNet to leverage multichannel inputs could further enhance its robustness in challenging acoustic environments.
- Long-Term Speaker Tracking: Incorporating mechanisms to handle long pauses and speaker variability could improve long-term tracking of individual speakers.
- Adaptive Noise Handling: Investigating the network’s performance under varying noise conditions and improving its adaptability to different noise profiles remain valuable pursuits.
Conclusion
Conv-TasNet marks a significant step forward in speech separation, offering an efficient, scalable, and higher-performing alternative to T-F domain methods. Its use of time-domain convolutions and its strong results point toward a new direction in the design of speech separation systems, and the insights drawn from its architecture and performance are likely to influence future research and development in both academia and industry.